Ch. 8: Web Crawling

By Filippo Menczer, Indiana University School of Informatics

in Web Data Mining by Bing Liu Springer, 2007

Slides © 2007 Filippo Menczer, Indiana University School of Informatics. Bing Liu: Web Data Mining. Springer, 2007

Ch. 8 Web Crawling by Filippo Menczer

Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated crawlers

Q: How does a search engine know that all these pages contain the query terms?

A: Because all of those pages have been crawled

Crawler: basic idea

[Figure; label: starting pages (seeds)]

Many names

• Crawler
• Spider
• Robot (or bot)
• Web agent
• Wanderer, worm, …
• And famous instances: googlebot, scooter, slurp, msnbot, …

Googlebot & you

Motivation for crawlers
• Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?…

A crawler within a search engine

[Figure: search engine architecture. Labels: Web, googlebot, Page repository, Text & link analysis, Text index, PageRank, Ranker, Query, hits]

One taxonomy of crawlers

Crawlers
  – Universal crawlers
  – Preferential crawlers
    – Focused crawlers
    – Topical crawlers
      – Static crawlers: Best-first, PageRank, etc.
      – Adaptive topical crawlers: Evolutionary crawlers, Reinforcement learning crawlers, etc.

• Many other criteria could be used:
  – Incremental, Interactive, Concurrent, etc.

Basic crawlers

• This is a sequential crawler

• Seeds can be any list of starting URLs

• Order of page visits is determined by frontier data structure

• Stop criterion can be anything

Graph traversal (BFS or DFS?)
• Breadth First Search
  – Implemented with QUEUE (FIFO)
  – Finds pages along shortest paths
  – If we start with “good” pages, this keeps us close; maybe other good stuff…
• Depth First Search
  – Implemented with STACK (LIFO)
  – Wander away (“lost in cyberspace”)
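
The only difference between the two traversals is the end of the frontier from which the next page is taken. A minimal Python sketch, assuming a toy in-memory link graph (`LINKS` and `traverse` are hypothetical names, not from the slides):

```python
from collections import deque

# Hypothetical toy link graph standing in for the Web.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

def traverse(seed, breadth_first=True):
    """Return pages in visit order; the frontier discipline decides BFS vs. DFS."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

With the queue, pages one link from the seed are visited before pages two links away; with the stack, the crawl follows one branch to its end first.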

A basic crawler in Perl
• Queue: a FIFO list (shift and push)

    my @frontier = read_seeds($file);
    my $tot = 0;                                   # pages fetched so far
    while (@frontier && $tot < $max) {
        my $next_link = shift @frontier;           # dequeue the oldest URL (FIFO)
        my $page = fetch($next_link);
        $tot++;
        add_to_index($page);
        my @links = extract_links($page, $next_link);
        push @frontier, process(@links);           # enqueue the newly found URLs
    }

Implementation issues
• Don’t want to fetch same page twice!
  – Keep lookup table (hash) of visited pages
  – What if not visited but in frontier already?
• The frontier grows very fast!
  – May need to prioritize for large crawls
• Fetcher must be robust!
  – Don’t crash if download fails
  – Timeout mechanism
• Determine file type to skip unwanted files
  – Can try using extensions, but not reliable
  – Can issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests

More implementation issues

• Fetching
  – Get only the first 10-100 KB per page
  – Take care to detect and break redirection loops
  – Soft fail for timeout, server not responding, file not found, and other errors
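
One way to break redirection loops is to remember the URLs already seen in the current redirect chain and give up after a fixed number of hops. The sketch below assumes a hypothetical in-memory redirect table in place of real HTTP 3xx responses; `REDIRECTS`, `MAX_HOPS`, `resolve` and `truncate` are illustrative names.

```python
# Hypothetical redirect table standing in for HTTP 3xx responses.
REDIRECTS = {
    "http://example.org/a": "http://example.org/b",
    "http://example.org/b": "http://example.org/a",    # a two-page loop
    "http://example.org/old": "http://example.org/new",
}
MAX_HOPS = 5               # assumed limit, not from the slides
MAX_BYTES = 100 * 1024     # upper end of the 10-100 KB range

def resolve(url):
    """Follow redirects; return the final URL, or None (soft fail) on a loop."""
    seen = set()
    while url in REDIRECTS:
        if url in seen or len(seen) >= MAX_HOPS:
            return None
        seen.add(url)
        url = REDIRECTS[url]
    return url

def truncate(body):
    """Keep only the first MAX_BYTES of a downloaded page."""
    return body[:MAX_BYTES]
```

Returning `None` instead of raising lets the crawl loop skip the URL and move on, which is the soft-fail behavior the slide asks for.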

More implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
  – E.g. tidy.sourceforge.net
• Must pay attention to HTML entities and Unicode in text
• What to do with a growing number of other formats?
  – Flash, SVG, RSS, AJAX…
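
As an illustration of forgiving parsing, Python's standard `html.parser` module tolerates unclosed tags and unquoted attributes and decodes entities inside attribute values. This is a sketch of link extraction only, not the tidy tool mentioned above; `LinkExtractor` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```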

More implementation issues
• Stop words
  – Noise words that do not carry meaning should be eliminated (“stopped”) before they are indexed
  – E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
  – Typically syntactic markers
  – Typically the most common terms
  – Typically kept in a negative dictionary
    • 10–1,000 elements
    • E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
  – Parser can detect these right away and disregard them
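
A negative dictionary can be sketched as a set that the tokenizer consults before a term reaches the index. The list below holds only the seven example words from the slide; a real list would be larger, and the tokenizing regex is an assumption.

```python
import re

# Only the example words from the slide; a real negative dictionary is larger.
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for"}

def index_terms(text):
    """Lower-case, tokenize, and drop stop words before indexing."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]
```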

More implementation issues
Conflation and thesauri
• Idea: improve recall by merging words with same meaning
1. We want to ignore superficial morphological features, thus merge semantically similar tokens
  – {student, study, studying, studious} => studi
2. We can also conflate synonyms into a single form using a thesaurus
  – 30-50% smaller index
  – Doing this in both pages and queries allows retrieving pages about ‘automobile’ when the user asks for ‘car’
  – Thesaurus can be implemented as a hash table
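
A thesaurus as a hash table can be sketched with a plain dictionary that maps each synonym to one preferred form. The entries below are made-up examples, and `conflate` is a hypothetical helper applied to page terms and query terms alike.

```python
# Hypothetical thesaurus entries: synonym -> preferred form.
THESAURUS = {
    "car": "automobile",
    "cars": "automobile",
    "auto": "automobile",
}

def conflate(tokens):
    """Replace each token by its preferred form; unknown tokens pass through."""
    return [THESAURUS.get(token, token) for token in tokens]

page_terms = conflate(["used", "automobile", "dealer"])
query_terms = conflate(["car"])
```

Because both sides go through the same table, the query term ‘car’ becomes ‘automobile’ and matches the page.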

More implementation issues
• Stemming
  – Morphological conflation based on rewrite rules
  – Language dependent!
  – Porter stemmer very popular for English
    • http://www.tartarus.org/~martin/PorterStemmer/
    • Context-sensitive grammar rules, e.g.:
      – “IES” except (“EIES” or “AIES”) --> “Y”
    • Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
  – Porter has also developed Snowball, a language to create stemming algorithms in any language
    • http://snowball.tartarus.org/
    • Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
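
To make the idea of a context-sensitive rewrite rule concrete, here is a sketch of the single “IES” rule quoted above. It is not the Porter algorithm, only one rule applied in isolation; `ies_rule` is a hypothetical name.

```python
def ies_rule(word):
    """Rewrite a trailing 'ies' to 'y' unless the word ends in 'eies' or 'aies'."""
    lowered = word.lower()
    if lowered.endswith("ies") and not lowered.endswith(("eies", "aies")):
        return lowered[:-3] + "y"
    return lowered
```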

More implementation issues
• Static vs. dynamic pages
  – Is it worth trying to eliminate dynamic pages and only index static pages?
  – Examples:
    • http://www.census.gov/cgi-bin/gazetteer
    • http://informatics.indiana.edu/research/colloquia.asp
    • http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452
    • http://www.imdb.com/Name?Menczer,+Erico
    • http://www.imdb.com/name/nm0578801/
  – Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?
  – What do Google and other search engines do?

More implementation issues
• Relative vs. Absolute URLs
  – Crawler must translate relative URLs into absolute URLs
  – Need to obtain Base URL from HTTP header, or HTML Meta tag, or else current page path by default
  – Examples
    • Base: http://www.cnn.com/linkto/
    • Relative URL: intl.html
    • Absolute URL: http://www.cnn.com/linkto/intl.html
    • Relative URL: /US/
    • Absolute URL: http://www.cnn.com/US/
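
The two examples above can be reproduced with `urljoin` from Python's standard library, shown here as one possible way to do the translation; `absolutize` is a hypothetical wrapper name.

```python
from urllib.parse import urljoin

BASE = "http://www.cnn.com/linkto/"

def absolutize(base, links):
    """Resolve each extracted link against the page's base URL."""
    return [urljoin(base, link) for link in links]

absolute = absolutize(BASE, ["intl.html", "/US/"])
```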

More implementation issues
• URL canonicalization
  – All of these:
    • http://www.cnn.com/TECH
    • http://WWW.CNN.COM/TECH/
    • http://www.cnn.com:80/TECH/
    • http://www.cnn.com/bogus/../TECH/
  – Are really equivalent to this canonical form:
    • http://www.cnn.com/TECH/
  – In order to avoid duplication, the crawler must transform all URLs into canonical form
  – Definition of “canonical” is arbitrary, e.g.:
    • Could always include port
    • Or only include port when not default :80

More on Canonical URLs
• Some transformations are trivial, for example:
  – http://informatics.indiana.edu --> http://informatics.indiana.edu/
  – http://informatics.indiana.edu/index.html#fragment --> http://informatics.indiana.edu/index.html
  – http://informatics.indiana.edu/dir1/./../dir2/ --> http://informatics.indiana.edu/dir2/
  – http://informatics.indiana.edu/%7Efil/ --> http://informatics.indiana.edu/~fil/
  – http://INFORMATICS.INDIANA.EDU/fil/ --> http://informatics.indiana.edu/fil/
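
The trivial transformations can be sketched in a few lines of Python. This is one possible choice of canonical form (port dropped when it is the scheme default, fragment removed, user information ignored); it does not attempt the trailing-slash or default-file-name heuristics. All function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def decode_unreserved(path):
    """Turn escapes such as %7E into the plain character when that is safe."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)

def remove_dot_segments(path):
    """Resolve '.' and '..' path segments, keeping a trailing slash."""
    out = []
    segments = path.split("/")
    for i, segment in enumerate(segments):
        if segment in (".", ".."):
            if segment == ".." and len(out) > 1:
                out.pop()
            if i == len(segments) - 1:
                out.append("")      # '/a/.' and '/a/..' end in a slash
        else:
            out.append(segment)
    return "/".join(out)

def canonicalize(url):
    parts = urlsplit(url)           # scheme and hostname come back lower-cased
    host = parts.hostname or ""
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(parts.scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = remove_dot_segments(decode_unreserved(parts.path)) or "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))
```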

More on Canonical URLs
Other transformations require heuristic assumptions about the intentions of the author or configuration of the Web server:
1. Removing default file name
   http://informatics.indiana.edu/fil/index.html --> http://informatics.indiana.edu/fil/
  – This is reasonable in general but would be wrong in this case because the default happens to be ‘default.asp’ instead of ‘index.html’
2. Trailing directory
   http://informatics.indiana.edu/fil --> http://informatics.indiana.edu/fil/
  – This is correct in this case but how can we be sure in general that there isn’t a file named ‘fil’ in the root dir?

More implementation issues
• Spider traps
  – Misleading sites: indefinite number of pages dynamically generated by CGI scripts
  – Paths of arbitrary depth created using soft directory links and path rewriting features in HTTP server
  – Only heuristic defensive measures:
    • Check URL length; assume spider trap above some threshold, for example 128 characters
    • Watch for sites with very large number of URLs
    • Eliminate URLs with non-textual data types
    • May disable crawling of dynamic pages, if can detect
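
The first two heuristics can be sketched as a filter applied to each URL before it enters the frontier. The 128-character limit comes from the slide; the per-host limit of 1000 is an assumed value, and `TrapFilter` is a hypothetical class.

```python
from collections import Counter
from urllib.parse import urlsplit

class TrapFilter:
    """Reject URLs that look like they come from a spider trap."""

    def __init__(self, max_url_length=128, max_urls_per_host=1000):
        self.max_url_length = max_url_length
        self.max_urls_per_host = max_urls_per_host   # assumed threshold
        self.per_host = Counter()

    def accept(self, url):
        if len(url) > self.max_url_length:
            return False                 # suspiciously long URL
        host = urlsplit(url).hostname or ""
        if self.per_host[host] >= self.max_urls_per_host:
            return False                 # this site has produced too many URLs
        self.per_host[host] += 1
        return True
```

Both checks are heuristics: they can reject legitimate pages, so the thresholds need tuning for a given crawl.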

More implementation issues
• Page repository
  – Naïve: store each page as a separate file
    • Can map URL to unique filename using a hashing function, e.g. MD5
    • This generates a huge number of files, which is inefficient from the storage perspective
  – Better: combine many pages into a single large file, using some XML markup to separate and identify them
    • Must map URL to {filename, page_id}
  – Database options
    • Any RDBMS -- large overhead
    • Light-weight, embedded databases such as Berkeley DB
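
For the naïve one-file-per-page scheme, the URL-to-filename mapping can be sketched with `hashlib`. The `.html` suffix and the name `page_filename` are assumptions for illustration.

```python
import hashlib

def page_filename(url):
    """Map a (canonical) URL to a fixed-length, filesystem-safe file name."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + ".html"
```

The digest is always 32 hexadecimal characters, so the file name length does not depend on the URL length.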

Concurrency

• A crawler incurs several delays:
  – Resolving the host name in the URL to an IP address using DNS
  – Connecting a socket to the server and sending the request
  – Receiving the requested page in response
• Solution: Overlap the above delays by fetching many pages concurrently

Architecture of a concurrent crawler

Concurrent crawlers
• Can use multi-processing or multi-threading
• Each process or thread works like a sequential crawler, except they share data structures: frontier and repository
• Shared data structures must be synchronized (locked for concurrent writes)
• Speedups by a factor of 5-10 are easy to obtain this way
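
A multi-threaded version can be sketched with a thread-safe queue as the shared frontier and a lock around the shared set of seen URLs and the repository. `FAKE_WEB` is a hypothetical in-memory stand-in for the network, and the worker count is arbitrary.

```python
import queue
import threading

# Hypothetical in-memory "Web": URL -> outgoing links.
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

def fetch(url):
    """Stand-in for a real download plus link extraction."""
    return FAKE_WEB.get(url, [])

def crawl(seeds, num_workers=4):
    frontier = queue.Queue()            # thread-safe shared frontier
    seen = set(seeds)
    repository = {}
    lock = threading.Lock()             # guards seen and repository
    for seed in seen:
        frontier.put(seed)

    def worker():
        while True:
            url = frontier.get()
            if url is None:             # sentinel: no more work
                frontier.task_done()
                break
            links = fetch(url)          # slow part, done outside the lock
            with lock:
                repository[url] = links
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.put(link)
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    frontier.join()                     # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)
    for thread in threads:
        thread.join()
    return repository
```

New links are queued before the parent URL is marked done, so `frontier.join()` cannot return while work is still pending.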

Universal crawlers

• Support universal search engines
• Large-scale
• Huge cost (network bandwidth) of crawl is amortized over many queries from users
• Incremental updates to existing index and other data repositories

Large-scale universal crawlers

• Two major issues:
1. Performance
  • Need to scale up to billions of pages
2. Policy
  • Need to trade off coverage, freshness, and bias (e.g. toward “important” pages)

Large-scale crawlers: scalability
• Need to minimize overhead of DNS lookups
• Need to optimize utilization of network bandwidth and disk throughput (I/O is bottleneck)
• Use asynchronous sockets
  – Multi-processing or multi-threading do not scale up to billions of pages
  – Non-blocking: hundreds of network connections open simultaneously
  – Polling socket to monitor completion of network transfers
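
In Python, the non-blocking style can be sketched with `asyncio`, where a single thread keeps many transfers in flight and the event loop does the polling. The `fetch` coroutine below only simulates a transfer with a short sleep, and the connection limit of 100 is an assumed value.

```python
import asyncio

async def fetch(url):
    """Simulated non-blocking transfer; a real crawler would use async sockets."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def crawl(urls, max_connections=100):
    # Cap the number of simultaneously open connections.
    gate = asyncio.Semaphore(max_connections)

    async def bounded(url):
        async with gate:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)

pages = asyncio.run(crawl(["http://example.org/1", "http://example.org/2"]))
```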

High-level architecture of a scalable universal crawler

[Figure annotations:]
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines

Universal crawlers: Policy
• Coverage
  – New pages get added all the time
  – Can the crawler find every page?
• Freshness
  – Pages change over time, get removed, etc.
  – How frequently can a crawler revisit?
• Trade-off!
  – Focus on most “important” pages (crawler bias)?
  – “Importance” is subjective

Maintaining a “fresh” collection

• Universal crawlers are never “done”
• High variance in rate and amount of page changes
• HTTP headers are notoriously unreliable
  – Last-modified
  – Expires
• Solution
  – Estimate the probability that a previously visited page has changed in the meantime
  – Prioritize by this probability estimate
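
The slide does not say how the probability is estimated. One common modeling assumption, used here purely for illustration, is that a page changes as a Poisson process with rate λ, so the probability of at least one change after t days is 1 − exp(−λt). The function names and sample rates below are hypothetical.

```python
import math

def change_probability(rate_per_day, days_since_visit):
    """P(at least one change) under an assumed Poisson change model."""
    return 1.0 - math.exp(-rate_per_day * days_since_visit)

def revisit_order(pages):
    """Sort (url, rate_per_day, days_since_visit) tuples, most likely changed first."""
    return sorted(
        pages,
        key=lambda page: change_probability(page[1], page[2]),
        reverse=True,
    )
```

A page that changes rarely but was last visited long ago can outrank a volatile page that was just refreshed.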

Estimating page change rates

• Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
  – Brewington & Cybenko; Cho, Garcia-Molina & Page
• Assumption: recent past predicts the future (Ntoulas, Cho & Olston 2004)
  – Frequency of change not a good predictor
  – Degree of change is a better predictor

Do we need to crawl the entire Web?

• If we cover too much, it will get stale
• There is an abundance of pages in the Web
• For PageRank, pages with very low prestige are largely useless
• What is the goal?
  – General search engines: pages with high prestige
  – News portals: pages that change often
  – Vertical portals: pages on some topic
• What are appropriate priority measures in these cases? Approximations?

Breadth-first crawlers

• BF crawler tends to crawl high-PageRank pages very early

• Therefore, BF crawler is a good baseline to gauge other crawlers

• But why is this so?

Najork and Wiener 2001

Bias of breadth-first crawlers
• The structure of the Web graph is very different from a random network
• Power-law distribution of in-degree
• Therefore there are hub pages with very high PR and many incoming links
• These are attractors: you cannot avoid them!

Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
  – Precision ~ |{p: crawled(p) & I(p) > threshold}| / |{p: crawled(p)}|
  – Recall ~ |{p: crawled(p) & I(p) > threshold}| / |{p: I(p) > threshold}|
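
A priority-queue frontier and the two figures of merit can be sketched with Python's `heapq`. The importance scores in the usage lines are made up, and `PriorityFrontier` and `precision_recall` are hypothetical names.

```python
import heapq

class PriorityFrontier:
    """Frontier that always yields the URL with the largest I(p)."""

    def __init__(self):
        self._heap = []
        self._count = 0     # tie-breaker that keeps insertion order

    def push(self, url, importance):
        # heapq is a min-heap, so store the negated importance.
        heapq.heappush(self._heap, (-importance, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

def precision_recall(crawled, importance, threshold):
    """Compute the two ratios from the slide for a finished crawl."""
    good = {p for p, score in importance.items() if score > threshold}
    hits = good & set(crawled)
    precision = len(hits) / len(crawled) if crawled else 0.0
    recall = len(hits) / len(good) if good else 0.0
    return precision, recall

# Made-up importance estimates for four pages.
importance = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.8}
frontier = PriorityFrontier()
for url, score in importance.items():
    frontier.push(url, score)
crawled = [frontier.pop(), frontier.pop()]
```

Stopping after two pages yields the two most important ones, so precision is 1 while recall is 2/3 for a threshold of 0.5.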

Preferential crawlers
• Selective bias toward some pages, e.g. most “relevant”/topical, closest to seeds, most popular/largest PageRank, unknown servers, highest rate/amount of change, etc…
• Focused crawlers
  – Supervised learning: classifier based on labeled examples
• Topical crawlers
  – Best-first search based on similarity(topic, parent)
  – Adaptive crawlers
    • Reinforcement learning
    • Evolutionary algorithms/artificial life

Preferential crawling algorithms: Examples

• Breadth-First
  – Exhaustively visit all links in order encountered
• Best-N-First
  – Priority queue sorted by similarity, explore top N at a time
  – Variants: DOM context, hub scores
• PageRank
  – Priority queue sorted by keywords, PageRank
• SharkSearch
  – Priority queue sorted by combination of similarity, anchor text, similarity of parent, etc. (powerful cousin of FishSearch)
• InfoSpiders
  – Adaptive distributed algorithm using an evolving population of learning agents

Topical crawlers
• All we have is a topic (query, description, keywords) and a set of seed pages (not necessarily relevant)
• No labeled examples
• Must predict relevance of unvisited links to prioritize
• Original idea: Menczer 1997, Menczer & Belew 1998
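
A simple way to predict the relevance of an unvisited link is to give it the similarity between the topic and the page it was found on. The sketch below uses cosine similarity over raw term counts, which is one common choice rather than the specific measure of any crawler named in these slides; all names are hypothetical.

```python
import math
import re
from collections import Counter

def vector(text):
    """Bag-of-words term counts for a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def score_links(topic, parent_text, links):
    """Assign every link on a page the page's similarity to the topic."""
    score = cosine(vector(topic), vector(parent_text))
    return [(score, link) for link in links]
```

The scored links can then be pushed onto a priority-queue frontier so that links from topical pages are followed first.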

Evaluation of topical crawlers

• Goal: build “better” crawlers to support applications (Srinivasan et al. 2005)
• Build an unbiased evaluation framework
  – Define common tasks of measurable difficulty
  – Identify topics, relevant targets
  – Identify appropriate performance measures
    • Effectiveness: quality of crawler pages, order, etc.
    • Efficiency: separate CPU & memory of crawler algorithms from bandwidth & common utilities

Crawler ethics and conflicts
• Crawlers can cause trouble, even unintentionally, if not properly designed to be “polite” and “ethical”
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  – Server administrator and users will be upset
  – Crawler developer/admin IP address may be blacklisted

Crawler etiquette (important!)

• Identify yourself
  – Use ‘User-Agent’ HTTP header to identify crawler, website with description of crawler and contact information for crawler developer
  – Use ‘From’ HTTP header to specify crawler developer email
  – Do not disguise crawler as a browser by using a browser’s ‘User-Agent’ string
• Always check that HTTP requests are successful, and in case of error, use HTTP error code to determine and immediately address problem
• Pay attention to anything that may lead to too many requests to any one server, even unintentionally, e.g.:
  – redirection loops
  – spider traps
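
Setting the two identification headers can be sketched with `urllib.request.Request` from the Python standard library. The crawler name, description URL, and email address are placeholders, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Placeholder identity; replace with the real crawler name, page, and contact.
USER_AGENT = "ExampleCrawler/0.1 (+http://crawler.example.org/about.html)"
CONTACT = "crawler-admin@example.org"

def polite_request(url):
    """Build a request that identifies the crawler and its developer."""
    return Request(url, headers={"User-Agent": USER_AGENT, "From": CONTACT})

request = polite_request("http://www.indiana.edu/")
```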

Crawler etiquette (important!)

• Spread the load, do not overwhelm a server
  – Make sure that no more than some max. number of requests go to any single server per unit time, say < 1/second
• Honor the Robot Exclusion Protocol
  – A server can specify which parts of its document tree any crawler is or is not allowed to crawl by a file named ‘robots.txt’ placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
  – Crawler should always check, parse, and obey this file before sending any requests to a server
  – More info at:
    • http://www.google.com/robots.txt
    • http://www.robotstxt.org/wc/exclusion.html
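
Spreading the load can be sketched as a per-host schedule: for every request, compute how long to wait so that requests to the same host are at least a minimum delay apart. `PolitenessGate` is a hypothetical class, and the clock value is passed in explicitly so the logic can be checked without sleeping.

```python
from urllib.parse import urlsplit

class PolitenessGate:
    """Space out requests to each host by at least min_delay seconds."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.next_allowed = {}      # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).hostname or ""
        ready = self.next_allowed.get(host, now)
        wait = max(0.0, ready - now)
        self.next_allowed[host] = max(ready, now) + self.min_delay
        return wait
```

In a real crawler `now` would come from `time.monotonic()` and the caller would sleep for the returned time, or pick a URL from another host in the meantime.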

More on robot exclusion

• Make sure URLs are canonical before checking against robots.txt

• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler

• Let’s look at some examples to understand the protocol…
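
Before turning to the examples, the first two points above can be sketched in Python. The canonicalization here is deliberately minimal (lowercase scheme and host, drop the default port and the fragment, default the path to "/") and is an assumption of this sketch, not a complete set of rules. `RobotsCache` and `fetch_text` are hypothetical names; `fetch_text` stands for whatever function the crawler uses to download a robots.txt file as text.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Lowercase scheme and host, drop default port and fragment, default path to '/'."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


class RobotsCache:
    """Fetch each site's robots.txt once and reuse the parsed policy."""

    def __init__(self, user_agent, fetch_text):
        self.user_agent = user_agent
        self._fetch_text = fetch_text  # hypothetical callable: robots.txt URL -> text
        self._policies = {}            # site root -> parsed policy

    def allowed(self, url):
        url = canonicalize(url)
        parts = urlsplit(url)
        site = f"{parts.scheme}://{parts.netloc}"
        if site not in self._policies:
            parser = RobotFileParser()
            parser.parse(self._fetch_text(site + "/robots.txt").splitlines())
            self._policies[site] = parser
        return self._policies[site].can_fetch(self.user_agent, url)
```

A long-running crawler would also expire cached policies after some time, since site administrators can change robots.txt.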

www.apple.com/robots.txt

# robots.txt for http://www.apple.com/

User-agent: *
Disallow:

All crawlers… can go anywhere!

www.microsoft.com/robots.txt

# Robots.txt file for http://www.microsoft.com

User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
#etc…

All crawlers… are not allowed in these paths…

www.springer.com/robots.txt

# Robots.txt for http://www.springer.com (fragment)

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

• Googlebot: Google crawler is allowed everywhere except these paths
• slurp, MSNBot: Yahoo and MSN/Windows Live are allowed everywhere but should slow down
• scooter: AltaVista has no limits
• *: Everyone else keep off!
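
The Springer fragment can be checked with Python's standard `urllib.robotparser`, as a rough sketch of how a crawler might apply such a policy. The test URL path and the agent name `ExampleCrawler` are hypothetical. The Googlebot rules are left out of the checks because wildcard (`*`) handling in paths differs between parsers and is not relied on here.

```python
from urllib.robotparser import RobotFileParser

SPRINGER_FRAGMENT = """\
User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SPRINGER_FRAGMENT.splitlines())

# Hypothetical page used only to exercise the rules.
page = "http://www.springer.com/uk/page.html"

for agent in ("slurp", "scooter", "ExampleCrawler"):
    allowed = parser.can_fetch(agent, page)   # may this agent fetch the page?
    delay = parser.crawl_delay(agent)         # Crawl-delay value, or None
    print(f"{agent}: allowed={allowed}, crawl_delay={delay}")
```

An agent without its own record, such as the hypothetical `ExampleCrawler`, falls under the final `User-agent: *` record and should stay away from the whole site.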

More crawler ethics issues

• Is compliance with robot exclusion a matter of law?
– No! Compliance is voluntary, but if you do not comply, you may be blocked
– Someone (unsuccessfully) sued Internet Archive over a robots.txt related issue
• Some crawlers disguise themselves
– Using false User-Agent
– Randomizing access frequency to look like a human/browser
– Example: click fraud for ads

More crawler ethics issues

• Servers can disguise themselves, too
– Cloaking: present different content based on User-Agent
– E.g. stuff keywords on the version of the page shown to the search engine crawler
– Search engines do not look kindly on this type of “spamdexing” and remove sites that perform such abuse from their index
  • Case of bmw.de made the news

Outline

• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments

New developments: social, collaborative, federated crawlers

• Idea: go beyond the “one-fits-all” model of centralized search engines

• Extend the search task to anyone, and distribute the crawling task

• Each search engine is a peer agent

• Agents collaborate by routing queries and results

Need crawling code?

• Reference C implementation of HTTP, HTML parsing, etc.
– w3c-libwww package from World-Wide Web Consortium: www.w3c.org/Library/
• LWP (Perl)
– http://www.oreilly.com/catalog/perllwp/
– http://search.cpan.org/~gaas/libwww-perl-5.804/
• Open source crawlers/search engines
– Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
– Heritrix: http://crawler.archive.org/
– WIRE: http://www.cwr.cl/projects/WIRE/
– Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Open source topical crawlers, Best-First-N (Java)
– http://informatics.indiana.edu/fil/IS/JavaCrawlers/
• Evaluation framework for topical crawlers (Perl)
– http://informatics.indiana.edu/fil/IS/Framework/