Crawling the Web: problems and techniques
Claudio Scordino, Ph.D. Student
Computer Science Department - University of Pisa
May 2004

Transcript of "Crawling the Web: problems and techniques" (48 slides).

Page 1: Crawling the Web: problems and techniques

Claudio Scordino, Ph.D. Student

Crawling the Web: problems and techniques

Computer Science Department - University of Pisa

May 2004

Page 2: Outline

• Introduction

• Crawler architectures

- Increasing the throughput

• What pages we do not want to fetch

- Spider traps

- Duplicates

- Mirrors

Page 3: Introduction

Job of a crawler (or spider): fetching Web pages to a computer where they will be analyzed

The algorithm is conceptually simple, but…

…it is a complex and underestimated activity

Page 4: Famous Crawlers

• Mercator (Compaq, Altavista)

Java

Modular (components loaded dynamically)

Priority-based scheduling for URL downloads

- The algorithm is a pluggable component

Different processing modules for different contents

Checkpointing

- Allows the crawler to recover its state after a failure

- In a distributed crawler, checkpointing is performed by the Queen

Page 5: Famous Crawlers

• GoogleBot (Stanford, Google)

C/C++

• WebBase (Stanford)

• HiWE: Hidden Web Exposer (Stanford)

• Heritrix (Internet Archive)

http://www.crawler.archive.org/

Page 6: Famous Crawlers

• Sphinx (Java)

Visual and interactive environment

Relocatable: capable of executing on a remote host

Site-specific

- Customizable crawling

- Classifiers: site-specific content analyzers

1. Links to follow

2. Parts to process

- Not scalable

Page 7: Crawler Architecture

[Crawler architecture diagram: seed URLs, URL FRONTIER, SCHEDULER with Load Monitor, Crawl Metadata and Hosts, RETRIEVERS (DNS, HTTP) fetching from the Internet, PARSER with HREFs extractor and normalizer, Citations, URL Filter, Duplicate URL Eliminator.]

Page 8: Web masters annoyed

Web server administrators could be annoyed by:

1. Server overload

- Solution: per-server queues

2. Fetching of private pages

- Solution: Robot Exclusion Protocol

- File: /robots.txt
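
A minimal sketch of honouring the Robot Exclusion Protocol using Python's standard urllib.robotparser; the user agent string and URLs below are illustrative only, not taken from any crawler described here.

    # Sketch: check /robots.txt before fetching a URL.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()                       # fetch and parse /robots.txt

    if rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
        print("allowed: fetch the page")
    else:
        print("disallowed by robots.txt: skip it")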

Page 9: Crawler Architecture

[Architecture diagram extended with per-server queues and a Robots (robots.txt) component.]

Page 10: Mercator's scheduler

FRONT-END: prioritizes URLs with a value between 1 and k

BACK-END: ensures politeness (no server overload)

- Queues containing URLs of only a single host

- Specifies when a server may be contacted again
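
The back-end idea can be sketched as per-host FIFO queues plus a record of when each host may be contacted again. This is only a simplified illustration of the politeness mechanism, not Mercator's actual implementation; the 30-second courtesy delay is an arbitrary example value.

    # Sketch of a politeness back-end: one FIFO queue per host plus the
    # earliest time the host may be contacted again (illustrative only).
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    COURTESY_DELAY = 30.0           # example value, in seconds

    class PolitenessBackend:
        def __init__(self):
            self.queues = defaultdict(deque)   # host -> queue of URLs
            self.next_contact = {}             # host -> earliest allowed contact time

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def next_url(self):
            """Return a URL whose host may be contacted now, or None."""
            now = time.time()
            for host, queue in self.queues.items():
                if queue and self.next_contact.get(host, 0.0) <= now:
                    self.next_contact[host] = now + COURTESY_DELAY
                    return queue.popleft()
            return None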

Page 11: Increasing the throughput

Parallelize the process to fetch many pages at the same time (~thousands per second).

Possible levels of parallelization: DNS, HTTP, Parsing

Page 12: Domain Name resolution

Problem: DNS requires time to resolve the server hostname

Page 13: Domain Name resolution

1. Asynchronous DNS resolver:

• Concurrent handling of multiple outstanding requests

• Not provided by most UNIX implementations of gethostbyname

• GNU ADNS library: http://www.chiark.greenend.org.uk/~ian/adns/

• Mercator reduced the share of each thread's elapsed time spent on DNS from 87% to 25%
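
GNU ADNS is a C library; as a language-neutral illustration of keeping many lookups outstanding at once, here is a small Python sketch that resolves hostnames concurrently with a thread pool (a stand-in for a truly asynchronous resolver, not ADNS itself). The hostnames are placeholders.

    # Sketch: overlap many DNS lookups instead of resolving them one by one.
    import socket
    from concurrent.futures import ThreadPoolExecutor

    hostnames = ["www.unipi.it", "www.example.com", "www.example.org"]

    def resolve(host):
        try:
            return host, socket.gethostbyname(host)
        except socket.gaierror:
            return host, None

    with ThreadPoolExecutor(max_workers=50) as pool:
        for host, address in pool.map(resolve, hostnames):
            print(host, "->", address)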

Page 14: Domain Name resolution

2. Customized DNS component:

• Caching server with a persistent cache largely residing in memory

• Prefetching:

- Hostnames are extracted from HREFs and requests are made to the caching server

- Does not wait for the resolution to be completed
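
A rough sketch of the caching/prefetching idea, assuming lookups go through a thread pool: hostnames extracted from HREFs are submitted to the resolver without waiting, and later fetches ideally find the answer already in the in-memory cache. This is an illustration of the concept, not any crawler's actual DNS component.

    # Sketch of a DNS cache with fire-and-forget prefetching (illustrative).
    import socket
    from concurrent.futures import ThreadPoolExecutor

    class CachingResolver:
        def __init__(self, workers=20):
            self.cache = {}                            # hostname -> IP address
            self.pool = ThreadPoolExecutor(max_workers=workers)

        def _resolve(self, host):
            try:
                self.cache[host] = socket.gethostbyname(host)
            except socket.gaierror:
                self.cache[host] = None

        def prefetch(self, host):
            """Called while parsing HREFs: do not wait for the result."""
            if host not in self.cache:
                self.pool.submit(self._resolve, host)

        def lookup(self, host):
            """Called when the URL is actually fetched: ideally a cache hit."""
            if host not in self.cache:
                self._resolve(host)                    # cache miss: resolve now
            return self.cache[host]

    resolver = CachingResolver()
    resolver.prefetch("www.example.com")               # while parsing
    print(resolver.lookup("www.example.com"))          # later, when fetching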

Page 15: Crawler Architecture

[Architecture diagram updated with an asynchronous DNS resolver client, DNS prefetch requests and a DNS cache, in addition to the per-server queues and Robots component.]

Page 16: Page retrieval

Problem: HTTP requires time to fetch a page

1. Multithreading

• Blocking system calls (synchronous I/O)

• pthreads multithreading library

• Used in Mercator, Sphinx, WebRace

• Sphinx uses a monitor to determine the optimal number of threads at runtime

• Mutual exclusion overhead
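
A minimal multithreaded retrieval sketch (blocking I/O, one page per worker thread), meant only to illustrate the approach; the URL list and thread count are arbitrary examples.

    # Sketch: fetch pages with a pool of worker threads using blocking I/O.
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    urls = ["http://example.com/", "http://example.org/"]

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.read()

    with ThreadPoolExecutor(max_workers=100) as pool:   # thread count is arbitrary
        for url, body in pool.map(fetch, urls):
            print(url, len(body), "bytes")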

Page 17: Page retrieval

2. Asynchronous sockets

• Do not block the process/thread

• select monitors several sockets at the same time

• No mutual exclusion needed, since completions are serialized (the code that finishes processing a page is not interrupted by other completions)

• Used in IXE (1024 connections at once)
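
A single-threaded sketch of the select()-based approach, not IXE's code: each socket sends a bare HTTP/1.0 GET and responses are collected as sockets become readable. Hostnames and error handling are kept minimal for illustration.

    # Sketch: fetch several pages in one thread using non-blocking sockets + select().
    import select
    import socket

    def fetch_all(hosts, path="/", timeout=30.0):
        socks = {}                      # socket -> hostname
        buffers = {}                    # socket -> bytes received so far
        for host in hosts:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setblocking(False)
            s.connect_ex((host, 80))    # returns immediately (connection in progress)
            socks[s] = host
            buffers[s] = b""

        pending_write = set(socks)      # sockets that still have to send the request
        while socks:
            readable, writable, _ = select.select(list(socks), list(pending_write), [], timeout)
            if not readable and not writable:
                break                   # global timeout
            for s in writable:          # connection established: send the request
                request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, socks[s])
                s.sendall(request.encode("ascii"))
                pending_write.discard(s)
            for s in readable:          # response data available
                data = s.recv(4096)
                if data:
                    buffers[s] += data
                else:                   # server closed the connection: page complete
                    host = socks.pop(s)
                    pending_write.discard(s)
                    s.close()
                    yield host, buffers.pop(s)

    if __name__ == "__main__":
        for host, page in fetch_all(["example.com", "example.org"]):
            print(host, len(page), "bytes")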

Page 18: Page retrieval

3. Persistent connections

• Multiple documents requested over a single connection

• Feature of HTTP 1.1

• Reduces the number of HTTP connection setups

• Used in IXE
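
A small sketch of requesting several documents over one HTTP/1.1 connection with Python's http.client (keep-alive is the default in HTTP 1.1); the host and paths are placeholders.

    # Sketch: several documents over a single persistent HTTP/1.1 connection.
    import http.client

    conn = http.client.HTTPConnection("example.com")    # one TCP connection
    for path in ["/", "/index.html"]:                    # placeholder paths
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()          # must be read fully before the next request
        print(path, response.status, len(body), "bytes")
    conn.close()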

Page 19: IXE Crawler

[IXE crawler architecture diagram: a Scheduler with host queues and a Feeder, several Retriever threads using select(), a Parser, and shared data structures (Cache, CrawlInfo, Table<UrlInfo>, Citations, Hosts, Robots, UrlEnumerator); the legend distinguishes threads, synchronization objects, and data kept in memory versus on disk.]

Page 20: IXE Parser

• Problem: parsing requires 30% of execution time

• Possible solution: distributed parsing

Page 21: IXE Parser

[Diagram: the Parser sends extracted URLs (URL1, URL2) to the URL Table Manager ("Crawler"), which holds the Cache, the Table<UrlInfo> and the Citations, and returns the corresponding DocIDs (DocID1, DocID2).]

Page 22: A distributed parser

[Diagram: the Scheduler assigns pages to one of N parsers (Sched() → Parser1); each parser routes every extracted URL to a URL Table Manager by hashing it (Hash(URL1) → Manager2, Hash(URL2) → Manager1); each manager owns a partition of the Table<UrlInfo> and, via its Cache, answers with the existing DocID on a HIT or assigns a new DocID on a MISS.]

Page 23: A distributed parser

• Does this solution scale?

- High traffic on the main link

• Suppose that:

- Average page size = 10KB

- Average out-links per page = 10

- URL size = 40 characters (40 bytes)

- DocID size = 5 bytes

• X = throughput (pages per second)

• N = number of parsers

Page 24: A distributed parser

• Bandwidth for web pages:

- X * 10 * 1024 * 8 = 81920*X bps
  (throughput × 10 KB page size × byte→bit)

• Bandwidth for DocID messages (hit):

- (X/N) * 10 * (40+5) * 8 * N = 3600*X bps
  (pages per parser × out-links per page × (DocID request 40 B + DocID reply 5 B) × byte→bit × number of parsers)

• Using 100 Mbps: X ≈ 1226 pages per second
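
A quick check of the arithmetic, assuming "100 Mbps" is taken as 100 × 2^20 bps (the interpretation that reproduces the figure on the slide):

    # Check of the bandwidth estimate: page traffic plus DocID request/reply traffic.
    page_bw_per_page = 10 * 1024 * 8        # 10 KB page -> 81920 bits
    msg_bw_per_page = 10 * (40 + 5) * 8     # 10 out-links * (40 B URL + 5 B DocID) -> 3600 bits
    link = 100 * 2 ** 20                    # "100 Mbps" interpreted as 100 * 2^20 bps

    X = link / (page_bw_per_page + msg_bw_per_page)
    print(round(X))                         # -> 1226 pages per second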

Page 25: What we don't want to fetch

1.Spider traps

2.Duplicates

2.1 Different URLs for the same page

2.2 Already visited URLs

2.3 Same document on different sites

2.4 Mirrors

• At least 10% of the hosts are mirrored

Page 26: Spider traps

• Spider trap: hyperlink graph constructed unintentionally or malevolently to keep a crawler trapped

1. Infinitely “deep” Web sites

• Problem: using CGI it is possible to generate an infinite number of pages

• Solution: check the URL length

Page 27: Spider traps

2. Large number of dummy pages

• Example: http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/hatchline/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/

• Solution: disable crawling

- A guard removes from consideration any URL from a site which dominates the collection
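
A sketch of both defences: rejecting suspiciously long URLs and ignoring sites that dominate the collection. The thresholds (MAX_URL_LENGTH, MAX_PAGES_PER_SITE) are invented for the example.

    # Sketch of two spider-trap defences (thresholds are illustrative only).
    from collections import Counter
    from urllib.parse import urlparse

    MAX_URL_LENGTH = 256        # reject URLs longer than this
    MAX_PAGES_PER_SITE = 10000  # "guard": ignore sites dominating the collection

    pages_per_site = Counter()

    def should_fetch(url):
        if len(url) > MAX_URL_LENGTH:
            return False                      # probably an infinitely "deep" site
        host = urlparse(url).netloc
        if pages_per_site[host] >= MAX_PAGES_PER_SITE:
            return False                      # site dominates the collection
        pages_per_site[host] += 1
        return True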

Page 28: Avoid duplicates

• Problem almost nonexistent in classic IR

• Duplicate content

• wastes resources (index space)

• annoys users

Page 29: Virtual Hosting

• Problem: Virtual Hosting

• Allows mapping different sites to a single IP address

• Could be used to create duplicates

• Feature of HTTP 1.1

• Rely on canonical hostnames (CNAMEs) provided by DNS

[Example: http://www.cocacola.com and http://www.coke.com both map to 129.33.45.163.]
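
One way to obtain a canonical name from Python is socket.gethostbyname_ex, which returns the primary hostname together with its aliases and addresses; what the resolver reports as canonical depends on the DNS configuration, so treat this as a sketch. The hostnames are placeholders.

    # Sketch: map a hostname to its canonical name and IP addresses via DNS.
    import socket

    for host in ["www.example.com", "www.example.org"]:
        canonical, aliases, addresses = socket.gethostbyname_ex(host)
        print(host, "->", canonical, addresses)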

Page 30: Already visited URLs

• Problem: how to recognize an already visited URL ?

• The page is reachable by many paths

• We need an efficient Duplicate URL Eliminator

Page 31: Already visited URLs

1. Bloom Filter

• Probabilistic data structure for set membership testing

• Problem: false positives

- new URLs marked as already seen

[Diagram: the URL is fed to n hash functions; each hash function selects one bit (0/1) in a bit vector.]
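
A minimal Bloom filter sketch for "URL already seen?" checks. The filter size and number of hash functions are illustrative; a real crawler would size them from the expected number of URLs and the target false-positive rate.

    # Sketch: Bloom filter over URLs (illustrative parameters).
    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=8 * 1024 * 1024, k_hashes=4):
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, url):
            # Derive k bit positions from salted MD5 digests of the URL.
            for i in range(self.k):
                digest = hashlib.md5(b"%d:" % i + url.encode("utf-8")).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, url):
            # May return True for a URL never added (false positive),
            # never False for a URL that was added.
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    seen = BloomFilter()
    seen.add("http://www.di.unipi.it/")
    print("http://www.di.unipi.it/" in seen)   # True
    print("http://example.com/new" in seen)    # False (almost certainly)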

Page 32: Already visited URLs

2. URL hashing

• MD5: URL → 128-bit digest

• Using a 64-bit hash function, a billion URLs require 8 GB

- Does not fit in memory

- Using the disk limits the crawling rate to 75 downloads per second

Page 33: Already visited URLs

3. Two-level hash function

• The crawler is likely to explore URLs within the same site

• Relative URLs create a spatiotemporal locality of access

• Exploit this kind of locality using a cache

[Two-level key: hostname+port hashed to 24 bits, path hashed to 40 bits.]
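
A sketch of the two-level key, assuming (as on the slide) 24 bits for hostname+port and 40 bits for the path: URLs from the same site share the high-order bits, so they cluster together and a cache of recent keys works well.

    # Sketch: 64-bit two-level URL key (24-bit host hash + 40-bit path hash).
    import hashlib
    from urllib.parse import urlparse

    def url_key(url):
        parts = urlparse(url)
        host = parts.netloc.encode("utf-8")                            # hostname (and port)
        path = (parts.path or "/").encode("utf-8")
        host_hash = int.from_bytes(hashlib.md5(host).digest()[:3], "big")   # 24 bits
        path_hash = int.from_bytes(hashlib.md5(path).digest()[:5], "big")   # 40 bits
        return (host_hash << 40) | path_hash            # same site -> same 24-bit prefix

    print(hex(url_key("http://www.di.unipi.it/index.html")))
    print(hex(url_key("http://www.di.unipi.it/research/")))  # shares the 24-bit prefix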

Page 34: Content based techniques

• Problem: how to recognize duplicates based on the page contents?

1. Edit distance

• Number of replacements required to transform one document into the other

• Cost: l1*l2, where l1 and l2 are the lengths of the documents: impractical!

Page 35: Content based techniques

Problem: pages could have minor syntactic differences!

• site maintainer's name, latest update

• anchors modified

• different formatting

2. Hashing

• A digest associated with each crawled page

• Used in Mercator

• Cost: one seek in the index for each newly crawled page

Page 36: Content based techniques

3.Shingling

• Shingle (or q-gram): contiguous subsequence of tokens taken from document d

• representable by a fixed length integer

• w-shingle: shingle of width w

• S(d,w): w-shingling of document d

• unordered set of distinct w-shingles contained in document d

Page 37: Content based techniques

Sentence: a rose is a rose is a rose

Tokens: a, rose, is, a, rose, is, a, rose

4-shingles:

- a,rose,is,a

- rose,is,a,rose

- is,a,rose,is

- a,rose,is,a

- rose,is,a,rose

S(d,4): { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
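
A minimal sketch of w-shingling that reproduces the example above (an illustration, not any particular system's implementation):

    # Sketch: compute the w-shingles and the w-shingling S(d, w) of a token list.
    def shingles(tokens, w):
        """All w-shingles of a token list, in order (with repetitions)."""
        return [tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]

    def shingling(tokens, w):
        """S(d, w): the set of distinct w-shingles of the document."""
        return set(shingles(tokens, w))

    tokens = "a rose is a rose is a rose".split()
    print(shingles(tokens, 4))   # 5 shingles, two of them repeated
    print(shingling(tokens, 4))  # 3 distinct shingles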

Page 38: Content based techniques

• Each token = 32 bits

• w = 10 (a suitable value)

• S(d,10) = set of 320-bit numbers (each w-shingle = 320 bits)

• We can hash the w-shingles and keep 500 bytes of digests for each document

Page 39: Content based techniques

• Resemblance of documents d1 and d2 (Jaccard coefficient):

r(d1,d2) = |S(d1,w) ∩ S(d2,w)| / |S(d1,w) ∪ S(d2,w)|

• Eliminate pages that are too similar (pages whose resemblance value is close to 1)
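
A self-contained sketch of the resemblance measure over w-shinglings (documents and w value are just the running example):

    # Sketch: resemblance (Jaccard coefficient) of two documents via w-shingling.
    def shingling(tokens, w):
        return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

    def resemblance(tokens1, tokens2, w):
        s1, s2 = shingling(tokens1, w), shingling(tokens2, w)
        return len(s1 & s2) / len(s1 | s2) if (s1 or s2) else 1.0

    d1 = "a rose is a rose is a rose".split()
    d2 = "a rose is a rose".split()
    print(resemblance(d1, d2, 4))   # 2 shared shingles / 3 distinct -> 0.666...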

Page 40: Mirrors

[URL anatomy, using http://www.research.digital.com/SRC/ as an example: access method (http), hostname (www.research.digital.com), path (/SRC/).]

• Precision = relevant retrieved docs / retrieved docs

Page 41: Mirrors

1. URL String based

• Vector Space model: term vector matching to compute the likelihood that a pair of hosts are mirrors

• Terms with df(t) < 100

Page 42: Mirrors

a) Hostname matching

• Terms: substrings of the hostname

• Term weighting:

weight(t) = log(len(t)) / (1 + log(df(t)))

where len(t) = number of segments obtained by breaking the term at ‘.’ characters

• This weighting favours substrings composed of many segments, which are very specific

• Precision: 27%

Page 43: Mirrors

b) Full path matching

• Terms: entire paths

• Term weighting:

weight(t) = log(1 + maxdf / df(t))

where maxdf = max df(t) over all terms t in the collection

• Precision: 59%

Connectivity based filtering stage (+19%):

• Idea: mirrors share many common paths

• Testing for each common path whether it has the same set of out-links on both hosts

• Remove hostnames from local URLs

Page 44: Mirrors

c) Positional word bigram matching

• Terms creation:

• Break the path into a list of words by treating ‘/’ and ‘.’ as breaks

• Eliminate non-alphanumeric characters

• Replace digits with ‘*’ (effect similar to stemming)

• Combine successive pairs of words in the list

• Append the ordinal position of the first word

• Precision: 72%

Page 45: Mirrors

Example path: conferences/d299/advanceprogram.html

Words: conferences, d*, advanceprogram, html

Positional word bigrams:

- conferences_d*_0

- d*_advanceprogram_1

- advanceprogram_html_2
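
A small sketch of the term-creation steps listed above, reproducing this example (an illustration, not the authors' original code):

    # Sketch: positional word bigrams for a URL path.
    import re

    def positional_word_bigrams(path):
        # Break the path into words at '/' and '.', drop empty pieces.
        words = [w for w in re.split(r"[/.]", path) if w]
        # Keep only alphanumeric characters, then replace digit runs with '*'.
        words = [re.sub(r"\d+", "*", re.sub(r"[^0-9A-Za-z]", "", w)) for w in words]
        # Combine successive pairs of words and append the position of the first.
        return ["%s_%s_%d" % (words[i], words[i + 1], i) for i in range(len(words) - 1)]

    print(positional_word_bigrams("conferences/d299/advanceprogram.html"))
    # ['conferences_d*_0', 'd*_advanceprogram_1', 'advanceprogram_html_2']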

Page 46: Mirrors

2. Host connectivity based

• Consider all documents on a host as a single large document

• Graph:

- host → node

- document on host A pointing to a document on host B → directed edge from A to B

• Idea: two hosts are likely to be mirrors if their nodes point to the same nodes

• Term vector matching

- Terms: set of nodes that a host's node points to

• Precision: 45%

Page 47: References

S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002. Pages 17-43, 71-72.

S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the 7th World Wide Web Conference (WWW7), 1998.

A. Heydon and M. Najork, Mercator: A Scalable, Extensible Web Crawler, World Wide Web Conference, 1999.

K. Bharat, A. Broder, J. Dean, M. R. Henzinger, A Comparison of Techniques to Find Mirrored Hosts on the WWW, Journal of the American Society for Information Science, 2000.

Page 48: References

A. Heydon and M. Najork, High-Performance Web Crawling, SRC Research Report 173, Compaq Systems Research Center, 26 September 2001.

R. C. Miller and K. Bharat, SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers, Proceedings of the 7th World Wide Web Conference (WWW7), 1998.

D. Zeinalipour-Yazti and M. Dikaiakos. Design and Implementation of a Distributed Crawler and Filtering Processor, Proceedings of the 5th Workshop on Next Generation Information Technologies and Systems (NGITS 2002), June 2002.