Web Crawling and Automatic Discovery
Donna Bergmark
Cornell Information Science
[email protected]
CS502 Web Information Systems, March 26, 2003

Web Resource Discovery

• Finding info on the Web
  – Surfing (random strategy; goal is serendipity)
  – Searching (inverted indices; specific info)
  – Crawling (follow links; “all” the info)
• Uses for crawling
  – Find stuff
  – Gather stuff
  – Check stuff

Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Crawlers and internet history

• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries

So, why not write a robot?
You’d think a crawler would be easy to write:
Pick up the next URL
Connect to the server
GET the URL
When the page arrives, get its links
(optionally do other stuff)
REPEAT
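
That loop is short enough to sketch directly. Below is a minimal, hypothetical Python version (names like naive_crawl are made up for illustration); it deliberately omits everything the rest of this lecture is about: politeness, robots.txt, canonicalization, and trap defenses.

    # Minimal sketch of the naive crawl loop above (illustration only).
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects absolute URLs from <a href=...> tags."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def naive_crawl(seed, max_pages=100):
        frontier = [seed]                # URLs waiting to be fetched
        seen = {seed}
        while frontier and max_pages > 0:
            url = frontier.pop(0)        # pick up the next URL
            max_pages -= 1
            try:                         # connect to the server; GET the URL
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue                 # skip unreachable or malformed URLs
            parser = LinkExtractor(url)
            parser.feed(page)            # when the page arrives, get its links
            for link in parser.links:    # (optionally do other stuff here)
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)

As the next slides show, each step of this innocent-looking loop hides real engineering.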

The Central Crawler Function

[Diagram: per-server URL queues (Server 1 queue, Server 2 queue, Server 3 queue) feed the fetcher, which resolves each URL to an IP address via DNS, connects a socket to the server and sends an HTTP request, then waits for the response: an HTML page.]
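
A sketch of that central fetch step, written against raw sockets so the three stages (DNS, connect/send, wait) are visible; the fetch helper is hypothetical and uses HTTP/1.0 so that end-of-response is simply end-of-connection.

    # Sketch of the central fetch step: DNS, socket, request, response.
    import socket

    def fetch(host, path="/", port=80):
        ip = socket.gethostbyname(host)           # URL -> IP address via DNS
        sock = socket.create_connection((ip, port), timeout=10)
        request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
        sock.sendall(request.encode("ascii"))     # send the HTTP request
        chunks = []
        while True:                               # wait for the response
            data = sock.recv(4096)
            if not data:                          # server closed the connection
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)                   # headers + the HTML page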

Handling the HTTP Response

[Diagram: FETCH hands each retrieved document to a “document seen before?” test; if not seen, PROCESS the document: extract text, extract links, and so on.]
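
One common way to implement the “seen before?” test is to fingerprint the document content; the sketch below uses SHA-1, which is an assumption for illustration (production crawlers such as Mercator use cheaper fixed-size checksums).

    # Sketch of a "document seen before?" test via content fingerprints.
    import hashlib

    seen_fingerprints = set()

    def is_new_document(page_bytes):
        digest = hashlib.sha1(page_bytes).digest()
        if digest in seen_fingerprints:
            return False          # duplicate content: skip processing
        seen_fingerprints.add(digest)
        return True               # new document: extract text and links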

LINK Extraction

• Finding the links is easy (sequential scan)
• Need to clean them up and canonicalize them (see the sketch after this list)
• Need to filter them
• Need to check for robot exclusion
• Need to check for duplicates
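
The clean-up/canonicalization step matters because duplicate checking only works if equivalent URLs reduce to one form. A sketch, assuming one plausible set of rules (lowercase scheme and host, drop the fragment, drop a default port, supply an empty path):

    # Sketch of URL canonicalization; the exact rules are a design choice.
    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        parts = urlsplit(url.strip())
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower()
        default_port = {"http": 80, "https": 443}.get(scheme)
        if parts.port in (None, default_port):
            netloc = host                          # drop a redundant port
        else:
            netloc = "{}:{}".format(host, parts.port)
        path = parts.path or "/"                   # empty path -> "/"
        return urlunsplit((scheme, netloc, path, parts.query, ""))

For example, canonicalize("HTTP://Example.COM:80/a.html#top") yields "http://example.com/a.html", so the duplicate check sees a single form.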

Update the Frontier

[Diagram: PROCESS appends the newly extracted links (URL1, URL2, URL3, ...) to the FRONTIER, the queue from which FETCH takes its next URL.]

Crawler Issues
• System Considerations
• The URL itself
• Politeness
• Visit Order
• Robot Traps
• The hidden web

Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbid access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
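
Adherence may be voluntary, but checking is cheap. Python’s standard library ships a parser for this format; the sketch below uses the slide’s placeholder server, and the user-agent string "MyCrawler/1.0" is made up.

    # Sketch of a polite pre-fetch check against robots.txt.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://any-server:80/robots.txt")   # maintained by the webmaster
    rp.read()                                        # fetch and parse the file

    # Ask before fetching; /cgi-bin/ is commonly excluded.
    if rp.can_fetch("MyCrawler/1.0", "http://any-server/cgi-bin/search"):
        pass  # allowed: go ahead and fetch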

Visit Order
• The frontier
• Breadth-first: FIFO queue
• Depth-first: LIFO queue
• Best-first: Priority queue
• Random
• Refresh rate
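
The visit order falls directly out of whichever data structure holds the frontier. A sketch (the URLs and scores are made up):

    # Sketch: each visit order corresponds to a frontier data structure.
    import heapq
    from collections import deque

    # Breadth-first: FIFO queue
    bfs = deque(["http://example.org/"])
    bfs.append("http://example.org/a")     # new links go to the back
    nxt = bfs.popleft()                    # oldest link is crawled first

    # Depth-first: LIFO queue (a stack)
    dfs = ["http://example.org/"]
    dfs.append("http://example.org/a")
    nxt = dfs.pop()                        # newest link is crawled first

    # Best-first: priority queue keyed on an estimated score
    best = []
    heapq.heappush(best, (-0.9, "http://example.org/a"))  # negate: heapq is a min-heap
    heapq.heappush(best, (-0.4, "http://example.org/b"))
    score, nxt = heapq.heappop(best)       # highest-scoring link first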

Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
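
Crawlers typically defend against traps with cheap heuristics rather than proofs. A sketch, with limits that are entirely made up and would be tuned in practice:

    # Sketch of simple trap heuristics (limits are placeholders).
    from urllib.parse import urlsplit

    MAX_URL_LENGTH = 256       # cycles often show up as ever-growing URLs
    MAX_PATH_DEPTH = 12        # generated pages tend to have very deep paths
    MAX_PAGES_PER_HOST = 5000  # cap the damage any one server can do

    pages_fetched = {}         # host -> count of pages fetched so far

    def looks_like_trap(url):
        parts = urlsplit(url)
        host = parts.hostname or ""
        if len(url) > MAX_URL_LENGTH:
            return True
        if parts.path.count("/") > MAX_PATH_DEPTH:
            return True
        if pages_fetched.get(host, 0) >= MAX_PAGES_PER_HOST:
            return True
        pages_fetched[host] = pages_fetched.get(host, 0) + 1
        return False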

The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web

Mercator Features
• One file configures a crawl
• Written in Java
• Can add your own code
  – Extend one or more of Mercator’s base classes
  – Add totally new classes called by your own code
• Industrial-strength crawler:
  – uses its own DNS and java.net package

The Web is a BIG Graph
• “Diameter” of the Web
• Cannot completely crawl even the static part
• New technology: the focused crawl

Crawling and Crawlers
• Web overlays the internet
• A crawl overlays the web, growing outward from a seed

Focused Crawling

[Diagram: two crawls of the same graph from a root R. The breadth-first crawl visits pages 1 through 7 in level order; the focused crawl marks off-topic links with an X and visits only pages 1 through 5.]

Focused Crawling

• Recall the cartoon for a focused crawl (the figure above)
• A simple way to do it is with 2 “knobs”

Focusing the Crawl
• Threshold: page is on-topic if correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than this value
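
In code, the two knobs might look like the sketch below. The correlation measure is assumed to be cosine similarity over term-vector dictionaries, and the constant values are placeholders, not the lecture’s settings.

    # Sketch of the two "knobs"; THRESHOLD and CUTOFF values are placeholders.
    import math

    THRESHOLD = 0.3   # on-topic if correlation to the closest centroid >= this
    CUTOFF = 1        # follow links only from pages whose distance < this

    def correlation(page_vec, centroid_vec):
        # Cosine correlation between two term vectors ({term: weight} dicts).
        dot = sum(w * centroid_vec.get(t, 0.0) for t, w in page_vec.items())
        na = math.sqrt(sum(w * w for w in page_vec.values()))
        nb = math.sqrt(sum(w * w for w in centroid_vec.values()))
        return dot / (na * nb) if na and nb else 0.0

    def on_topic(page_vec, centroids):
        # Knob 1: the threshold, against the closest (best-matching) centroid.
        return max(correlation(page_vec, c) for c in centroids) >= THRESHOLD

    def distance(page_vec, centroids, parent_distance):
        # 0 for an on-topic page, else one more hop than its parent.
        return 0 if on_topic(page_vec, centroids) else parent_distance + 1

    def follow_links(dist):
        # Knob 2: the cutoff.
        return dist < CUTOFF

The cutoff is what allows “tunneling” through a short run of off-topic pages instead of pruning at the first miss.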

Illustration

[Diagram: a crawl tree of pages 1 through 7 illustrating Cutoff = 1; pages with corr >= threshold are on-topic, and links are followed only from pages within one hop of an on-topic page.]

Min-avg-max correlation vs. crawl length

[Plot: correlation (0 to 0.8) vs. number of documents downloaded (0 to 120,000), showing Maximum, Average, Minimum, Closest, and Furthest correlation curves.]

Fall 2002 Student Project

[Diagram: project dataflow. A query is expanded into a centroid-based collection description (term vectors feeding centroids and a dictionary); Mercator runs the focused crawl and emits the collection’s URLs; Chebyshev polynomials and HTML appear in the output stage.]

Conclusion
• We covered crawling – history, technology, deployment
• Focused crawling with tunneling
• We have a good experimental setup for exploring automatic collection synthesis