Internet Client-Server Systems - York University
Web Crawlers
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
• pages containing (fairly unstructured) text
• images, audio, etc. embedded in pages
• structure defined using HTML
(Hypertext Markup Language)
• hyperlinks between pages!
• over 2.9 billion pages
• over 16 billion hyperlinks
a giant graph!
What is the Web? (another view)
• pages reside in servers
• related pages in sites
• local versus global links
• logical vs. physical structure
How is the Web organized?
[Diagram: three web servers (hosts): www.poly.edu, www.cnn.com, www.irs.gov]
How the Web Works
[Diagram: the desktop (with browser) asks the web server www.cnn.com: “give me the file ‘/world/index.html’”; the server replies: “here is the file: …”]
Fetching “www.cnn.com/world/index.html”
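The request/response exchange above can be sketched with Python's standard library. The URL is the slide's example; actually calling `fetch` requires network access, so it is left commented out.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

def split_request(url: str):
    """Split a URL into the host the browser connects to and the path it requests."""
    parts = urlsplit(url)
    return parts.netloc, parts.path or "/"

def fetch(url: str) -> str:
    """Issue the HTTP GET ("give me the file ...") and return the body ("here is the file: ...")."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

host, path = split_request("http://www.cnn.com/world/index.html")
# host == "www.cnn.com", path == "/world/index.html"
# html = fetch("http://www.cnn.com/world/index.html")   # needs network access
```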
• more than 2.9 billion pages
• more than 16 billion hyperlinks
• plus images, movies, .. , database content
How do we find pages on the Web?
we need specialized tools for finding
pages and information
Overview of web search tools
• Major search engines
(google, alltheweb, altavista, northernlight, hotbot, excite, go)
• Web directories: (yahoo, open directory project)
• Specialized search engines (cora, csindex, achoo, findlaw)
• Local search engines (for one site)
• Meta search engines (beaucoup, allsearchengines, about)
• Personal search assistants (alexa, zapper)
• Comparison shopping agents (mysimon, dealtime, price)
• Image search (ditto, visoo)
• Natural language questions (askjeeves?, northernlight?)
• Database search (completeplanet, direct, invisibleweb)
Major search engines
Basic structure of a search engine:
[Diagram: Crawler → disks → indexing → Index; a query (“computer”) from Search.com is answered by a look-up in the Index]
Link-based ranking techniques (Example #1):
• PageRank (Brin & Page / Google): “the significance of a page depends on the significance of those referencing it”
• HITS (Kleinberg / IBM): “Hubs and Authorities”
Challenges for search engines:
• coverage (need to cover a large part of the web): must crawl and store massive data sets
• good ranking (especially for broad queries): needs smart information retrieval techniques
• freshness (need to update content): requires frequent recrawling of content
• user load (up to 3,000 queries/sec at Google): many queries over massive data
• manipulation (sites want to be listed first): naïve techniques will be exploited quickly
Web directories:
Challenges:
• designing the topic hierarchy
• automatic classification: “what is this page about?”
• Yahoo and Open Directory are mostly human-edited
Topic hierarchy (example): everything → sports, politics, health, business, …; sports → baseball, hockey, soccer, …; politics → foreign, domestic, …
Specialized search engines:
• be the best on one particular topic
• use domain-specific knowledge
• limited resources: do not crawl the entire web!
• focused crawling techniques
Meta search engines:
• use other search engines to answer questions
• ask the right specialized search engine
• combine results from several large engines
• need to be “familiar” with thousands of engines
Personal Search Assistants: (alexa, zapper)
• embedded into browser
• can suggest “related pages”
• search by “highlighting text” can use the surrounding context
• may exploit individual browsing behavior
• may collect and aggregate browsing information (privacy issues)
• crawl the web themselves (alexa), or use existing search engines (zapper)
Web Search Information System
[Diagram: the user interface handles queries and feedback; the search engine performs crawling, indexing, and query processing over a document repository; a knowledge base with learning and an inference engine supplies ad hoc information; built on large text (multimedia) database technology and data (text) mining technology]
Perspective
[Diagram: web search draws on algorithms, information systems, information retrieval, databases, data mining, machine learning, AI, and user interface design]
Search Engine Architecture:
[Diagram: Crawler → disks → indexing → Index; a query (“computer”) from Search.com is answered by a look-up in the Index]
Web Crawlers
Crawler
[Diagram: the crawler writes fetched pages to disks]
• starts at set of “seed pages”
• fetches pages from the web
• parses fetched pages for hyperlinks
• then follows those links (e.g., BFS)
• variations:
- random walks
- focused crawling
Typical Crawler Architecture
[Diagram: a Seed List initializes the URL DB; the crawler grabs pages from the Internet into page files; discovered URLs feed back into the URL DB; page files are filtered and indexed (Index Build), with auxiliary databases for anchor text, connectivity, duplicates, and aliases]
Web Crawler
◆ Retrieving Module
◆ Processing Module
◆ Formatting Module
◆ URL Listing Module
◆ The order of traversing
➢ Breadth-first
➢ Depth-first
➢ Better pages first
◆ How frequently the index is updated
[Diagram: the World Wide Web feeds the Retrieving Module; pages pass through the Processing Module and the Formatting Module into the Database; the URL Listing Module maintains the URLs to visit]
Mining the World Wide Web (pages 107-110)
What is a Crawler?
[Diagram: init loads the initial URLs into the “to visit” URLs list; the crawler repeatedly gets the next URL, gets the page from the web, and extracts URLs from it, recording visited URLs and web pages]
Simple Crawler Algorithm
Simple-Crawler(S0, D, E)
1   Q ← S0
2   while Q ≠ ∅
3       do u ← DEQUEUE(Q)
4          d(u) ← FETCH(u)
5          STORE(D, (d(u), u))
6          L ← PARSE(d(u))
7          for each v in L
8              do STORE(E, (u, v))
9                 if (v ∉ D and v ∉ Q)
10                    then ENQUEUE(Q, v)
S0 is the set of seed URLs. Q is the “to visit URLs” queue. L is the set of URLs linked from u. D stores the visited URLs with their documents; E stores the hyperlink edges.
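The pseudocode translates almost line for line into Python. In this runnable sketch, FETCH and PARSE are passed in as functions and exercised on a hypothetical three-page toy web (the page names are illustrative, not from the slides):

```python
from collections import deque

def simple_crawler(seed_urls, fetch, parse):
    """Transcription of Simple-Crawler(S0, D, E): D maps each visited URL
    to its document; E collects the hyperlink edges (u, v)."""
    D, E = {}, []
    Q = deque(seed_urls)
    seen = set(seed_urls)          # cheap test for "v in D or v in Q"
    while Q:
        u = Q.popleft()            # u <- DEQUEUE(Q)
        d = fetch(u)               # d(u) <- FETCH(u)
        D[u] = d                   # STORE(D, (d(u), u))
        for v in parse(d):         # L <- PARSE(d(u))
            E.append((u, v))       # STORE(E, (u, v))
            if v not in seen:
                seen.add(v)
                Q.append(v)        # ENQUEUE(Q, v)
    return D, E

# Toy web standing in for FETCH/PARSE: each "document" is its link list.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": []}
D, E = simple_crawler(["a"], fetch=lambda u: toy_web[u], parse=lambda d: d)
# D covers {"a", "b", "c"}; E == [("a", "b"), ("a", "c"), ("b", "c")]
```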
Queue: http://www.semmel.com, http://www.jhuapl.edu/, http://familysearch.com/
How Web Search Engines Work: Indexing
◆ Place seed URLs into a priority queue
◆ Repeatedly
➢ Select next URL from queue
➢ Fetch page
➢ Characterize page
➢ Store characterization in index
➢ Extract links from page
➢ Assign priority to each link
➢ Add links to queue
[Diagram: the indexer characterizes each page fetched from the Web and stores the characterization in the index. Example index entries:
avocado → doc3, doc177
baby → doc3, doc42, doc117
beanie → doc42, doc77, doc193
…
Example page (doc42), “Ralph’s Web Page”: “My favorite color is lavender! I collect Beanie Babies! See pictures of my moss garden!” It is characterized by the terms baby, beanie, collect, color, …; its extracted links (http://www.ty.com/, http://www.semmel.com/garden/, …, from http://www.semmel.com/) are assigned priorities (e.g., 0.943, 0.424) and added to the queue alongside doc3, doc42, doc77, doc117, doc193, …]
How Web Search Engines Work: Retrieval
◆ Retrieve query from user
◆ Characterize query
◆ Use index to find documents that contain query terms
◆ Measure similarity between query and each potentially relevant document
◆ Sort documents by similarity score
◆ Return documents with highest scores to user
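One simple way to realize the “measure similarity” step is cosine similarity over raw term-frequency vectors; real engines add IDF weighting, link analysis, and much more, so this is only a minimal sketch, and the document names and texts below are illustrative:

```python
import math
from collections import Counter

def similarity(query: str, doc: str) -> float:
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Hypothetical mini-collection; sort documents by similarity score.
docs = {"doc42": "lavender beanie babies pictures",
        "doc3": "avocado baby recipes"}
ranked = sorted(docs, key=lambda i: similarity("lavender beanie babies", docs[i]),
                reverse=True)
# ranked[0] == "doc42"
```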
[Diagram: the query “lavender Beanie Babies” is characterized by the terms baby, beanie, lavender and looked up in the index (avocado → doc3, doc177; baby → doc3, doc42, doc117; beanie → doc42, doc77, doc193; …). Similarity scores (doc3: .331, doc42: .924, doc77: .624, doc117: .841, doc193: .118, …) are sorted to give the ranking doc42 (.924), doc117 (.841), doc77 (.624), doc3 (.331), doc193 (.118), …]
Search Results
1. Ralph’s Web Page
2. Ty Homepage
3. Toys R Expensive
4. Caps for Freshmen
5. Bohnanza
6. Ralph’s Lavender Page
7. 404 Not Found
8. Hot Men in Tight Shorts
Crawling Issues
◆ How to crawl?
➢ Quality: “best” pages first
➢ Efficiency: avoid duplication (or near-duplication)
◆ How much to crawl? How much to index?
➢ Coverage: how big is the Web? How much do we cover?
➢ Relative coverage: how much do competitors have?
◆ How often to crawl?
➢ Freshness: how much has changed?
➢ How much has really changed?
◆ Visit order and the hidden web
Visit Order
◆ Breadth-first: FIFO queue
◆ Depth-first: LIFO queue
◆ Best-first: Priority queue
◆ Random
◆ Refresh rate
Breadth First Crawlers
◆ Use breadth-first search (BFS) algorithm
◆ Get all links from the starting page, and
add them to a queue
◆ Pick the 1st link from the queue, get all
links on the page and add to the queue
◆ Repeat above step till queue is empty
Simple Breadth-First Search Crawler
this will eventually download all pages reachable from the start set
insert set of initial URLs into a queue Q
while Q is not empty:
    currentURL = dequeue(Q)
    download page from currentURL
    for each hyperlink found in the page:
        if hyperlink is to a new page:
            enqueue hyperlink URL into Q
Depth First Crawlers
◆ Use depth-first search (DFS) algorithm
◆ Get the 1st non-visited link from the start page
◆ Visit the link and get its 1st non-visited link
◆ Repeat the above step until there are no non-visited links
◆ Go back to the next non-visited link at the previous level and repeat from the 2nd step
Traversal strategies: (why BFS?)
• crawl will quickly spread all over the web
• load-balancing between servers
• in reality, more refined strategies (but still BFSish)
Tools/languages for implementation:
• scripting languages (Python, Perl)
• Java (performance tuning tricky)
• C/C++ with sockets (low-level)
• available crawling tools (usually not scalable)
Focused Crawling
Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics.
- Topics specified by using exemplary documents (not keywords)
- Crawl most relevant links
- Ignore irrelevant parts.
- Leads to significant savings in hardware and network resources.
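A focused crawler can be sketched as best-first search: the frontier is a priority queue ordered by a relevance score, so the most promising links are expanded first and irrelevant regions are never fetched. Here `relevance(url)` stands in for a topic classifier trained on the exemplary documents; the page names, scores, and link structure below are all hypothetical:

```python
import heapq

def focused_crawl(seeds, fetch, parse, relevance, budget=10):
    """Best-first crawl: pop the highest-relevance URL, fetch it, and
    push its out-links scored by relevance. Zero-relevance URLs are skipped."""
    frontier = [(-relevance(u), u) for u in seeds]   # negate: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        neg_score, u = heapq.heappop(frontier)
        if -neg_score <= 0.0:          # ignore irrelevant parts of the graph
            continue
        visited.append(u)
        for v in parse(fetch(u)):
            if v not in seen:
                seen.add(v)
                heapq.heappush(frontier, (-relevance(v), v))
    return visited

# Hypothetical topic scores and toy link structure:
scores = {"sports": 0.9, "hockey": 0.8, "weather": 0.1, "spam": 0.0}
links = {"sports": ["hockey", "spam"], "hockey": ["weather"],
         "weather": [], "spam": []}
order = focused_crawl(["sports"], lambda u: links[u], lambda d: d,
                      lambda u: scores[u])
# order == ["sports", "hockey", "weather"]  -- "spam" is never visited
```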
Web Indexer
Index Issues
◆ How to structure the index
◆ How to create the index (storage, time)
◆ How to store the index (storage, compression)
◆ How to process the index (storage, time)
◆ How to update the index (storage, time)
Inverted File Indexing
◆ Inverted file index
➢ contains a list of terms that appear in the document
collection (called a lexicon or vocabulary)
➢ and for each term in the lexicon, stores a list of
pointers to all occurrences of that term in the
document collection. This list is called an inverted list.
Inverted File Indexing
◆ Inverted file contains:
➢ Postings file: for each term in the lexicon, a list of pointers to all occurrences of that term in the main text, stored in increasing document ID
➢ Lexicon: mapping from terms to pointer lists
Lexicon and Postings File
Lexicon entry: Salmon, frequency 5, PTR → postings <5,23> <12,95> <16,22> <21,12> <25,42>
◼ Document 5: “…The extinction of Atlantic salmon is predicted if actions to preserve stocks are not taken…”
Inverted files
◆ Index information, whether manual or automatic, is stored in an inverted file
◆ Doc. 1: The cat is on the mat
◆ Doc. 2: The mat is on the floor.
Term | no. of occurrences | postings
cat | 1 | 1
floor | 1 | 2
mat | 2 | 1, 2
Structure of Inverted Index
◆ Document-level indexing
No. | Term | Documents
1 | cold | <2; 1,4>
2 | days | <2; 3,6>
(format: <document frequency; list of document IDs>)
◆ Word-level indexing
1 | cold | <2; (1:6), (4:8)>
(format: <document frequency; (document ID : position ID)>)
Structure of Inverted Index
◆ May be a hierarchical set of addresses, e.g.
word number within sentence number within paragraph number within chapter number within volume number within document number
◆ Consider as a vector (d,v,c,p,s,w)
Compression of Inverted Indexes
◆ Uncompressed, maybe 50 to 100% of the size of the text
◆ Compression: store differences rather than document numbers
➢ E.g. (8: 3, 5, 20, 21, 23, 76, 77, 78) → (8: 3, 2, 15, 1, 2, 53, 1, 1)
◆ Then code the differences using global (for all lists) or local (for each list) methods
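The gap transformation above is easy to state in code; decoding is just a running sum. The asserts reproduce the slide's example list:

```python
def gaps(postings):
    """Store differences rather than absolute document numbers."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def ungaps(deltas):
    """Recover the original postings list by a running sum of the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# The example list for a term appearing in 8 documents:
assert gaps([3, 5, 20, 21, 23, 76, 77, 78]) == [3, 2, 15, 1, 2, 53, 1, 1]
assert ungaps([3, 2, 15, 1, 2, 53, 1, 1]) == [3, 5, 20, 21, 23, 76, 77, 78]
```

The small gap values can then be coded in fewer bits than full document numbers, which is where the compression comes from.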
Indexing: (Simplified Approach)
(1) scan through all documents
(2) for every word encountered, generate an entry (word, doc#, pos)
(3) sort entries by (word, doc#, pos)
(4) now transform into final form
doc1: “Bob reads a book”
doc2: “Alice likes Bob”
doc3: “book”
Entries: (bob, 1, 1), (reads, 1, 2), (a, 1, 3), (book, 1, 4), (alice, 2, 1), (likes, 2, 2), (bob, 2, 3), (book, 3, 1)
Sorted: (a, 1, 3), (alice, 2, 1), (bob, 1, 1), (bob, 2, 3), (book, 1, 4), (book, 3, 1), (likes, 2, 2), (reads, 1, 2)
a: (1,3)
Alice: (2, 1)
Bob: (1, 1), (2, 3)
book: (1, 4), (3, 1)
likes: (2, 2)
reads: (1, 2)
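The four steps can be run directly on the slide's three example documents:

```python
def build_index(docs):
    """Simplified indexing: scan documents, emit (word, doc#, pos) entries,
    sort them, then group into word -> [(doc#, pos), ...]."""
    entries = []
    for doc_id, text in docs.items():                    # (1) scan all documents
        for pos, word in enumerate(text.lower().split(), 1):
            entries.append((word, doc_id, pos))          # (2) generate entries
    entries.sort()                                       # (3) sort by (word, doc#, pos)
    index = {}
    for word, doc_id, pos in entries:                    # (4) final form
        index.setdefault(word, []).append((doc_id, pos))
    return index

docs = {1: "Bob reads a book", 2: "Alice likes Bob", 3: "book"}
index = build_index(docs)
# index["bob"] == [(1, 1), (2, 3)]; index["book"] == [(1, 4), (3, 1)]
```

For web-scale collections the same steps need I/O-efficient external sorting, as the later slides note; this in-memory version only illustrates the logic.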
Improvements
• encode sorted runs by their gaps: significant compression for frequent words!
• less effective if we also store positions (adds incompressible lower-order bits)
• many highly optimized schemes have been studied (see Witten/Moffat/Bell)
Document numbers:
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
Gap-encoded:
arm       4, 15, 10, 69, 45, ...
armada    145, 312, 332, ...
armadillo 678, 1456, 1836, ...
armani    90, 166, 116, 139, ...
Additional issues:
• keep data compressed during index construction
• try to keep the index in main memory? (AltaVista)
• keep important parts in memory? (fancy hits in Google)
• use a database to store lists? (e.g., Berkeley DB)
Alternative to inverted index:
• signature files (Bloom filters): false positives
• bitmaps
• better to stick with inverted files (Witten/Moffat/Bell)
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → create an inverted index; search engine servers evaluate the user query against the inverted index, obtain DocIds, and show results to the user]
How Inverted Files Are Created
◆ Periodically rebuilt, static otherwise.
◆ Documents are parsed to extract tokens. These are saved with the Document ID.
Now is the time
for all good men
to come to the aid
of their country
Doc 1
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Doc 2
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
How Inverted Files are Created
◆ After all documents have been parsed, the inverted file is sorted alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
Term Doc # (original unsorted order, for comparison)
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
◆ Multiple term entries for a single document are merged.
◆ Within-document term frequency information is compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Term Doc # (sorted input, before merging)
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
How Inverted Files are Created
◆ Finally, the file can be split into
➢ A Dictionary or Lexicon file
➢ A Postings file
How Inverted Files are Created
Dictionary/Lexicon:
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
Postings
Implementation Based on Inverted Files
[Diagram: the index file stores index terms with their document frequency (df): system (3), computer (2), database (4), science (1); each term points to a postings list of (Dj, tfj) entries, e.g. (D2, 4), (D5, 2), (D1, 3), (D7, 4)]
Inverted Indexes
◆ Permit fast search for individual terms
◆ For each term, you get a list consisting of:
➢ document ID
➢ frequency of term in doc (optional)
➢ position of term in doc (optional)
◆ These lists can be used to solve Boolean queries:
➢ country -> d1, d2
➢ manor -> d2
➢ country AND manor -> d2
◆ Also used for statistical ranking algorithms
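Boolean operators reduce to set operations on the document-ID lists; this small sketch uses Python sets, and the index contents match the example above:

```python
def AND(p1, p2):
    """Intersect two postings lists: documents containing both terms."""
    return sorted(set(p1) & set(p2))

def OR(p1, p2):
    """Union of two postings lists: documents containing either term."""
    return sorted(set(p1) | set(p2))

index = {"country": ["d1", "d2"], "manor": ["d2"]}
assert AND(index["country"], index["manor"]) == ["d2"]
assert OR(index["country"], index["manor"]) == ["d1", "d2"]
```

Production systems instead merge the sorted lists directly (a linear two-pointer scan), which avoids materializing sets for very long postings lists.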
Inverted Indexes for Web Search Engines
◆ Inverted indexes are still used, even though the web is so huge.
◆ Some systems partition the indexes across different machines. Each machine handles different parts of the data.
◆ Other systems duplicate the data across many machines; queries are distributed among the machines.
◆ Most do a combination of these.
Summary
Search Engines
◆ Search engines are the most popular way to locate information online
◆ About 33 million U.S. Internet users query search engines on a typical day
◆ More than 80% of Internet users have used search engines
◆ Search engines are measured by coverage and recency
Search Engine Architecture
[Diagram: data acquisition by a generic crawler, a BFS crawler, and a focused crawler brings WWW content into scalable server components (storage server, index server, graph server); the user-facing side comprises the user interface, user tools, and an admin interface]
Working of a Local Search Engine
[Diagram: the Indexer gets words from the web site documents and stores them in the Index. The user sends a query through the Search Form; the Search Engine looks in the Index, gets matches, and sends formatted results to the Results Page; the user selects the required page and views the retrieved page]
Indexing
disks
• parse & build lexicon & build index
• index very large
I/O-efficient techniques needed
“inverted index”
indexing
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...
Indexing
disks
• how to build an index
- in I/O-efficient manner
- in parallel - later
- …...
• how to compress an index (while building it in situ)
• goal: intermediate size not much larger than final size
“inverted index”
indexing
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...
Basic concepts and choices:
• lexicon: set of all “words” encountered (millions in the case of the web)
• postings: for each word occurrence, store the index of the document where it occurs
• also store the position in the document? (probably yes)
  - increases space for the index significantly!
  - allows efficient search for phrases
  - relative positions of words may be important for ranking
• stop words: common words such as “is”, “a”, “the”
• ignore stop words? (maybe better not)
  - saves space in the index
  - cannot search for “to be or not to be”
• stemming: “runs = run = running” (depends on the language)
Stop Lists
◆ Lists of words which are dropped from processing
◆ From a few words to hundreds; may include single letters
◆ E.g. Dialog: AN, AND, BY, FOR, FROM, OF, THE, TO, WITH
◆ Improve storage efficiency (stop words may be 10 to 50% of the text)
◆ Improve processing efficiency
◆ May cause problems:
➢ “to be or not to be”, “AT&T”
➢ “man of war”, “birds of prey”
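A sketch of stop-word removal using the Dialog list quoted above (lower-cased). The problem cases show what gets lost: “man of war” loses its “of”, and “to be or not to be” is badly degraded:

```python
# Stop list taken from the Dialog example above.
STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}

def remove_stop_words(text: str):
    """Drop stop words from a text, returning the remaining tokens."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

assert remove_stop_words("man of war") == ["man", "war"]
# Much of the famous phrase disappears before it ever reaches the index:
assert remove_stop_words("to be or not to be") == ["be", "or", "not", "be"]
```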
Stemming
◆ Deals with word variation (morphological variants)
➢ E.g.: computer, computers, computing, compute, computed, computational, computationally ➔ comput
◆ Use a stemming algorithm for conflation
➢ Set of rules applied to each word as it is processed
◆ Simplest: combine singular and plural form
◆ Examples: Porter, Lovins, Paice
Simple Stemmer
◆ If a word ends in “ies” but not “eies” or “aies”
➢ then “ies” → “y”
◆ If a word ends in “es” but not “aes”, “ees”, or “oes”
➢ then “es” → “e”
◆ If a word ends in “s” but not “us” or “ss”
➢ then “s” → NULL
◆ (apply only first applicable rule)
◆ e.g. spiders, flies, throes, bees
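The three rules, applying only the first that matches, reproduce the example words:

```python
def simple_stem(word: str) -> str:
    """Simple stemmer: apply only the first applicable of the three rules."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # "ies" -> "y":   flies -> fly
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # "s" -> NULL:    spiders -> spider
    return word

assert [simple_stem(w) for w in ["spiders", "flies", "throes", "bees"]] == \
       ["spider", "fly", "throe", "bee"]
```

Note how “throes” and “bees” fall through the “es” rule (blocked by the “oes”/“ees” exceptions) and are handled by the plain “s” rule instead.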
Impact of Stemmers
◆ May decrease index file size up to 50%
◆ Should increase recall at cost of precision
◆ Studies are equivocal; some improvements
found but not marked
◆ May depend on nature of vocabulary
Availability of Stemmers
◆ Many on Web, e.g. see
➢ http://www.tartarus.org/~martin/PorterStemmer/index.html
for many encodings of Porter Stemmer
◆ Or http://sourceforge.net/projects/stemmers/
for encodings of the Lovins Stemmer
Querying
Boolean queries:
(zebra AND armadillo) OR armani
unions/intersections of lists
look up
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...
Querying and term-based ranking:
Recall Boolean queries:
(zebra AND armadillo) OR armani
unions/intersections of lists
look up
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...