Internet Client-Server Systems - York University

Web Crawlers


Page 1: Internet Client-Server Systems - York University

Web Crawlers

Page 2: Internet Client-Server Systems - York University

Definition

Spider = robot = crawler

Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Page 3: Internet Client-Server Systems - York University

What is the Web? (another view)

• pages containing (fairly unstructured) text

• images, audio, etc. embedded in pages

• structure defined using HTML (Hypertext Markup Language)

• hyperlinks between pages!

• over 2.9 billion pages

• over 16 billion hyperlinks

a giant graph!

Page 4: Internet Client-Server Systems - York University

How is the Web organized?

• pages reside in servers

• related pages in sites

• local versus global links

• logical vs. physical structure

Web Server (Host): www.poly.edu

Web Server (Host): www.cnn.com

Web Server (Host): www.irs.gov

Page 5: Internet Client-Server Systems - York University

How the Web Works

Desktop(with browser)

give me the file “/world/index.html”

here is the file: “...”

Web Server

www.cnn.com

Fetching “www.cnn.com/world/index.html”
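The exchange above can be sketched as the request text a browser sends to the server. This is a minimal sketch, assuming HTTP/1.1; `build_get_request` is a hypothetical helper that only builds the request string and does not open a network connection.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the HTTP/1.1 request text for fetching `url` (hypothetical helper)."""
    # urlsplit only recognises the host part when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    path = parts.path or "/"
    # The host names the web server; the path names the file on that server.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


request = build_get_request("www.cnn.com/world/index.html")
print(request)
```

The server's reply ("here is the file") would carry a status line, headers, and the page body; that part is omitted here.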

Page 6: Internet Client-Server Systems - York University

How do we find pages on the Web?

• more than 2.9 billion pages

• more than 16 billion hyperlinks

• plus images, movies, .. , database content

we need specialized tools for finding pages and information

Page 7: Internet Client-Server Systems - York University

Overview of web search tools

• Major search engines

(google, alltheweb, altavista, northernlight, hotbot, excite, go)

• Web directories: (yahoo, open directory project)

• Specialized search engines (cora, csindex, achoo, findlaw)

• Local search engines (for one site)

• Meta search engines (beaucoup, allsearchengines, about)

• Personal search assistants (alexa, zapper)

• Comparison shopping agents (mysimon, dealtime, price)

• Image search (ditto, visoo)

• Natural language questions (askjeeves?, northernlight?)

• Database search (completeplanet, direct, invisibleweb)

Page 8: Internet Client-Server Systems - York University

Major search engines

Page 9: Internet Client-Server Systems - York University

Basic structure of a search engine:

Crawler

disks

Index

indexing

Search.com

Query: “computer”

look up

Page 10: Internet Client-Server Systems - York University

Ranking:

• return best pages first

• term- vs. link-based approaches

Page 11: Internet Client-Server Systems - York University

Link-based ranking techniques

Example #1:

• PageRank (Brin & Page/Google)

“significance of a page depends on significance of those referencing it”

• HITS (Kleinberg/IBM)

“Hubs and Authorities”
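The PageRank idea quoted above can be illustrated with a small power iteration. This is a sketch under stated assumptions, not Google's implementation: the damping factor 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page toy graph are all choices made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share, plus shares from pages linking to it.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph C is referenced by three pages and ranks highest; A ranks second only because the highly ranked C references it.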

Page 12: Internet Client-Server Systems - York University

Challenges for search engines:

• coverage (need to cover large part of the web): need to crawl and store massive data sets

• good ranking (in the case of broad queries): smart information retrieval techniques

• freshness (need to update content): frequent recrawling of content

• user load (up to 3000 queries/sec - Google): many queries on massive data

• manipulation (sites want to be listed first): naïve techniques will be exploited quickly

Page 13: Internet Client-Server Systems - York University

Web directories:

Page 14: Internet Client-Server Systems - York University

Topic hierarchy:

everything
    sports: baseball, hockey, soccer, ....
    politics: foreign, domestic, ....
    health: ....
    business: ....

Challenges:

• designing topic hierarchy

• automatic classification: “what is this page about?”

• Yahoo and Open Directory mostly human-based

Page 15: Internet Client-Server Systems - York University

Specialized search engines:

• be the best on one particular topic

• use domain-specific knowledge

• limited resources: do not crawl the entire web!

• focused crawling techniques

Meta search engines:

• uses other search engines to answer questions

• ask the right specialized search engine

• combine results from several large engines

• needs to be “familiar” with thousands of engines

Page 16: Internet Client-Server Systems - York University

Personal Search Assistants: (alexa, zapper)

• embedded into browser

• can suggest “related pages”

• search by “highlighting text”: can use context

• may exploit individual browsing behavior

• may collect and aggregate browsing information: privacy issues

• crawl the web (alexa), or use existing search engines (zapper)

Page 17: Internet Client-Server Systems - York University

Web Search Information System

Page 18: Internet Client-Server Systems - York University

Web Search Information System

[Diagram: Web Search Information System. Labels: Query and Feedback; User Interface; Crawling; Indexing; Query Processing; Document Repository; Search Engine; Knowledge Base; Learning; Inference Engine; Ad Hoc Information; Large Text (Multimedia) Database Tech.; Data (Text) Mining Tech.]

Page 19: Internet Client-Server Systems - York University

Perspective

algorithms

information

systems

information

retrieval

databases

data mining

machine learning

AI

User Interface

Page 20: Internet Client-Server Systems - York University

Search Engine Architecture:

Crawler

disks

Index

indexing

Search.com

Query: “computer”

look up

Page 21: Internet Client-Server Systems - York University

Web Crawlers

Page 22: Internet Client-Server Systems - York University

Crawler

Crawler

disks

• starts at set of “seed pages”

• fetches pages from the web

• parses fetched pages for hyperlinks

• then follows those links (e.g., BFS)

• variations:

- random walks

- focused crawling

Page 23: Internet Client-Server Systems - York University

Grab

URL DB

Seed List

Discovery

Typical Crawler Architecture

Internet

Pagefiles

Filtered Pagefiles

Index Pagefiles

Anchor Text DB

Connectivity DB

Duplicates DB

Alias DB

Index Build

Crawler

Page 24: Internet Client-Server Systems - York University

Web Crawler

◆ Retrieving Module

◆ Processing Module

◆ Formatting Module

◆ URL Listing Module

◆ The order of traversing

➢ Breadth-first

➢ Depth-first

➢ Better pages first

◆ How frequently the index is updated

World Wide Web

Database

Retrieving Module

Processing Module

Formatting Module

URL Listing Module

Mining the World Wide Web (pages 107-110)

Page 25: Internet Client-Server Systems - York University


What is a Crawler?

web

init

get next url

get page

extract urls

initial urls

to visit urls

visited urls

web pages

Page 26: Internet Client-Server Systems - York University

Simple Crawler Algorithm

Simple-Crawler (S0, D, E)
    Q ← S0
    while Q ≠ ∅
        do u ← DEQUEUE(Q)
           d(u) ← FETCH(u)
           STORE(D, (d(u), u))
           L ← PARSE(d(u))
           for each v in L
               do STORE(E, (u, v))
                  if (v ∉ D ∧ v ∉ Q)
                      then ENQUEUE(Q, v)

S0 is the seed URL. Q is the “to visit URLs” queue.

L is the set of children URLs of u. D is the “visited URLs” queue. E records the links (u, v).
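The Simple-Crawler pseudocode can be turned into a short runnable sketch. To keep it self-contained, FETCH reads from a hypothetical in-memory `TOY_WEB` dictionary instead of the network, and PARSE uses a simple regular expression; both are stand-ins chosen for illustration, and the `example` URLs are invented.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page text. A real crawler would fetch over HTTP.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>',
    "http://c.example/": '<a href="http://a.example/">a</a>',
}

HREF = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Stand-in for FETCH: return the page text, or an empty page if unknown."""
    return TOY_WEB.get(url, "")


def parse(page):
    """Stand-in for PARSE: return the URLs linked from the page."""
    return HREF.findall(page)


def simple_crawler(seeds):
    queue = deque(seeds)   # Q: URLs still to visit
    queued = set(seeds)    # every URL ever enqueued, i.e. everything in D or Q
    documents = {}         # D: url -> fetched page
    edges = []             # E: (u, v) link pairs
    while queue:
        u = queue.popleft()
        page = fetch(u)
        documents[u] = page
        for v in parse(page):
            edges.append((u, v))
            if v not in queued:   # same test as "v not in D and v not in Q"
                queued.add(v)
                queue.append(v)
    return documents, edges


documents, edges = simple_crawler(["http://a.example/"])
print(list(documents))
print(edges)
```

Tracking one `queued` set avoids scanning both D and Q for every extracted link, which matters once the queue holds millions of URLs.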

Page 27: Internet Client-Server Systems - York University

Queue

http://www.semmel.com

http://www.jhuapl.edu/

http://familysearch.com/

How Web Search Engines Work: Indexing

◆ Place seed URLs into a priority queue

◆ Repeatedly

➢ Select next URL from queue

➢ Fetch page

➢ Characterize page

➢ Store characterization in index

➢ Extract links from page

➢ Assign priority to each link

➢ Add links to queue

The

Web

Index

avocado: doc3, doc177

baby: doc3, doc42, doc117

beanie: doc42, doc77, doc193

...

0.943

0.424

Ralph’s Web Page

My favorite color is

lavender!

I collect Beanie

Babies!

See pictures of my

moss garden!

doc42

baby beanie collect color …

http://www.ty.com/

http://www.semmel.com/garden/

...

http://www.semmel.com/

Page 28: Internet Client-Server Systems - York University

doc3

doc42

doc77

doc117

doc193

...

How Web Search Engines Work: Retrieval

◆ Retrieve query from user

◆ Characterize query

◆ Use index to find documents that contain query terms

◆ Measure similarity between query and each potentially relevant document

◆ Sort documents by similarity score

◆ Return documents with highest scores to user
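The retrieval steps above can be sketched as follows. The slide does not say which similarity measure produces its scores, so this sketch swaps in a simple tf-idf sum as an assumption; the term frequencies, the `lavender` posting, and the collection size of 6 documents are hypothetical values added for illustration.

```python
import math
from collections import defaultdict

# Hypothetical index: term -> {doc_id: number of occurrences of the term in that doc}.
INDEX = {
    "avocado": {"doc3": 1, "doc177": 2},
    "baby": {"doc3": 1, "doc42": 2, "doc117": 1},
    "beanie": {"doc42": 2, "doc77": 1, "doc193": 1},
    "lavender": {"doc42": 1},
}


def search(query_terms, index, num_docs, top_k=10):
    """Score every document containing a query term and return the best ones."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms (few documents) weigh more than common ones.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score; break ties by document ID for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# The query "lavender Beanie Babies" after characterization (lowercased, stemmed).
results = search(["baby", "beanie", "lavender"], INDEX, num_docs=6)
for doc_id, score in results:
    print(f"{score:.3f} {doc_id}")
```

Only documents that appear in at least one query term's postings list are ever scored, which is why the index lookup comes before the similarity computation.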

lavender Beanie Babies

Index

avocado: doc3, doc177

baby: doc3, doc42, doc117

beanie: doc42, doc77, doc193

...

baby

beanie

lavender

.331 doc3

.924 doc42

.624 doc77

.841 doc117

.118 doc193

...

.924 doc42

.841 doc117

.624 doc77

.331 doc3

.118 doc193

...

Search Results

1. Ralph’s Web Page

2. Ty Homepage

3. Toys R Expensive

4. Caps for Freshmen

5. Bohnanza

6. Ralph’s Lavender Page

7. 404 Not Found

8. Hot Men in Tight Shorts

Page 29: Internet Client-Server Systems - York University

Crawling Issues

◆ How to crawl?

➢ Quality: “Best” pages first

➢ Efficiency: Avoid duplication (or near duplication)

◆ How much to crawl? How much to index?

➢ Coverage: How big is the Web? How much do we cover?

➢ Relative Coverage: How much do competitors have?

◆ How often to crawl?

➢ Freshness: How much has changed?

➢ How much has really changed?

◆ Visit order and the hidden web

Page 30: Internet Client-Server Systems - York University

Visit Order

◆ Breadth-first: FIFO queue

◆ Depth-first: LIFO queue

◆ Best-first: Priority queue

◆ Random

◆ Refresh rate
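The first three visit orders differ only in the data structure that holds the URLs waiting to be fetched. The sketch below shows this on a hypothetical five-page link graph; the function name `visit_order`, the graph, and the priority scores are assumptions made for illustration, and the random and refresh-rate variants are not covered.

```python
import heapq
from collections import deque


def visit_order(graph, start, strategy, priority=None):
    """Return the visit order for 'breadth', 'depth', or 'best' (needs `priority`)."""
    frontier = [(priority(start), start)] if strategy == "best" else deque([start])
    seen = {start}
    order = []
    while frontier:
        if strategy == "breadth":
            url = frontier.popleft()            # FIFO queue
        elif strategy == "depth":
            url = frontier.pop()                # LIFO queue (stack)
        else:
            url = heapq.heappop(frontier)[1]    # priority queue, lowest value first
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                if strategy == "best":
                    heapq.heappush(frontier, (priority(link), link))
                else:
                    frontier.append(link)
    return order


# Hypothetical link graph and priority scores (lower score = fetch sooner).
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
SCORES = {"A": 0, "B": 1, "C": 2, "D": 5, "E": 0}

bfs_order = visit_order(GRAPH, "A", "breadth")
dfs_order = visit_order(GRAPH, "A", "depth")
best_order = visit_order(GRAPH, "A", "best", priority=lambda url: SCORES[url])
print(bfs_order, dfs_order, best_order)
```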

Page 31: Internet Client-Server Systems - York University

Breadth First Crawlers

Page 32: Internet Client-Server Systems - York University

Breadth First Crawlers

◆ Use breadth-first search (BFS) algorithm

◆ Get all links from the starting page, and add them to a queue

◆ Pick the 1st link from the queue, get all links on the page and add to the queue

◆ Repeat above step till queue is empty

Page 33: Internet Client-Server Systems - York University

Simple Breadth-First Search Crawler

insert set of initial URLs into a queue Q
while Q is not empty
    currentURL = dequeue(Q)
    download page from currentURL
    for any hyperlink found in the page
        if hyperlink is to a new page
            enqueue hyperlink URL into Q

this will eventually download all pages reachable from the start set

Page 34: Internet Client-Server Systems - York University

Depth First Crawlers

Page 35: Internet Client-Server Systems - York University

Depth First Crawlers

◆ Use depth-first search (DFS) algorithm

◆ Get the 1st link not visited from the start page

◆ Visit link and get 1st non-visited link

◆ Repeat above step until no non-visited links remain

◆ Go to next non-visited link in the previous level and repeat 2nd step

Page 36: Internet Client-Server Systems - York University

Traversal strategies: (why BFS?)

• crawl will quickly spread all over the web

• load-balancing between servers

• in reality, more refined strategies (but still BFSish)

Tools/languages for implementation:

• Scripting languages (Python, Perl)

• Java (performance tuning tricky)

• C/C++ with sockets (low-level)

• available crawling tools (usually not scalable)

Page 37: Internet Client-Server Systems - York University

Focused Crawling

Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics.

- Topics specified by using exemplary documents (not keywords)

- Crawl most relevant links

- Ignore irrelevant parts.

- Leads to significant savings in hardware and network resources.

Page 38: Internet Client-Server Systems - York University

Web Indexer

Page 39: Internet Client-Server Systems - York University

Index Issues

◆ How to structure the index

◆ How to create the index (storage, time)

◆ How to store the index (storage, compression)

◆ How to process the index (storage, time)

◆ How to update the index (storage, time)

Page 40: Internet Client-Server Systems - York University

Inverted File Indexing

◆ Inverted file index

➢ contains a list of terms that appear in the document collection (called a lexicon or vocabulary)

➢ and for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.

Page 41: Internet Client-Server Systems - York University

Inverted File Indexing

◆ Postings file

◆ Inverted file contains

➢ Postings: for each term in the lexicon, a list of pointers to all occurrences of that term in the main text; stored in increasing document ID order

➢ Lexicon: mapping from terms to pointer list

Page 42: Internet Client-Server Systems - York University

Lexicon and Postings File

Salmon 5 PTR

<5,23> <12,95> <16,22> <21,12> <25,42>

◼Document 5: ….The extinction of Atlantic salmon is predicted if actions to preserve stocks are not taken…

Page 43: Internet Client-Server Systems - York University

Inverted files

◆ Index information, whether manual or automatic, is stored in an inverted file

◆ Doc. 1: The cat is on the mat.

◆ Doc. 2: The mat is on the floor.

Term     No. of occurrences   Postings
Cat      1                    1
Floor    1                    2
Mat      2                    1, 2

Page 44: Internet Client-Server Systems - York University

Structure of Inverted Index

◆ Document-level indexing

No. Term Documents

1 cold <2; 1,4>

2 days <2; 3,6>

◆ word-level indexing

1 cold <2;(1:6) ,(4:8)>

Document ID

position ID

Document ID

Page 45: Internet Client-Server Systems - York University

Structure of Inverted Index

◆ May be a hierarchical set of addresses, e.g.

word number within sentence number within paragraph number within chapter number within volume number within document number

◆ Consider as a vector (d,v,c,p,s,w)

Page 46: Internet Client-Server Systems - York University

Compression of Inverted Indexes

◆ Uncompressed, maybe 50–100% of size of text

◆ Compression: store differences rather than document numbers

➢ E.g. (8:3,5,20,21,23,76,77,78) → (8:3,2,15,1,2,53,1,1)

Then code differences using global (for all lists) or local (for each list) methods
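The difference (gap) transform from the example above, followed by one possible code for the gaps, can be sketched as below. The slide does not name a specific code; variable-byte coding is used here as an assumption because it is simple and needs no per-list parameters. All function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace each document number by its difference from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        chunks = [n & 0x7F]
        n >>= 7
        while n:
            chunks.append(n & 0x7F)
            n >>= 7
        out.extend(reversed(chunks[1:]))   # higher-order chunks first
        out.append(chunks[0] | 0x80)       # lowest chunk last, flagged as final
    return bytes(out)


def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


doc_ids = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(doc_ids)
encoded = vbyte_encode(gaps)
print(gaps, len(encoded), "bytes")
```

Small gaps fit in a single byte each, so long postings lists of frequent terms shrink the most.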

Page 47: Internet Client-Server Systems - York University

Indexing: (Simplified Approach)

(1) scan through all documents

(2) for every word encountered, generate entry (word, doc#, pos)

(3) sort entries by (word, doc#, pos)

(4) now transform into final form

Example documents:

doc1: “Bob reads a book”

doc2: “Alice likes Bob”

doc3: “book”

Entries generated in step (2):

(bob, 1, 1), (reads, 1, 2), (a, 1, 3)

(book, 1, 4), (alice, 2, 1), (likes, 2, 2)

(bob, 2, 3), (book, 3, 1)

Entries after sorting in step (3):

(a, 1, 3), (alice, 2, 1), (bob, 1, 1),

(bob, 2, 3), (book, 1, 4), (book, 3, 1),

(likes, 2, 2), (reads, 1, 2)

Final form after step (4):

a: (1, 3)

alice: (2, 1)

bob: (1, 1), (2, 3)

book: (1, 4), (3, 1)

likes: (2, 2)

reads: (1, 2)

1-level
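The four steps of the simplified approach map directly onto a few lines of code. This is a minimal in-memory sketch; the function name `build_index` is hypothetical, and a real indexer would sort on disk because the entries do not fit in memory.

```python
from itertools import groupby

DOCS = {1: "Bob reads a book", 2: "Alice likes Bob", 3: "book"}


def build_index(docs):
    # Steps (1) and (2): scan all documents and emit (word, doc#, pos) entries.
    entries = []
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            entries.append((word, doc_id, pos))
    # Step (3): sort entries by (word, doc#, pos).
    entries.sort()
    # Step (4): group entries by word into the final form.
    index = {}
    for word, group in groupby(entries, key=lambda entry: entry[0]):
        index[word] = [(doc_id, pos) for _, doc_id, pos in group]
    return index


index = build_index(DOCS)
for word, postings in index.items():
    print(f"{word}: {postings}")
```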

Page 48: Internet Client-Server Systems - York University

Improvements

• encode sorted runs by their gaps: significant compression for frequent words!

• less effective if we also store position (adds incompressible lower order bits)

• many highly optimized schemes have been studied (see Witten/Moffat/Bell)

Document IDs:

...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
...

Gaps:

...
arm 4, 15, 10, 69, 45, ...
armada 145, 312, 332, ...
armadillo 678, 1456, 1836, ...
armani 90, 166, 116, 139, ...
...

Page 49: Internet Client-Server Systems - York University

Additional issues:

• keep data compressed during index construction

• try to keep index in main memory? (AltaVista)

• keep important parts in memory? (fancy hits in Google)

• use database to store lists? (e.g., Berkeley DB)

Alternative to inverted index:

• signature files (Bloom filters): false positives

• bitmaps

• better to stick with inverted files (Witten/Moffat/Bell)
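To show where the false positives of signature files come from, here is a minimal Bloom-filter sketch used as a per-document signature. The class name, the 64-bit signature size, the three hash functions, and the use of SHA-256 are all assumptions for illustration; the slides do not prescribe any of them.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit signature; membership tests may give false positives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, word):
        # Derive several bit positions per word by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def might_contain(self, word):
        # True for every added word; may also be True for a word never added.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


signature = BloomFilter()
for word in "the cat is on the mat".split():
    signature.add(word)

print(signature.might_contain("cat"))    # always True: "cat" was added
print(signature.might_contain("zebra"))  # usually False, but True is possible
```

A query must therefore re-check every candidate document, which is one reason the slides recommend staying with inverted files.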

Page 50: Internet Client-Server Systems - York University

Standard Web Search Engine Architecture

crawl the web

create an inverted index

Check for duplicates, store the documents

Inverted index

Search engine servers

user query

Show results to user

DocIds

Page 51: Internet Client-Server Systems - York University

How Inverted Files Are Created

◆ Periodically rebuilt, static otherwise.

◆ Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term Doc #

now 1

is 1

the 1

time 1

for 1

all 1

good 1

men 1

to 1

come 1

to 1

the 1

aid 1

of 1

their 1

country 1

it 2

was 2

a 2

dark 2

and 2

stormy 2

night 2

in 2

the 2

country 2

manor 2

the 2

time 2

was 2

past 2

midnight 2

Page 52: Internet Client-Server Systems - York University

How Inverted Files are Created

◆ After all documents have been parsed, the inverted file is sorted alphabetically.

Term Doc #

a 2

aid 1

all 1

and 2

come 1

country 1

country 2

dark 2

for 1

good 1

in 2

is 1

it 2

manor 2

men 1

midnight 2

night 2

now 1

of 1

past 2

stormy 2

the 1

the 1

the 2

the 2

their 1

time 1

time 2

to 1

to 1

was 2

was 2


Page 53: Internet Client-Server Systems - York University

How Inverted Files are Created

◆ Multiple term entries for a single document are merged.

◆ Within-document term frequency information is compiled.

Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2

Page 54: Internet Client-Server Systems - York University

How Inverted Files are Created

◆ Finally, the file can be split into

➢ A Dictionary or Lexicon file

➢ A Postings file

Page 55: Internet Client-Server Systems - York University

How Inverted Files are Created

Dictionary/Lexicon

Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2

Postings

Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
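The whole pipeline from the last few slides (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched on the same two documents. The function names and the in-memory dictionaries are illustrative assumptions; tokenization here simply lowercases and keeps runs of letters.

```python
import re
from collections import Counter

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_file(docs):
    # Parse: one (term, doc_id) pair per token occurrence.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Sort and merge: count duplicates to get within-document frequencies.
    merged = sorted(Counter(pairs).items())   # [((term, doc_id), freq), ...]
    # Split into a dictionary file and a postings file.
    dictionary = {}   # term -> (number of docs, total frequency)
    postings = {}     # term -> [(doc_id, freq), ...]
    for (term, doc_id), freq in merged:
        n_docs, total = dictionary.get(term, (0, 0))
        dictionary[term] = (n_docs + 1, total + freq)
        postings.setdefault(term, []).append((doc_id, freq))
    return dictionary, postings


dictionary, postings = build_inverted_file(DOCS)
print(dictionary["the"], postings["the"])
```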

Page 56: Internet Client-Server Systems - York University

Implementation Based on Inverted Files

system

computer

database

science D2, 4

D5, 2

D1, 3

D7, 4

Index terms df

3

2

4

1

Dj, tfj

Index file Postings lists

• • •

Page 57: Internet Client-Server Systems - York University

Inverted Indexes

◆ Permit fast search for individual terms

◆ For each term, you get a list consisting of:

➢ document ID

➢ frequency of term in doc (optional)

➢ position of term in doc (optional)

◆ These lists can be used to solve Boolean queries:

➢ country -> d1, d2

➢ manor -> d2

➢ country AND manor -> d2

◆ Also used for statistical ranking algorithms
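Boolean queries reduce to merging sorted postings lists. The sketch below implements AND as a linear merge and OR as a set union on the `country` and `manor` lists from the example; the function names are hypothetical.

```python
def intersect(a, b):
    """AND: walk two sorted postings lists, keeping IDs present in both."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: every ID that appears in either list, in sorted order."""
    return sorted(set(a) | set(b))


POSTINGS = {"country": [1, 2], "manor": [2]}

and_result = intersect(POSTINGS["country"], POSTINGS["manor"])
or_result = union(POSTINGS["country"], POSTINGS["manor"])
print(and_result, or_result)
```

The merge in `intersect` relies on postings being stored in increasing document ID order, and runs in time proportional to the combined length of the two lists.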

Page 58: Internet Client-Server Systems - York University

Inverted Indexes for Web Search Engines

◆ Inverted indexes are still used, even though the web is so huge.

◆ Some systems partition the indexes across different machines. Each machine handles different parts of the data.

◆ Other systems duplicate the data across many machines; queries are distributed among the machines.

◆ Most do a combination of these.

Page 59: Internet Client-Server Systems - York University

Summary

Page 60: Internet Client-Server Systems - York University

Search Engines

◆ Search engines are the most popular way to locate information online

◆ About 33 million U.S. Internet users query on search engines on a typical day.

◆ More than 80% have used search engines

◆ Search Engines are measured by coverage and recency.

Page 61: Internet Client-Server Systems - York University

Search Engine Architecture

[Diagram: Search Engine Architecture. Layers: Data Acquisition; Scalable Server Components; User Interfaces. Components: Generic Crawler, BFS-Crawler, Focused Crawler; Storage Server, Index Server, Graph Server; User Interface, User Tools, Admin Interface. Source: WWW.]

Page 62: Internet Client-Server Systems - York University

Working of a Local Search Engine

[Diagram: Working of a Local Search Engine]

Indexer: gets words from Web Site Documents, stores words in Index

Search Form: sends query to Search Engine

Search Engine: looks in Index, gets matches

Search Engine: sends formatted results to Results Page

Results Page: user selects required page

Retrieved Page: user views Retrieved Page

Page 63: Internet Client-Server Systems - York University

Indexing

disks

• parse & build lexicon & build index

• index very large: I/O-efficient techniques needed

“inverted index”

indexing

aardvark 3452, 11437, ...
...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
...
zebra 602, 1189, 3209, ...

Page 64: Internet Client-Server Systems - York University

Indexing

disks

• how to build an index

- in I/O-efficient manner

- in parallel - later

- ...

• how to compress an index (while building it in situ)

• goal: intermediate size not much larger than final size

“inverted index”

indexing

aardvark 3452, 11437, ...
...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
...
zebra 602, 1189, 3209, ...

Page 65: Internet Client-Server Systems - York University

Basic concepts and choices:

• lexicon: set of all “words” encountered (millions in the case of the web)

• postings: for each word occurrence, store index of document where it occurs

• also store position in document? (probably yes)

- increases space for index significantly!

- allows efficient search for phrases

- relative positions of words may be important for ranking

• stop words: common words such as “is”, “a”, “the”

• ignore stop words? (maybe better not)

- saves space in index

- cannot search for “to be or not to be”

• stemming: “runs = run = running” (depends on language)

Page 66: Internet Client-Server Systems - York University

Stop Lists

◆ Lists of words which are dropped from processing

◆ Few words to hundreds; may include single letters

◆ E.g. Dialog: AN, AND, BY, FOR, FROM, OF, THE, TO, WITH

◆ Improve storage efficiency (may be 10 to 50% of text)

◆ Improve processing efficiency

◆ May cause problems:

➢ “to be or not to be”, “AT&T”

➢ “man of war”, “birds of prey”
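A stop list is applied as a simple filter during tokenization. The sketch below uses the Dialog list quoted above; the function name and the whitespace tokenization are assumptions for illustration.

```python
# The Dialog stop list from the slide, lowercased for matching.
DIALOG_STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}


def remove_stop_words(text, stop_words=DIALOG_STOP_WORDS):
    """Return the lowercased words of `text` that are not on the stop list."""
    return [word for word in text.lower().split() if word not in stop_words]


hamlet = remove_stop_words("to be or not to be")
ship = remove_stop_words("man of war")
print(hamlet, ship)
```

Once "of" is dropped, the phrase "man of war" can no longer be told apart from any text that merely contains "man" and "war", which is the problem the last bullets point at.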

Page 67: Internet Client-Server Systems - York University

Stemming

◆ Deals with word variation (morphological variants)

➢ E.g.: Computer, computers, computing, compute, computed, computational, computationally ➔ comput

◆ Use a stemming algorithm for conflation

➢ Set of rules applied to each word as it is processed

◆ Simplest: combine singular and plural form

◆ Examples: Porter, Lovins, Paice

Page 68: Internet Client-Server Systems - York University

Simple Stemmer

◆ If a word ends in “ies” but not “eies” or “aies”

➢ Then “ies” → “y”

◆ If a word ends in “es” but not “aes” , “ees”, or “oes”

➢ Then “es” → “e”

◆ If a word ends in “s”, but not “us” or “ss”

➢ Then “s” → NULL

◆ (apply only first applicable rule)

◆ e.g. spiders, flies, throes, bees
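The three rules translate into a short function. This sketch reads "apply only first applicable rule" as: try the rules in order and stop at the first one whose full condition, including its exceptions, holds. That reading is an interpretation, and the function name `simple_stem` is hypothetical.

```python
def simple_stem(word):
    """Apply the first applicable plural-stripping rule, or return the word unchanged."""
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


stems = {word: simple_stem(word) for word in ["spiders", "flies", "throes", "bees"]}
print(stems)
```

For the slide's examples both readings of the rule order give the same stems: "throes" and "bees" are excluded from rule 2 and lose only their final "s" under rule 3.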

Page 69: Internet Client-Server Systems - York University

Impact of Stemmers

◆ May decrease index file size up to 50%

◆ Should increase recall at cost of precision

◆ Studies are equivocal; some improvements found but not marked

◆ May depend on nature of vocabulary

Page 70: Internet Client-Server Systems - York University

Availability of Stemmers

◆ Many on Web, e.g. see

➢ http://www.tartarus.org/~martin/PorterStemmer/index.html for many encodings of Porter Stemmer

◆ Or http://sourceforge.net/projects/stemmers/ for encodings of the Lovins Stemmer

Page 71: Internet Client-Server Systems - York University

Querying

Boolean queries:

(zebra AND armadillo) OR armani

unions/intersections of lists

look up

aardvark 3452, 11437, ...
...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
...
zebra 602, 1189, 3209, ...

Page 72: Internet Client-Server Systems - York University

Querying and term-based ranking:

Recall Boolean queries:

(zebra AND armadillo) OR armani

unions/intersections of lists

look up

aardvark 3452, 11437, ...
...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
...
zebra 602, 1189, 3209, ...