Internet Client-Server Systems - York University
Web Crawlers
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
• pages containing (fairly unstructured) text
• images, audio, etc. embedded in pages
• structure defined using HTML
(Hypertext Markup Language)
• hyperlinks between pages!
• over 2.9 billion pages
• over 16 billion hyperlinks
a giant graph!
What is the Web? (another view)
• pages reside in servers
• related pages in sites
• local versus global links
• logical vs. physical structure
How is the Web organized?
[Diagram: three web servers (hosts): www.poly.edu, www.cnn.com, www.irs.gov]
How the Web Works
[Diagram: the desktop (with browser) asks the web server www.cnn.com: “give me the file ‘/world/index.html’”; the server replies: “here is the file: …”]
Fetching “www.cnn.com/world/index.html”
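The request/response exchange above can be sketched with Python's standard library. The URL is the slide's example; actually calling `fetch` requires network access, so it is left commented out.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

def split_request(url: str):
    """Split a URL into the host the browser connects to and the path it requests."""
    parts = urlsplit(url)
    return parts.netloc, parts.path or "/"

def fetch(url: str) -> str:
    """Issue the HTTP GET ("give me the file ...") and return the body ("here is the file: ...")."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

host, path = split_request("http://www.cnn.com/world/index.html")
# host == "www.cnn.com", path == "/world/index.html"
# html = fetch("http://www.cnn.com/world/index.html")   # needs network access
```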
• more than 2.9 billion pages
• more than 16 billion hyperlinks
• plus images, movies, .. , database content
How do we find pages on the Web?
we need specialized tools for finding
pages and information
Overview of web search tools
• Major search engines
(google, alltheweb, altavista, northernlight, hotbot, excite, go)
• Web directories: (yahoo, open directory project)
• Specialized search engines (cora, csindex, achoo, findlaw)
• Local search engines (for one site)
• Meta search engines (beaucoup, allsearchengines, about)
• Personal search assistants (alexa, zapper)
• Comparison shopping agents (mysimon, dealtime, price)
• Image search (ditto, visoo)
• Natural language questions (askjeeves?, northernlight?)
• Database search (completeplanet, direct, invisibleweb)
Major search engines
Basic structure of a search engine:
[Diagram: Crawler → disks → indexing → Index; a query (“computer”) from Search.com is answered by a look-up in the Index]
Link-based ranking techniques (Example #1):
• PageRank (Brin & Page / Google): “the significance of a page depends on the significance of those referencing it”
• HITS (Kleinberg / IBM): “Hubs and Authorities”
Challenges for search engines:
• coverage (need to cover a large part of the web): must crawl and store massive data sets
• good ranking (especially for broad queries): needs smart information retrieval techniques
• freshness (need to update content): requires frequent recrawling of content
• user load (up to 3,000 queries/sec at Google): many queries over massive data
• manipulation (sites want to be listed first): naïve techniques will be exploited quickly
Web directories:
Challenges:
• designing the topic hierarchy
• automatic classification: “what is this page about?”
• Yahoo and Open Directory are mostly human-edited
Topic hierarchy (example): everything → sports, politics, health, business, …; sports → baseball, hockey, soccer, …; politics → foreign, domestic, …
Specialized search engines:
• be the best on one particular topic
• use domain-specific knowledge
• limited resources: do not crawl the entire web!
• focused crawling techniques
Meta search engines:
• use other search engines to answer questions
• ask the right specialized search engine
• combine results from several large engines
• need to be “familiar” with thousands of engines
Personal Search Assistants: (alexa, zapper)
• embedded into browser
• can suggest “related pages”
• search by “highlighting text” can use the surrounding context
• may exploit individual browsing behavior
• may collect and aggregate browsing information (privacy issues)
• crawl the web themselves (alexa), or use existing search engines (zapper)
Web Search Information System
[Diagram: the user interface handles queries and feedback; the search engine performs crawling, indexing, and query processing over a document repository; a knowledge base with learning and an inference engine supplies ad hoc information; built on large text (multimedia) database technology and data (text) mining technology]
Perspective
[Diagram: web search draws on algorithms, information systems, information retrieval, databases, data mining, machine learning, AI, and user interface design]
Search Engine Architecture:
[Diagram: Crawler → disks → indexing → Index; a query (“computer”) from Search.com is answered by a look-up in the Index]
Web Crawlers
Crawler
[Diagram: the crawler writes fetched pages to disks]
• starts at set of “seed pages”
• fetches pages from the web
• parses fetched pages for hyperlinks
• then follows those links (e.g., BFS)
• variations:
- random walks
- focused crawling
Typical Crawler Architecture
[Diagram: a Seed List initializes the URL DB; the crawler grabs pages from the Internet into page files; discovered URLs feed back into the URL DB; page files are filtered and indexed (Index Build), with auxiliary databases for anchor text, connectivity, duplicates, and aliases]
Web Crawler
◆ Retrieving Module
◆ Processing Module
◆ Formatting Module
◆ URL Listing Module
◆ The order of traversing
➢ Breadth-first
➢ Depth-first
➢ Better pages first
◆ How frequently the index is updated
[Diagram: the World Wide Web feeds the Retrieving Module; pages pass through the Processing Module and the Formatting Module into the Database; the URL Listing Module maintains the URLs to visit]
Mining the World Wide Web (pages 107-110)
What is a Crawler?
[Diagram: init loads the initial URLs into the “to visit” URLs list; the crawler repeatedly gets the next URL, gets the page from the web, and extracts URLs from it, recording visited URLs and web pages]
Simple Crawler Algorithm
Simple-Crawler(S0, D, E)
1   Q ← S0
2   while Q ≠ ∅
3       do u ← DEQUEUE(Q)
4          d(u) ← FETCH(u)
5          STORE(D, (d(u), u))
6          L ← PARSE(d(u))
7          for each v in L
8              do STORE(E, (u, v))
9                 if (v ∉ D and v ∉ Q)
10                    then ENQUEUE(Q, v)
S0 is the set of seed URLs. Q is the “to visit URLs” queue. L is the set of URLs linked from u. D stores the visited URLs with their documents; E stores the hyperlink edges.
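The pseudocode translates almost line for line into Python. In this runnable sketch, FETCH and PARSE are passed in as functions and exercised on a hypothetical three-page toy web (the page names are illustrative, not from the slides):

```python
from collections import deque

def simple_crawler(seed_urls, fetch, parse):
    """Transcription of Simple-Crawler(S0, D, E): D maps each visited URL
    to its document; E collects the hyperlink edges (u, v)."""
    D, E = {}, []
    Q = deque(seed_urls)
    seen = set(seed_urls)          # cheap test for "v in D or v in Q"
    while Q:
        u = Q.popleft()            # u <- DEQUEUE(Q)
        d = fetch(u)               # d(u) <- FETCH(u)
        D[u] = d                   # STORE(D, (d(u), u))
        for v in parse(d):         # L <- PARSE(d(u))
            E.append((u, v))       # STORE(E, (u, v))
            if v not in seen:
                seen.add(v)
                Q.append(v)        # ENQUEUE(Q, v)
    return D, E

# Toy web standing in for FETCH/PARSE: each "document" is its link list.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": []}
D, E = simple_crawler(["a"], fetch=lambda u: toy_web[u], parse=lambda d: d)
# D covers {"a", "b", "c"}; E == [("a", "b"), ("a", "c"), ("b", "c")]
```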
Queue: http://www.semmel.com, http://www.jhuapl.edu/, http://familysearch.com/
How Web Search Engines Work: Indexing
◆ Place seed URLs into a priority queue
◆ Repeatedly
➢ Select next URL from queue
➢ Fetch page
➢ Characterize page
➢ Store characterization in index
➢ Extract links from page
➢ Assign priority to each link
➢ Add links to queue
[Diagram: the indexer characterizes each page fetched from the Web and stores the characterization in the index. Example index entries:
avocado → doc3, doc177
baby → doc3, doc42, doc117
beanie → doc42, doc77, doc193
…
Example page (doc42), “Ralph’s Web Page”: “My favorite color is lavender! I collect Beanie Babies! See pictures of my moss garden!” It is characterized by the terms baby, beanie, collect, color, …; its extracted links (http://www.ty.com/, http://www.semmel.com/garden/, …, from http://www.semmel.com/) are assigned priorities (e.g., 0.943, 0.424) and added to the queue alongside doc3, doc42, doc77, doc117, doc193, …]
How Web Search Engines Work: Retrieval
◆ Retrieve query from user
◆ Characterize query
◆ Use index to find documents that contain query terms
◆ Measure similarity between query and each potentially relevant document
◆ Sort documents by similarity score
◆ Return documents with highest scores to user
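One simple way to realize the “measure similarity” step is cosine similarity over raw term-frequency vectors; real engines add IDF weighting, link analysis, and much more, so this is only a minimal sketch, and the document names and texts below are illustrative:

```python
import math
from collections import Counter

def similarity(query: str, doc: str) -> float:
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Hypothetical mini-collection; sort documents by similarity score.
docs = {"doc42": "lavender beanie babies pictures",
        "doc3": "avocado baby recipes"}
ranked = sorted(docs, key=lambda i: similarity("lavender beanie babies", docs[i]),
                reverse=True)
# ranked[0] == "doc42"
```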
[Diagram: the query “lavender Beanie Babies” is characterized by the terms baby, beanie, lavender and looked up in the index (avocado → doc3, doc177; baby → doc3, doc42, doc117; beanie → doc42, doc77, doc193; …). Similarity scores (doc3: .331, doc42: .924, doc77: .624, doc117: .841, doc193: .118, …) are sorted to give the ranking doc42 (.924), doc117 (.841), doc77 (.624), doc3 (.331), doc193 (.118), …]
Search Results
1. Ralph’s Web Page
2. Ty Homepage
3. Toys R Expensive
4. Caps for Freshmen
5. Bohnanza
6. Ralph’s Lavender Page
7. 404 Not Found
8. Hot Men in Tight Shorts
Crawling Issues
◆ How to crawl?
➢ Quality: “best” pages first
➢ Efficiency: avoid duplication (or near-duplication)
◆ How much to crawl? How much to index?
➢ Coverage: how big is the Web? How much do we cover?
➢ Relative coverage: how much do competitors have?
◆ How often to crawl?
➢ Freshness: how much has changed?
➢ How much has really changed?
◆ Visit order and the hidden web
Visit Order
◆ Breadth-first: FIFO queue
◆ Depth-first: LIFO queue
◆ Best-first: Priority queue
◆ Random
◆ Refresh rate
Breadth First Crawlers
◆ Use breadth-first search (BFS) algorithm
◆ Get all links from the starting page, and
add them to a queue
◆ Pick the 1st link from the queue, get all
links on the page and add to the queue
◆ Repeat above step till queue is empty
Simple Breadth-First Search Crawler
this will eventually download all pages reachable from the start set
insert set of initial URLs into a queue Q
while Q is not empty:
    currentURL = dequeue(Q)
    download page from currentURL
    for each hyperlink found in the page:
        if hyperlink is to a new page:
            enqueue hyperlink URL into Q
Depth First Crawlers
◆ Use depth-first search (DFS) algorithm
◆ Get the 1st non-visited link from the start page
◆ Visit the link and get its 1st non-visited link
◆ Repeat the above step until there are no non-visited links
◆ Go back to the next non-visited link at the previous level and repeat from the 2nd step
Traversal strategies: (why BFS?)
• crawl will quickly spread all over the web
• load-balancing between servers
• in reality, more refined strategies (but still BFSish)
Tools/languages for implementation:
• scripting languages (Python, Perl)
• Java (performance tuning tricky)
• C/C++ with sockets (low-level)
• available crawling tools (usually not scalable)
Focused Crawling
Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics.
- Topics specified by using exemplary documents (not keywords)
- Crawl most relevant links
- Ignore irrelevant parts.
- Leads to significant savings in hardware and network resources.
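A focused crawler can be sketched as best-first search: the frontier is a priority queue ordered by a relevance score, so the most promising links are expanded first and irrelevant regions are never fetched. Here `relevance(url)` stands in for a topic classifier trained on the exemplary documents; the page names, scores, and link structure below are all hypothetical:

```python
import heapq

def focused_crawl(seeds, fetch, parse, relevance, budget=10):
    """Best-first crawl: pop the highest-relevance URL, fetch it, and
    push its out-links scored by relevance. Zero-relevance URLs are skipped."""
    frontier = [(-relevance(u), u) for u in seeds]   # negate: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        neg_score, u = heapq.heappop(frontier)
        if -neg_score <= 0.0:          # ignore irrelevant parts of the graph
            continue
        visited.append(u)
        for v in parse(fetch(u)):
            if v not in seen:
                seen.add(v)
                heapq.heappush(frontier, (-relevance(v), v))
    return visited

# Hypothetical topic scores and toy link structure:
scores = {"sports": 0.9, "hockey": 0.8, "weather": 0.1, "spam": 0.0}
links = {"sports": ["hockey", "spam"], "hockey": ["weather"],
         "weather": [], "spam": []}
order = focused_crawl(["sports"], lambda u: links[u], lambda d: d,
                      lambda u: scores[u])
# order == ["sports", "hockey", "weather"]  -- "spam" is never visited
```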
Web Indexer
Index Issues
◆ How to structure the index
◆ How to create the index (storage, time)
◆ How to store the index (storage, compression)
◆ How to process the index (storage, time)
◆ How to update the index (storage, time)
Inverted File Indexing
◆ Inverted file index
➢ contains a list of terms that appear in the document
collection (called a lexicon or vocabulary)
➢ and for each term in the lexicon, stores a list of
pointers to all occurrences of that term in the
document collection. This list is called an inverted list.
Inverted File Indexing
◆ Inverted file contains:
➢ Postings file: for each term in the lexicon, a list of pointers to all occurrences of that term in the main text, stored in increasing document ID
➢ Lexicon: mapping from terms to pointer lists
Lexicon and Postings File
Lexicon entry: Salmon, frequency 5, PTR → postings <5,23> <12,95> <16,22> <21,12> <25,42>
◼ Document 5: “…The extinction of Atlantic salmon is predicted if actions to preserve stocks are not taken…”
Inverted files
◆ Index information, whether manual or automatic, is stored in an inverted file
◆ Doc. 1: The cat is on the mat
◆ Doc. 2: The mat is on the floor.
Term | no. of occurrences | postings
cat | 1 | 1
floor | 1 | 2
mat | 2 | 1, 2
Structure of Inverted Index
◆ Document-level indexing
No. | Term | Documents
1 | cold | <2; 1,4>
2 | days | <2; 3,6>
(format: <document frequency; list of document IDs>)
◆ Word-level indexing
1 | cold | <2; (1:6), (4:8)>
(format: <document frequency; (document ID : position ID)>)
Structure of Inverted Index
◆ May be a hierarchical set of addresses, e.g.
word number within sentence number within paragraph number within chapter number within volume number within document number
◆ Consider as a vector (d,v,c,p,s,w)
Compression of Inverted Indexes
◆ Uncompressed, maybe 50 to 100% of the size of the text
◆ Compression: store differences rather than document numbers
➢ E.g. (8: 3, 5, 20, 21, 23, 76, 77, 78) → (8: 3, 2, 15, 1, 2, 53, 1, 1)
◆ Then code the differences using global (for all lists) or local (for each list) methods
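The gap transformation above is easy to state in code; decoding is just a running sum. The asserts reproduce the slide's example list:

```python
def gaps(postings):
    """Store differences rather than absolute document numbers."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def ungaps(deltas):
    """Recover the original postings list by a running sum of the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# The example list for a term appearing in 8 documents:
assert gaps([3, 5, 20, 21, 23, 76, 77, 78]) == [3, 2, 15, 1, 2, 53, 1, 1]
assert ungaps([3, 2, 15, 1, 2, 53, 1, 1]) == [3, 5, 20, 21, 23, 76, 77, 78]
```

The small gap values can then be coded in fewer bits than full document numbers, which is where the compression comes from.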
Indexing: (Simplified Approach)
(1) scan through all documents
(2) for every word encountered, generate an entry (word, doc#, pos)
(3) sort entries by (word, doc#, pos)
(4) now transform into final form
doc1: “Bob reads a book”
doc2: “Alice likes Bob”
doc3: “book”
Entries: (bob, 1, 1), (reads, 1, 2), (a, 1, 3), (book, 1, 4), (alice, 2, 1), (likes, 2, 2), (bob, 2, 3), (book, 3, 1)
Sorted: (a, 1, 3), (alice, 2, 1), (bob, 1, 1), (bob, 2, 3), (book, 1, 4), (book, 3, 1), (likes, 2, 2), (reads, 1, 2)
a: (1,3)
Alice: (2, 1)
Bob: (1, 1), (2, 3)
book: (1, 4), (3, 1)
likes: (2, 2)
reads: (1, 2)
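The four steps can be run directly on the slide's three example documents:

```python
def build_index(docs):
    """Simplified indexing: scan documents, emit (word, doc#, pos) entries,
    sort them, then group into word -> [(doc#, pos), ...]."""
    entries = []
    for doc_id, text in docs.items():                    # (1) scan all documents
        for pos, word in enumerate(text.lower().split(), 1):
            entries.append((word, doc_id, pos))          # (2) generate entries
    entries.sort()                                       # (3) sort by (word, doc#, pos)
    index = {}
    for word, doc_id, pos in entries:                    # (4) final form
        index.setdefault(word, []).append((doc_id, pos))
    return index

docs = {1: "Bob reads a book", 2: "Alice likes Bob", 3: "book"}
index = build_index(docs)
# index["bob"] == [(1, 1), (2, 3)]; index["book"] == [(1, 4), (3, 1)]
```

For web-scale collections the same steps need I/O-efficient external sorting, as the later slides note; this in-memory version only illustrates the logic.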
Improvements
• encode sorted runs by their gaps: significant compression for frequent words!
• less effective if we also store positions (adds incompressible lower-order bits)
• many highly optimized schemes have been studied (see Witten/Moffat/Bell)
Document numbers:
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
Gap-encoded:
arm       4, 15, 10, 69, 45, ...
armada    145, 312, 332, ...
armadillo 678, 1456, 1836, ...
armani    90, 166, 116, 139, ...
Additional issues:
• keep data compressed during index construction
• try to keep the index in main memory? (AltaVista)
• keep important parts in memory? (fancy hits in Google)
• use a database to store lists? (e.g., Berkeley DB)
Alternative to inverted index:
• signature files (Bloom filters): false positives
• bitmaps
• better to stick with inverted files (Witten/Moffat/Bell)
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → create an inverted index; search engine servers evaluate the user query against the inverted index, obtain DocIds, and show results to the user]
How Inverted Files Are Created
◆ Periodically rebuilt, static otherwise.
◆ Documents are parsed to extract tokens. These are saved with the Document ID.
Now is the time
for all good men
to come to the aid
of their country
Doc 1
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Doc 2
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
How Inverted Files are Created
◆ After all documents have been parsed, the inverted file is sorted alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
Term Doc # (original unsorted order, for comparison)
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
◆ Multiple term entries for a single document are merged.
◆ Within-document term frequency information is compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Term Doc # (sorted input, before merging)
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
How Inverted Files are Created
◆ Finally, the file can be split into
➢ A Dictionary or Lexicon file
➢ A Postings file
How Inverted Files are Created
Dictionary/Lexicon:
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
Postings
Implementation Based on Inverted Files
[Diagram: the index file stores index terms with their document frequency (df): system (3), computer (2), database (4), science (1); each term points to a postings list of (Dj, tfj) entries, e.g. (D2, 4), (D5, 2), (D1, 3), (D7, 4)]
Inverted Indexes
◆ Permit fast search for individual terms
◆ For each term, you get a list consisting of:
➢ document ID
➢ frequency of term in doc (optional)
➢ position of term in doc (optional)
◆ These lists can be used to solve Boolean queries:
➢ country -> d1, d2
➢ manor -> d2
➢ country AND manor -> d2
◆ Also used for statistical ranking algorithms
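Boolean operators reduce to set operations on the document-ID lists; this small sketch uses Python sets, and the index contents match the example above:

```python
def AND(p1, p2):
    """Intersect two postings lists: documents containing both terms."""
    return sorted(set(p1) & set(p2))

def OR(p1, p2):
    """Union of two postings lists: documents containing either term."""
    return sorted(set(p1) | set(p2))

index = {"country": ["d1", "d2"], "manor": ["d2"]}
assert AND(index["country"], index["manor"]) == ["d2"]
assert OR(index["country"], index["manor"]) == ["d1", "d2"]
```

Production systems instead merge the sorted lists directly (a linear two-pointer scan), which avoids materializing sets for very long postings lists.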
Inverted Indexes for Web Search Engines
◆ Inverted indexes are still used, even though the web is so huge.
◆ Some systems partition the indexes across different machines. Each machine handles different parts of the data.
◆ Other systems duplicate the data across many machines; queries are distributed among the machines.
◆ Most do a combination of these.
Summary
Search Engines
◆ Search engines are the most popular way to locate information online
◆ About 33 million U.S. Internet users query search engines on a typical day
◆ More than 80% of Internet users have used search engines
◆ Search engines are measured by coverage and recency
Search Engine Architecture
[Diagram: data acquisition by a generic crawler, a BFS crawler, and a focused crawler brings WWW content into scalable server components (storage server, index server, graph server); the user-facing side comprises the user interface, user tools, and an admin interface]
Working of a Local Search Engine
[Diagram: the Indexer gets words from the web site documents and stores them in the Index. The user sends a query through the Search Form; the Search Engine looks in the Index, gets matches, and sends formatted results to the Results Page; the user selects the required page and views the retrieved page]
Indexing
disks
• parse & build lexicon & build index
• index very large
I/O-efficient techniques needed
“inverted index”
indexing
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...
Indexing
disks
• how to build an index
- in I/O-efficient manner
- in parallel - later
- …...
• how to compress an index (while building it in situ)
• goal: intermediate size not much larger than final size
“inverted index”
indexing
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...
Basic concepts and choices:
• lexicon: set of all “words” encountered (millions in the case of the web)
• postings: for each word occurrence, store the index of the document where it occurs
• also store the position in the document? (probably yes)
  - increases space for the index significantly!
  - allows efficient search for phrases
  - relative positions of words may be important for ranking
• stop words: common words such as “is”, “a”, “the”
• ignore stop words? (maybe better not)
  - saves space in the index
  - cannot search for “to be or not to be”
• stemming: “runs = run = running” (depends on the language)
Stop Lists
◆ Lists of words which are dropped from processing
◆ From a few words to hundreds; may include single letters
◆ E.g. Dialog: AN, AND, BY, FOR, FROM, OF, THE, TO, WITH
◆ Improve storage efficiency (stop words may be 10 to 50% of the text)
◆ Improve processing efficiency
◆ May cause problems:
➢ “to be or not to be”, “AT&T”
➢ “man of war”, “birds of prey”
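A sketch of stop-word removal using the Dialog list quoted above (lower-cased). The problem cases show what gets lost: “man of war” loses its “of”, and “to be or not to be” is badly degraded:

```python
# Stop list taken from the Dialog example above.
STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}

def remove_stop_words(text: str):
    """Drop stop words from a text, returning the remaining tokens."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

assert remove_stop_words("man of war") == ["man", "war"]
# Much of the famous phrase disappears before it ever reaches the index:
assert remove_stop_words("to be or not to be") == ["be", "or", "not", "be"]
```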
Stemming
◆ Deals with word variation (morphological variants)
➢ E.g.: computer, computers, computing, compute, computed, computational, computationally ➔ comput
◆ Use a stemming algorithm for conflation
➢ Set of rules applied to each word as it is processed
◆ Simplest: combine singular and plural form
◆ Examples: Porter, Lovins, Paice
Simple Stemmer
◆ If a word ends in “ies” but not “eies” or “aies”
➢ then “ies” → “y”
◆ If a word ends in “es” but not “aes”, “ees”, or “oes”
➢ then “es” → “e”
◆ If a word ends in “s” but not “us” or “ss”
➢ then “s” → NULL
◆ (apply only first applicable rule)
◆ e.g. spiders, flies, throes, bees
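The three rules, applying only the first that matches, reproduce the example words:

```python
def simple_stem(word: str) -> str:
    """Simple stemmer: apply only the first applicable of the three rules."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # "ies" -> "y":   flies -> fly
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # "s" -> NULL:    spiders -> spider
    return word

assert [simple_stem(w) for w in ["spiders", "flies", "throes", "bees"]] == \
       ["spider", "fly", "throe", "bee"]
```

Note how “throes” and “bees” fall through the “es” rule (blocked by the “oes”/“ees” exceptions) and are handled by the plain “s” rule instead.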
Impact of Stemmers
◆ May decrease index file size up to 50%
◆ Should increase recall at cost of precision
◆ Studies are equivocal; some improvements
found but not marked
◆ May depend on nature of vocabulary
Availability of Stemmers
◆ Many on Web, e.g. see
➢ http://www.tartarus.org/~martin/PorterStemmer/index.html
for many encodings of Porter Stemmer
◆ Or http://sourceforge.net/projects/stemmers/
for encodings of the Lovins Stemmer
Querying
Boolean queries:
(zebra AND armadillo) OR armani
unions/intersections of lists
look up
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...
Querying and term-based ranking:
Recall Boolean queries:
(zebra AND armadillo) OR armani
unions/intersections of lists
look up
aardvark  3452, 11437, ...
...
arm       4, 19, 29, 98, 143, ...
armada    145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani    90, 256, 372, 511, ...
...
zebra     602, 1189, 3209, ...