CS 347 Notes10 1
CS 347 Parallel and Distributed Data Processing
Distributed Information Retrieval
Hector Garcia-Molina, Zoltan Gyongyi
Web Search Engine
• Crawling
• Indexing
• Computing ranking features
• Serving queries
Crawling
• Fetch content of web pages
[Figure: crawler loop. Steps: init, get next URL, get page, extract URLs. Data: web, seed URLs, URLs to visit, visited URLs, web pages.]
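The crawl loop named on this slide (init, get next URL, get page, extract URLs) can be sketched as follows. This is a minimal single-agent sketch, not the course's implementation: `crawl`, `fetch`, `extract_urls`, `max_pages`, and the in-memory `TOY_WEB` are hypothetical names, and the toy dictionary stands in for real HTTP fetches.

```python
from collections import deque


def crawl(seed_urls, fetch, extract_urls, max_pages=100):
    """Single-agent crawl loop: init, get next URL, get page, extract URLs."""
    to_visit = deque(seed_urls)   # "URLs to visit", initialized from the seed URLs
    visited = set()               # "visited URLs"
    pages = {}                    # "web pages" store: URL -> content
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()              # get next URL
        if url in visited:
            continue
        visited.add(url)
        content = fetch(url)                  # get page (None if unavailable)
        if content is None:
            continue
        pages[url] = content
        for link in extract_urls(content):    # extract URLs
            if link not in visited:
                to_visit.append(link)
    return pages


# Hypothetical toy "web": each URL maps directly to its list of out-links.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
pages = crawl(["a"], fetch=TOY_WEB.get, extract_urls=lambda links: links)
print(sorted(pages))  # ['a', 'b', 'c', 'd']
```

The FIFO queue gives a breadth-first crawl; a priority queue keyed on page importance would be one way to address the scope and freshness issues listed next.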
Issues
• Scope and freshness
  – Not enough space/time to crawl “all” pages
  – Page importance, quality, and update frequency
  – Site mirrors and (near) duplicate pages
  – Dynamic content and crawler traps
• Load at visited web sites
  – Rules in robots.txt
  – Limit number of visits per day
  – Limit depth of crawl
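Two of the load rules above (robots.txt and a depth limit) can be checked before every fetch. A minimal sketch, assuming Python's standard `urllib.robotparser`; the robots.txt content, the agent name `example-crawler`, and the `max_depth` default are made up for illustration, and a real crawler would download robots.txt from each site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from the site itself.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, depth, max_depth=3, agent="example-crawler"):
    """Fetch only if the crawl depth is within bounds and robots.txt permits it."""
    return depth <= max_depth and parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html", depth=1))      # True
print(allowed("http://example.com/private/x.html", depth=1))  # False
print(allowed("http://example.com/index.html", depth=5))      # False
```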
Issues
• Load at crawler
  – Variance of fetch latency/bandwidth
  – Parallelization and scalability
      Multiple agents
      Partitioning URL lists
      Communication between agents
      Recovering from agent failure
Crawl Partitioning
• Requirements
  – Each URL assigned to a single agent
  – Locally computable URL-to-agent mapping
  – Balanced distribution of URLs across agents
  – Contravariance
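Contravariance
• When an agent is added, the set of URLs assigned to each existing agent may only shrink: URLs move to the new agent, never between existing agents
[Figure: adding Agent C to a crawl partitioned over Agents A and B.
  Before: Agent A = {url1, url3, url5}; Agent B = {url2, url4, url6}
  Repartitioning that violates contravariance: Agent A = {url1, url2}; Agent B = {url3, url4}; Agent C = {url5, url6}
  Repartitioning that satisfies contravariance: Agent A = {url1, url3}; Agent B = {url2, url4}; Agent C = {url5, url6}]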
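The two repartitionings in the figure can be compared mechanically. A small sketch, assuming contravariance means that each existing agent keeps only a subset of its previous URLs when an agent is added; `is_contravariant` is a hypothetical helper, and the dictionaries copy the URL sets from the slide.

```python
def is_contravariant(before, after):
    """True if every agent that existed before keeps only a subset of its old URLs."""
    return all(after.get(agent, set()) <= urls for agent, urls in before.items())


before = {"A": {"url1", "url3", "url5"}, "B": {"url2", "url4", "url6"}}
shuffled = {"A": {"url1", "url2"}, "B": {"url3", "url4"}, "C": {"url5", "url6"}}
stable = {"A": {"url1", "url3"}, "B": {"url2", "url4"}, "C": {"url5", "url6"}}

print(is_contravariant(before, shuffled))  # False: url2 and url3 changed hands
print(is_contravariant(before, stable))    # True: only url5 and url6 moved, both to C
```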
Assignment
• Consistent hashing
  – Hash function: URL → agent
  – Each agent “replicated” k times
  – Each replica mapped randomly on unit circle
      Mapping persistent across agent restarts
  – Lookup: map URL on unit circle; find closest live replica
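The consistent hashing scheme above can be sketched in a few lines. This is an illustration under stated assumptions, not the course's code: `point` and `Ring` are hypothetical names, SHA-256 stands in for the random placement (it makes replica positions reproducible across restarts), and "closest" is taken to mean the first replica clockwise from the URL, which is one common variant.

```python
import bisect
import hashlib


def point(key):
    """Map a string to a reproducible, pseudo-random point in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Ring:
    def __init__(self, agents, k=100):
        # Each agent gets k replicas. Positions depend only on the agent name
        # and replica number, so they persist across agent restarts.
        self.replicas = sorted((point(f"{a}#{i}"), a) for a in agents for i in range(k))

    def lookup(self, url, live=None):
        """Return the live agent owning the first replica clockwise from the URL."""
        start = bisect.bisect_left(self.replicas, (point(url), ""))
        for step in range(len(self.replicas)):
            _, agent = self.replicas[(start + step) % len(self.replicas)]
            if live is None or agent in live:
                return agent
        raise LookupError("no live agent")


urls = [f"http://example.com/page{i}" for i in range(1000)]
two, three = Ring(["A", "B"]), Ring(["A", "B", "C"])
moved = [u for u in urls if two.lookup(u) != three.lookup(u)]
print(all(three.lookup(u) == "C" for u in moved))  # True: URLs only move to the new agent
```

Because the replicas of A and B sit at the same positions in both rings, a URL either keeps its agent or falls to a replica of C, which is the contravariance property. Using many replicas per agent is what evens out the share of the circle each agent covers.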
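Assignment
[Figure: two unit circles. Left: replicas of agents A and B and the position of url6. Right: the same circle with replicas of agent C added.]
• Balancing
• Contravariance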
Crawl Partitioning
• Ideas
  – URL normalization
      E.g., relative to absolute URL
  – Host-based partitioning
      Reduces communication between agents
      Small vs. large hosts
  – Geographic distribution
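The first two ideas can be illustrated with Python's standard `urllib.parse`. This is a sketch under assumptions: `normalize` and `agent_for` are hypothetical helpers, the normalization rules shown (resolve relative links, lowercase the host, drop the fragment) are only a small subset of what a production crawler applies, and the modulo hash is used for brevity even though it does not give contravariance.

```python
import hashlib
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(base_url, link):
    """Resolve a link against its page URL, lowercase the host, drop the fragment."""
    parts = urlsplit(urljoin(base_url, link))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, ""))


def agent_for(url, num_agents):
    """Host-based partitioning: every URL on one host maps to the same agent."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_agents


page = "http://Example.com/a/b.html"
print(normalize(page, "../c.html#top"))  # http://example.com/c.html
print(agent_for("http://example.com/x", 4) == agent_for("http://example.com/y/z", 4))  # True
```

Since most links point within the same host, keying the assignment on the host keeps most newly extracted URLs on the agent that found them.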
Fault Tolerance
• Repartitioning
• Permanent failure
  – Recovering list of URLs to visit
      Checkpoints
      Communication logs
• Transient failure
  – Avoiding re-visiting URLs
      Before fetch, check with near neighbor agents
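Indexing
• Build term-document index
[Figure: term-document matrix. Rows: terms t1 ... tm (the lexicon). Columns: documents d1 ... dn (the collection). A dot marks a term that occurs in a document; the row of dots for t1 is labeled "Posting for t1".]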
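The term-document index in the figure can be built as an inverted index, one posting list per term. A minimal sketch, assuming whitespace tokenization and a three-document toy collection; `build_index` and the sample texts are illustrative only.

```python
from collections import defaultdict


def build_index(collection):
    """Map each term to its posting list: the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


collection = {
    "d1": "parallel data processing",
    "d2": "distributed data",
    "d3": "parallel crawling",
}
index = build_index(collection)
print(index["parallel"])  # ['d1', 'd3']
print(index["data"])      # ['d1', 'd2']
print(sorted(index))      # the lexicon
```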
Architecture
[Figure: indexing pipeline. Components: web pages, distributors, indexers, query servers. Data: intermediate runs, inverted index files. Phases: Map, Reduce.]
Issues
• Index partitioning
  – Efficient query processing
      Query routing
      Result retrieval
Document Partitioning
[Figure: the term-document matrix (terms t1 ... tm, documents d1 ... dn) split by columns. Each partition holds all terms t1 ... tm for a subset of the documents: {d1, d2, d3}, {d4, d5, d6}, ..., {dn-2, dn-1, dn}.]
Document Partitioning
• Split the collection of documents
• Advantages
  – Easy to add new documents
  – Load balanced
  – High processing throughput
• Disadvantages
  – Communication with all query servers
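Document partitioning can be sketched as one small inverted index per server. This is an illustration, not the lecture's implementation: `split_documents`, `search`, the round-robin assignment, and the AND-query semantics are all assumptions.

```python
from collections import defaultdict


def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def split_documents(docs, num_servers):
    """Round-robin the documents across servers; each server indexes only its share."""
    shares = [{} for _ in range(num_servers)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        shares[i % num_servers][doc_id] = text
    return [build_index(share) for share in shares]


def search(servers, terms):
    """AND query: every server is contacted and answers for its own documents."""
    if not terms:
        return []
    hits = set()
    for index in servers:
        hits |= set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


docs = {"d1": "web search", "d2": "web crawl", "d3": "search engine web", "d4": "crawl trap"}
servers = split_documents(docs, 2)
print(search(servers, ["web", "search"]))  # ['d1', 'd3']
```

Adding a document touches a single server, which is why updates are easy, but the loop in `search` visits every server for every query.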
Term Partitioning
[Figure: the term-document matrix (terms t1 ... tm, documents d1 ... dn) split by rows. Each partition holds all documents d1 ... dn for a subset of the terms: {t1, t2, t3}, {t4, t5, t6}, ..., {tm-2, tm-1, tm}.]
Term Partitioning
• Split the lexicon
• Advantages
  – Reduced communication with query servers
• Disadvantages
  – More processing before partitioning
  – Adding new documents is hard
  – Load balancing is hard
  – Processing throughput limited by query length
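Term partitioning can be sketched by placing each term's whole posting list on a single server. As before this is only an illustration: `server_for`, `split_terms`, `search`, the hash-based placement, and the AND-query semantics are assumptions.

```python
import hashlib
from collections import defaultdict


def server_for(term, num_servers):
    """Pick the server that owns a term (hash placement is an assumption)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def split_terms(docs, num_servers):
    """Place each term's entire posting list on exactly one server."""
    servers = [defaultdict(set) for _ in range(num_servers)]
    for doc_id, text in docs.items():
        for term in text.split():
            servers[server_for(term, num_servers)][term].add(doc_id)
    return servers


def search(servers, terms):
    """AND query: contact only servers owning a query term; the coordinator
    intersects the posting lists they return."""
    contacted = {server_for(t, len(servers)) for t in terms}
    lists = [servers[server_for(t, len(servers))].get(t, set()) for t in terms]
    hits = set.intersection(*lists) if lists else set()
    return sorted(hits), contacted


docs = {"d1": "web search", "d2": "web crawl", "d3": "search engine web", "d4": "crawl trap"}
servers = split_terms(docs, 4)
hits, contacted = search(servers, ["web", "search"])
print(hits)            # ['d1', 'd3']
print(len(contacted))  # at most 2: one server per query term
```

A query contacts at most one server per term, but a new document has to update the posting list of every term it contains, and servers that own popular terms receive more traffic.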
Advanced Partitioning
• Topical partitioning using clustering
  – Documents clustered by term-similarity
  – Partitions made up of one or more clusters
• Usage-induced partitioning
  – Queries extracted from logs
  – Documents clustered by query-similarity
  – Partitions made up of one or more clusters
Ranking Feature Computation
• Parallel/distributed computation tasks
  – Text/language processing
  – Document classification/clustering
  – Web graph analysis
Example: PageRank
• Link-based global (query-independent) importance metric
• Random surfer model
  – Start at a random page
  – With probability d, navigate to new page by following a random link on current page
  – With probability (1 – d), restart at a random page
• PageRank score = expected fraction of time spent at a page
Formula
$$p(x) = d \cdot \sum_{y \to x} \frac{p(y)}{\mathrm{out}(y)} + \frac{1 - d}{n}$$
• p(x): PageRank of page x
• p(y): PageRank of y, where y links to x (y → x)
• out(y): out-degree of page y
• (1 – d) / n: probability of random restart at x
Algorithm
i = 0
p[i](x) = (1 – d) / n
repeat
    i += 1
    p[i](x) = (1 – d) / n
    for all y → x
        p[i](x) += d ∙ p[i–1](y) / out(y)
until | p[i] – p[i–1] | < ε
Implementation
• Two vectors, current and next
• Initialize vectors
• Iterate over all pages y, distribute PageRank from current(y) to next(x) for all links y → x
• current = next, re-initialize next
• Go back to iteration over pages or stop
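The two-vector implementation can be sketched directly. This is a minimal single-machine sketch, assuming every page has at least one out-link and every link target is itself a key of `edges` (no dangling-page handling); `pagerank`, `eps`, and `max_iter` are illustrative names, and d = 0.85 is only a commonly used default.

```python
def pagerank(edges, d=0.85, eps=1e-8, max_iter=200):
    """edges maps each page y to the list of pages x that y links to."""
    n = len(edges)
    current = {x: (1 - d) / n for x in edges}      # initialize as on the Algorithm slide
    for _ in range(max_iter):
        next_ = {x: (1 - d) / n for x in edges}    # (re-)initialize next
        for y, targets in edges.items():
            for x in targets:                      # distribute current(y) along y -> x
                next_[x] += d * current[y] / len(targets)
        delta = sum(abs(next_[x] - current[x]) for x in edges)
        current = next_                            # current = next
        if delta < eps:                            # stop, or go back to the iteration
            break
    return current


# On a three-page cycle every page is equally important.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
print({page: round(score, 4) for page, score in ranks.items()})
# {'a': 0.3333, 'b': 0.3333, 'c': 0.3333}
```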
Distribution
• MapReduce for each iteration i
• Map
  – Take <y, (current(y), edges(y))>
  – For each y → x in edges(y), emit <x, current(y) / | edges(y) |>
  – Also emit <y, edges(y)>
• Reduce
  – Take <x, val> and <x, edges(x)>
  – Sum (d ∙ val) into next(x), add (1 – d) / n
  – Emit <x, (next(x), edges(x))>
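One iteration of the Map and Reduce steps above can be simulated in memory. This is a sketch, not a real MapReduce job: `map_page`, `reduce_page`, and `iterate` are hypothetical names, the shuffle is a plain dictionary, edge lists are distinguished from rank contributions by type, and every page is assumed to have at least one out-link.

```python
from collections import defaultdict


def map_page(y, value):
    current, edges = value
    for x in edges:
        yield x, current / len(edges)   # <x, current(y) / |edges(y)|>
    yield y, edges                      # <y, edges(y)>: pass the link structure along


def reduce_page(x, values, d, n):
    edges, total = [], 0.0
    for v in values:
        if isinstance(v, list):
            edges = v                   # the <x, edges(x)> record
        else:
            total += d * v              # sum d * val into next(x)
    return x, (total + (1 - d) / n, edges)


def iterate(state, d=0.85):
    """One MapReduce round; the shuffle phase is simulated by grouping on the key."""
    groups = defaultdict(list)
    for y, value in state.items():
        for key, v in map_page(y, value):
            groups[key].append(v)
    return dict(reduce_page(x, vs, d, len(state)) for x, vs in groups.items())


state = {"a": (1 / 3, ["b"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
state = iterate(state)
print(state["a"][1])            # ['b']: the edges survive the round
print(round(state["a"][0], 4))  # 0.3333: the uniform vector is a fixed point on a cycle
```

Emitting the edge list alongside the rank contributions is what lets the output of one round serve as the input of the next.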
Distribution
[Figure: data flow of one iteration: <y, (current(y), edges(y))> → Map → <x, val>, <x, val> → Reduce → <x, (next(x), edges(x))>]
Query Processing
• Locate, retrieve, process, and serve query results
[Figure: query-serving components: query coordinator, cache, query servers, inverted index files. A query goes in; results come out.]
Architecture
• Multiple sites connected by WAN
  – Site = coordinator + servers + cache
• Partitioning
  – Parallel processing
  – Distributed storage of data
  – E.g., index partitioning
• Replication
  – Availability
  – Throughput
  – Response time
Issues
• Routing the query
  – To sites
      E.g., identical sites + routing by dynamic DNS lookup
  – Within sites
• Merging the results
• Caching
Issues
                     Routing                          Merging
Document partition   All servers                      Results selected by servers; ranking by coordinator
Term partition       Servers containing query terms   Selection and ranking by coordinator
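The difference in the Merging column can be made concrete. A sketch with made-up data: under document partitioning each server returns already-selected, scored hits and the coordinator only ranks them; under term partitioning the servers return posting lists and the coordinator both selects and ranks. The function names, the `scores` table, and the AND semantics are assumptions.

```python
import heapq


def merge_document_partitions(per_server_results, k):
    """Servers return their own (score, doc_id) hits; the coordinator ranks the union."""
    candidates = (hit for results in per_server_results for hit in results)
    return heapq.nlargest(k, candidates)


def merge_term_partitions(postings_by_term, scores, k):
    """Servers return posting lists; the coordinator selects the documents that
    contain every query term, then ranks them."""
    selected = set.intersection(*map(set, postings_by_term.values()))
    return heapq.nlargest(k, ((scores[doc], doc) for doc in selected))


per_server = [[(0.9, "d1"), (0.4, "d3")], [(0.7, "d5")]]
print(merge_document_partitions(per_server, k=2))  # [(0.9, 'd1'), (0.7, 'd5')]

postings = {"web": ["d1", "d3", "d5"], "search": ["d1", "d5"]}
scores = {"d1": 0.9, "d3": 0.4, "d5": 0.7}  # hypothetical per-document scores
print(merge_term_partitions(postings, scores, k=2))  # [(0.9, 'd1'), (0.7, 'd5')]
```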
Caching
• What to cache?
  – Query answers
      Faster response
  – Term postings
      More hits
      Query terms repeated more frequently than whole queries
Caching Policy
• Terms most frequent in queries → high hit ratio
• Terms most frequent in documents require more cache space (longer postings)
• Use static caching based on query/document frequency ratio
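The static policy can be sketched as a greedy selection. This is one possible reading, not a definitive algorithm: `choose_static_cache` is a hypothetical helper, the frequency numbers are invented, and the length of a posting list is approximated by the term's document frequency.

```python
def choose_static_cache(stats, capacity):
    """stats maps term -> (query_freq, doc_freq), with doc_freq > 0.
    Greedily cache terms in decreasing query/document frequency ratio,
    skipping any term whose postings no longer fit."""
    ranked = sorted(stats, key=lambda t: stats[t][0] / stats[t][1], reverse=True)
    cached, used = [], 0
    for term in ranked:
        size = stats[term][1]   # posting-list length approximated by doc_freq
        if used + size <= capacity:
            cached.append(term)
            used += size
    return cached


stats = {"the": (900, 1000), "stanford": (300, 50), "crawler": (100, 20), "zebra": (5, 10)}
print(choose_static_cache(stats, capacity=100))  # ['stanford', 'crawler', 'zebra']
```

The term "the" is the most frequent in queries, but its postings are too long to be worth the space, which is the trade-off the ratio captures.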
Summary
• Crawling
  – Partitioning: balancing and contravariance
  – Consistent hashing
• Indexing
  – Document, term, topical, and usage-induced partitioning
• Computing ranking features
  – PageRank with MapReduce
• Serving queries
  – Routing queries, merging results, and caching postings