Post on 15-Jan-2016
Peer-to-Peer Information Search
Sebastian Michel
Ecole Polytechnique Fédérale Lausanne
Lausanne - Switzerland
Josiane Xavier Parreira
Max-Planck Institute for Informatics
Saarbrücken - Germany
Peer-to-Peer Information Search - SBBD 2007 Tutorial 221/04/23
Outline of Part 1
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
P2P Systems
Known from Napster and others
Sharing of mostly illegal content (mp3, movies): P2P = Pirate-to-Pirate?
New kind of network organization; no client/server anymore
Basic ideas:
Each peer connects to a few other peers
All peers together form a powerful network
Potential benefits:
No single point of failure
Load is spread across multiple peers
Resilient to failures and dynamics
Peer: “one that is of equal standing with another”
(source: Merriam-Webster Online Dictionary )
Napster
[Figure: peers publish file statistics to the central server; file downloads take place directly between peers]
• Central server (index)
• Client software sends information about users' contents to the server.
• Users send queries to the server.
• Server responds with the IPs of users that store matching files. Peer-to-peer file sharing!
• Developed in 1998.
• First P2P file-sharing system
Gnutella
Protocol for distributed file sharing
Started in 2000; in 2005: 1.81 million computers connected*
Unstructured network
Truly decentralized
Uses message flooding during query execution
Later: version with super nodes and query routing
* http://www.slyck.com/news.php?story=814
Gnutella Style
[Figure: the query "Paris Hilton?" is flooded through the network; the TTL starts at 3 and is decremented at every hop until it reaches 0]
Gnutella Style
Pros: no complex statistical bookkeeping
Cons: a lot of network traffic; some peers might not be reachable (TTL)
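A minimal sketch of TTL-limited flooding may help to see both drawbacks at once. It is not the Gnutella wire protocol: the adjacency-dict graph, the function name `flood_query` and the `has_match` callback are assumptions made purely for illustration.

```python
from collections import deque


def flood_query(graph, start, ttl, has_match):
    """Flood a query from `start`; every forwarded message decrements the TTL.

    graph: dict mapping a peer to the list of its neighbours (hypothetical).
    Returns (peers with a match, number of messages sent).
    """
    seen = {start}
    hits = [start] if has_match(start) else []
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: the query is not forwarded any further
        for neighbour in graph[peer]:
            messages += 1  # each forward costs one message, even to known peers
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if has_match(neighbour):
                hits.append(neighbour)
            queue.append((neighbour, remaining - 1))
    return hits, messages


# A chain A-B-C-D-E where only E stores a matching file.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
hits_ttl3, _ = flood_query(chain, "A", 3, lambda p: p == "E")
hits_ttl4, _ = flood_query(chain, "A", 4, lambda p: p == "E")
```

With a TTL of 3 the matching peer E is never reached, while every hop still generates traffic; raising the TTL finds E at the price of more messages.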
BitTorrent
Idea: load sharing through file splitting
A lot of (legal) software distributors offer software through BitTorrent
Download information is kept in a small .torrent file
One tracker node per file (specified in the .torrent file)
[Figure: a file split into segments 1 to 5; the client requests a random peer list from the tracker node and then requests segments from those peers]
Incentives: "tit-for-tat". Each peer remembers collaborative peers; different priorities
Literature
Book: Andy Oram: Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O'Reilly Media, Inc.
Overlay Networks
Built on top of existing networks
Different ways to build an overlay network: structured, unstructured, hybrid
Self* Properties (Promises)
Self-Organizing: evolves, grows, ... without being guided/managed
Self-Optimizing
Self-Configuring
Self-Healing: Self-Restoration, Self-Diagnostics
Self-Protecting
Outline
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
Distributed Hash Tables
Hash-Table: given a key, return the bucket id. Based on a hash function (like SHA-1)
Now: Distributed. For a given key, return the id of the peer currently responsible for the key.
Challenge: Purely distributed protocols that cope with node failures, departures, arrivals.
No central manager.
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54]
Chord
uses an m-bit identifier space ordered in a mod-2^m circle, the Chord ring;
maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1;
uses consistent hashing: an object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to, or follows, id on the Chord ring.
Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p.
Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
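The successor rule can be sketched in a few lines. This is an illustration only, assuming a 6-bit identifier space (so that the peer ids of the example ring fit) and a global, sorted view of all peer ids that a real Chord node does not have; `chord_id` and `successor` are hypothetical helper names.

```python
import bisect
import hashlib

M = 6  # assumed number of identifier bits: ring positions 0..63


def chord_id(name, m=M):
    """Hash a name (file name, IP address, ...) onto the m-bit ring via SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(sorted_peer_ids, key_id):
    """First peer whose id is equal to or follows key_id on the ring."""
    idx = bisect.bisect_left(sorted_peer_ids, key_id)
    return sorted_peer_ids[idx % len(sorted_peer_ids)]  # wrap around past the largest id


peers = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
placement = {k: successor(peers, k) for k in (10, 24, 30, 38, 54)}
```

For the keys of the example ring this places k10 on p14, k24 and k30 on p32, k38 on p38 and k54 on p56; a key such as 60 wraps around to p1.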
Chord
Peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance (finger tables).
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56, three finger tables and Lookup(54) for key k54]
Finger table of p8: p8 + 1 → p14, p8 + 2 → p14, p8 + 4 → p14, p8 + 8 → p21, p8 + 16 → p32, p8 + 32 → p42
Finger table of p42: p42 + 1 → p48, p42 + 2 → p48, p42 + 4 → p48, p42 + 8 → p51, p42 + 16 → p1, p42 + 32 → p14
Finger table of p51: p51 + 1 → p56, p51 + 2 → p56, p51 + 4 → p56, p51 + 8 → p1, p51 + 16 → p8, p51 + 32 → p21
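The finger tables and the lookup path can be reproduced with a small simulation. It is a sketch under simplifying assumptions: a 6-bit ring, the ten peers of the example, and a global peer list used to build the tables (a real node learns its fingers through the protocol). All function names are hypothetical.

```python
import bisect

M = 6                      # assumed identifier bits
RING = 2 ** M
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]


def successor(key_id):
    """First peer at or after key_id on the ring (with wrap-around)."""
    idx = bisect.bisect_left(PEERS, key_id % RING)
    return PEERS[idx % len(PEERS)]


def finger_table(n):
    """Finger i points to succ(n + 2^i), i = 0..M-1."""
    return [successor(n + 2 ** i) for i in range(M)]


def between(x, a, b):
    """True if x lies strictly between a and b when walking clockwise."""
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key_id):
    """Route a lookup greedily via finger tables; returns the visited peers."""
    node, path = start, [start]
    while True:
        succ = successor(node + 1)
        if key_id == succ or between(key_id, node, succ):
            path.append(succ)          # succ is responsible for the key
            return path
        nxt = succ
        for finger in reversed(finger_table(node)):
            if between(finger, node, key_id):
                nxt = finger           # closest preceding finger
                break
        node = nxt
        path.append(node)


route = lookup(8, 54)
```

Starting at p8, the lookup for key 54 visits p8, p42, p51 and ends at p56, using one long-distance finger per hop.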
Node Joins in Chord
[Figure: new peer p42 joins between p38 and p48, which stores keys k39, k40 and k43: lookup(42) locates the successor; p42 sets its succ pointer; keys k39 and k40 are moved to p42; p38 updates its succ pointer]
init_finger_tables()
successor = node.find_successor()
predecessor = successor.predecessor
predecessor.successor = new
And others ...
P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. SIGCOMM 2001: 161-172
Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, John Kubiatowicz: Handling Churn in a DHT. USENIX Annual Technical Conference, June 2004.
Range queries
A range query [v1, v2] searches for those peers which store data with key value k ∈ [v1, v2]
DHTs efficiently support only exact-match queries
The naïve approach to processing range queries in DHTs is to query each value of the range individually
It is HIGHLY EXPENSIVE!
DHTs and Range Queries
There are two main solutions to cope with load imbalances i.e. to perform load balancing:
transferring load, or replicating data
Order-preserving hash function: $f(k) = \frac{k - \min}{\max - \min} \cdot 2^m$
usually leads to skewed distributions
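A small sketch shows why an order-preserving hash helps range queries and why it skews the load. It assumes integer keys in a known domain [k_min, k_max] and the mapping f(k) = (k - min) / (max - min) * 2^m; clamping the largest key onto the last ring position is an extra assumption of this example.

```python
from collections import Counter


def order_preserving_hash(k, k_min, k_max, m):
    """Map k from [k_min, k_max] onto a ring of 2**m positions, keeping the order."""
    ring = 2 ** m
    pos = (k - k_min) * ring // (k_max - k_min)
    return min(pos, ring - 1)  # assumption: k_max is clamped onto the last position


# A range query [20, 30] maps to one contiguous arc of identifiers.
lo = order_preserving_hash(20, 0, 100, 6)
hi = order_preserving_hash(30, 0, 100, 6)

# Skewed key values produce a skewed load on the ring positions.
skewed_keys = [1, 2, 2, 3, 3, 3, 4, 4, 5, 90]
load = Counter(order_preserving_hash(k, 0, 100, 6) for k in skewed_keys)
```

The range becomes the arc from `lo` to `hi`, so a query can walk neighbouring peers instead of issuing one lookup per value, but most of the skewed keys end up on the first few ring positions.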
DHT and Range Queries (2)
Existing approaches to deal with range queries:
Locality-preserving hashing: OP-Chord: Triantafillou et al. (2003); Skip Graphs: Aspnes et al. (2004)
Hashing ranges of values instead of each value individually: CAN-based: Andrzejak et al. (2002), Sahin et al. (2004)
Another problem in that context: access load imbalances
One possible solution: transferring "hot data" to deal with those load imbalances
However, data transfer does not solve access load imbalances under skewed access (query) distributions
HotRoD: replicating hot arcs
Theoni Pitoura et al. EDBT 2006.
A peer is "hot" (or overloaded) when its load exceeds the upper limit of its resource capacity
An arc of peers is "hot" when at least one of its peers is hot
Replicate ranges of values
Efficient Load Balancing
[Figure: Lorenz curves for access load distribution (r = 200, θ = 0.8); x-axis: cumulative percentage of peers; y-axis: cumulative percentage of hits; curves: HotRoD, OP-Chord, Chord, line of uniformity; marked points at 97% of the peers: 90%, 73% and 40% of the hits]
Outline
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
Building a P2P Search Engine (Peer-to-Peer Information Retrieval)
"Distributed Google"
P2P approach is best suited:
large number of peers
exploit mostly idle resources
intellectual input of user community
scalable and self-organizing
Information Retrieval Basics
[Figure: a document and its terms, each with its number of occurrences (term frequency), e.g. 5 x, 7 x, 4 x]
Information Retrieval Basics (2)
Index lists with (DocId: tf*idf) entries, sorted by score
B+ tree on terms
Top-k query processing: find the k documents with the highest total score
Query execution: usually using some kind of threshold algorithm*:
- sequential scans over the index lists (round-robin)
- (random accesses to fetch missing scores)
- aggregate scores
- stop when the threshold is reached
* e.g. Fagin's algorithm TA or a variant without random accesses
[Figure: B+ tree on terms pointing to index lists of (DocId: score) entries such as d53: 0.8, d55: 0.6, d12: 0.5, ...]
Going distributed: Index Organization
Peer index: every peer has its own collection (full documents); distributed index = index of peer descriptions
Document index
[Figure: index lists of (DocId: score) entries spread over Peer 1, Peer 2 and Peer 3]
(Full) Document Index
Straightforward from the centralized document index
Each peer is responsible for storing the index list for a subset of terms.
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56]
Query Routing: DHT lookups
Query Execution: Distributed Top-k [TPUT '04, KLEE '05]
Peer Index
Each peer has its own local index (e.g., created by web crawls)
Peers publish compact per-term descriptions about their index
Query Routing: 1. DHT lookups 2. Retrieve metadata 3. Find most promising peers
Query Execution: send the complete query and merge the incoming results
Distributed Directory (Term: List of Peers)
a: P1 P6 P4
b: P5 P3 P1 P6 ...
[Figure: peers P1 to P6 around the distributed directory]
P2P Search with Minerva
[Figure: Minerva architecture: query peer P0 with local index X0 and bookmarks B0; peer lists (directory) such as term g: 13, 11, 45, ...; term a: 17, 11, 92, ...; term f: 43, 65, 92, ...; term c: 13, 92, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...; peer ranking & statistics]
Based on a scalable, churn-resilient DHT with O(log n) key lookup
Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc.
Maintain semantic/social/statistical overlay network (SON)
Exploit community behavior (bookmarks, links, tags, clicks, etc.)
Two major Problems
Query Routing (aka database/collection/peer selection): task of finding "high quality" peers
Result Merging: task of merging the obtained results into the final ranking
Overview articles:
J. Callan: Distributed information retrieval. In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers, 2000: 127-150
Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
Query Routing
Given a query Q = {term1, term2, ..., termN}: select the most promising peers
Based on:
per-term per-peer statistics: document frequency, vocabulary size
+ normalization issues like collection frequency, avg vocabulary size
Most popular: CORI, GlOSS, Decision Theoretic Framework (DTF)
CORI
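Apply document ranking to resource ranking
[Figure: inference network linking the query q to the terms t1, t2, t3, ..., tk and the terms to the resources p1, p2, ..., pj-1, pj]
$$T(p, t) = \frac{df_{p,t}}{df_{p,t} + 50 + 150 \cdot cw_p / avg\_cw}$$
$$I(t) = \frac{\log\left(\frac{C + 0.5}{cf_t}\right)}{\log(C + 1.0)}$$
$$s(p_j, t_i) = b + (1 - b) \cdot T(p_j, t_i) \cdot I(t_i)$$
$$S(p_j, Q) = \frac{1}{n} \sum_{t_i \in Q} s(p_j, t_i)$$
C = #peers
df = document frequency
cf = collection frequency
cw = # distinct words per peer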
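The CORI peer score can be sketched as follows, assuming the usual form T = df / (df + 50 + 150 * cw / avg_cw), I = log((C + 0.5) / cf) / log(C + 1.0) and s = b + (1 - b) * T * I, averaged over the query terms. The default b = 0.4, the reading of n as the number of query terms and all function and parameter names are assumptions of this example, not values given in the tutorial.

```python
import math


def cori_term_score(df_pt, cw_p, avg_cw, cf_t, num_peers, b=0.4):
    """Score of one peer for one term; b = 0.4 is an assumed default belief."""
    t = df_pt / (df_pt + 50 + 150 * cw_p / avg_cw)
    i = math.log((num_peers + 0.5) / cf_t) / math.log(num_peers + 1.0)
    return b + (1 - b) * t * i


def cori_peer_score(query_terms, peer_df, cw_p, avg_cw, cf, num_peers, b=0.4):
    """Average of the per-term scores over all query terms."""
    scores = [
        cori_term_score(peer_df.get(t, 0), cw_p, avg_cw, cf[t], num_peers, b)
        for t in query_terms
    ]
    return sum(scores) / len(scores)


# Hypothetical statistics: 10 peers, each term occurs at 5 of them.
cf = {"p2p": 5, "search": 5}
expert = cori_peer_score(["p2p", "search"], {"p2p": 400, "search": 250}, 1000, 1000, cf, 10)
casual = cori_peer_score(["p2p", "search"], {"p2p": 3}, 1000, 1000, cf, 10)
```

A peer with a high document frequency for the query terms ends up above a peer that barely covers them, and a peer without any matching document falls back to the default belief b.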
Literature
J. Callan: Distributed information retrieval. In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers, 2000: 127-150
Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
Decision Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)
Result Merging
Problem: incomparable scores
Different corpus statistics:
the df component used in tf*idf scoring functions is not globally known
user with a lot of high-quality documents for term a → high df
non-expert user with some bad documents for term a → low df
Different scoring functions:
completely different functions
different parameters in the same function
Result Merging Approaches
Score normalization by using global statistics: computation of global statistics is difficult (not obvious); solution using gossip
Score re-computation with the query initiator's local statistics: requires re-ranking and knowledge about document contents
Score re-computation using query routing scores: routing score available anyway
$$R_i' = \frac{R_i - R_{min}}{R_{max} - R_{min}}$$
$$D' = \frac{D - D^i_{min}}{D^i_{max} - D^i_{min}}$$
$$D'' = \frac{D' + 0.4 \cdot D' \cdot R_i'}{1.4}$$
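The re-computation with routing scores can be sketched as below, assuming min-max normalisation of the document score D (per peer) and of the routing score R (over all selected peers), combined as D'' = (D' + 0.4 * D' * R') / 1.4. Mapping a degenerate range (max equal to min) to 0 is an assumption of this example, as are the function names.

```python
def normalize(value, lowest, highest):
    """Min-max normalisation into [0, 1]; a degenerate range maps to 0 (assumption)."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


def merged_score(doc_score, doc_min, doc_max, peer_score, peer_min, peer_max):
    """Combine a peer-local document score with the peer's routing score."""
    d = normalize(doc_score, doc_min, doc_max)
    r = normalize(peer_score, peer_min, peer_max)
    return (d + 0.4 * d * r) / 1.4


# The same local document score counts more when it comes from a better-ranked peer.
from_best_peer = merged_score(0.8, 0.0, 1.0, peer_score=0.9, peer_min=0.4, peer_max=0.9)
from_worst_peer = merged_score(0.8, 0.0, 1.0, peer_score=0.4, peer_min=0.4, peer_max=0.9)
```

Since the routing score is available anyway after peer selection, this merge needs no additional statistics from the peers.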
Global DF Estimation
gdf (global document frequency) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible
hash sketches [Flajolet/Martin 1985]: duplicate-insensitive cardinality estimator for multisets
hash each multiset element x onto an m-bit bitvector and remember the position of its least significant 1 bit
rough intuition: the least-significant bit is set by half of the documents, the second bit by ¼ of the documents, ...
Theory says: the most significant bit set is an estimator of log(n); n = #documents
Higher accuracy: average multiple iid sketches
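A compact sketch of the idea follows. It uses the lowest unset bit position R with the correction n ≈ 2^R / 0.77351 from the Flajolet/Martin paper, 32-bit hash values derived from SHA-1, and seeds to obtain several independent sketches; these concrete choices and all names are assumptions of this example.

```python
import hashlib


def rho(value, bits=32):
    """Position of the least significant 1 bit (bits - 1 if value is 0)."""
    if value == 0:
        return bits - 1
    return (value & -value).bit_length() - 1


def hash_sketch(items, seed=0, bits=32):
    """Bit vector (as an int) with bit rho(h(x)) set for every element x."""
    sketch = 0
    for x in items:
        digest = hashlib.sha1(f"{seed}:{x}".encode("utf-8")).digest()
        sketch |= 1 << rho(int.from_bytes(digest[:4], "big"), bits)
    return sketch


def estimate(sketches, bits=32):
    """Average the lowest unset bit position over all sketches, then scale."""
    total = 0
    for s in sketches:
        r = 0
        while r < bits and (s >> r) & 1:
            r += 1
        total += r
    return 2 ** (total / len(sketches)) / 0.77351


docs_peer1 = range(0, 700)
docs_peer2 = range(300, 1000)      # overlaps with peer 1
merged = [hash_sketch(docs_peer1, seed=s) | hash_sketch(docs_peer2, seed=s) for s in range(64)]
gdf_estimate = estimate(merged)
```

Because duplicates set the same bit again, OR-ing the sketches of overlapping peers yields exactly the sketch of the union, so the estimate refers to the 1000 distinct documents rather than to the 1400 postings.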
Global DF Estimation
Hash sketches of different peers are collected at the directory peer
distributivity is free!! $\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$ (ρ: position of the least significant 1 bit)
gdf estimation algorithm:
each peer p posts a hash sketch for each (discriminative) term t to the directory
the directory peer for term t forms the union of incoming hash sketches
when a peer needs to know gdf(t), it simply asks the directory peer for t
sliding-window techniques for dynamic adjustment
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
Outline
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
Autonomous Peers: Overlapping Sources
[Figure: a querying peer choosing among peers A, B, C, D, E with overlapping collections; first plot: recall against #peers (1 to 4) for the peer sets {A}, {A,B}, {A,B,C}, {A,..,D}; second plot, overlap-aware routing strategy: recall against #peers (1 to 2) for {A}, {A,E}]
How?
Enrich published statistics with overlap estimators.
Interested in NOVELTY and QUALITY
Iterative greedy selection process: select the first peer based on quality; select each next peer by quality*novelty
Suitable synopses for overlap estimation:
Bloom filter [Bloom 1970]
hash sketches [Flajolet & Martin 1985]
min-wise independent permutations [Broder 1997]
Min-Wise Independent Permutations [Broder 97]
MIPs are an unbiased estimator of overlap: $P[\min\{h(x) \mid x \in A\} = \min\{h(y) \mid y \in B\}] = |A \cap B| / |A \cup B|$
Compute N random permutations of the set of ids; the MIPs vector holds the minima of the permutations.

set of ids:               17  21   3  12  24   8
h1(x) = 7x + 3 mod 51:    20  48  24  36  18   8   → min 8
h2(x) = 5x + 6 mod 51:    40   9  21  15  24  46   → min 9
…
hN(x) = 3x + 9 mod 51:     9  21  18  45  30  33   → min 9

MIPs(set1) = (8, 9, 33, 24, 36, 9)
MIPs(set2) = (8, 24, 45, 24, 48, 13)
estimated overlap = 2/6
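The example above can be replayed in code. The three linear hash functions are the ones from the slide; the helper names `mips_vector` and `estimate_overlap` are made up for this sketch.

```python
def mips_vector(ids, permutations):
    """Minimum of every permutation (hash function) over the id set."""
    return [min(h(x) for x in ids) for h in permutations]


def estimate_overlap(vector_a, vector_b):
    """Fraction of positions at which the two MIPs vectors agree."""
    matches = sum(1 for a, b in zip(vector_a, vector_b) if a == b)
    return matches / len(vector_a)


permutations = [
    lambda x: (7 * x + 3) % 51,
    lambda x: (5 * x + 6) % 51,
    lambda x: (3 * x + 9) % 51,
]

minima = mips_vector([17, 21, 3, 12, 24, 8], permutations)
overlap = estimate_overlap([8, 9, 33, 24, 36, 9], [8, 24, 45, 24, 48, 13])
```

The minima come out as 8, 9 and 9, and the two six-entry vectors agree in two positions, which gives the estimate 2/6. A peer only has to publish its vector of minima, not its document ids.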
Bloom Filter [Bloom 1970]
bit array of size m
k hash functions h_i: docId_space → {1,..,m}
insert n docs by hashing the ids and setting the corresponding bits
a document is reported to be in the Bloom filter if the corresponding bits are set
probability of false positives (pfp)
tradeoff accuracy vs. efficiency
[Figure: bit array with positions 1 to 16; one document sets bits h1 = 9 and h2 = 2, another sets h1 = 14 and h2 = 6; a lookup with h1 = 15, h2 = 9 fails (X); a lookup with h1 = 6, h2 = 2 finds both bits set]
Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
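A minimal Bloom filter sketch, assuming the k hash functions are derived from SHA-1 with a per-function prefix (any independent hash family would do). The false-positive estimate uses the standard approximation (1 - e^(-kn/m))^k; class and function names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of bits
        self.k = k                  # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Assumption: hash function i is SHA-1 over "i:item", reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_probability(m, k, n):
    """Approximate pfp after inserting n items into m bits with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=256, k=3)
for doc_id in ("d17", "d44", "d52"):
    bf.add(doc_id)
```

Growing m lowers the false-positive probability at the cost of a larger synopsis, which is the accuracy vs. efficiency tradeoff named above.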
Multi-Key Statistics
solves an interesting problem:
a peer with a lot of documents on american football and a lot of documents about pop music may not have a single document about american music
this cannot be predicted using per-term statistics:
$quality(a \text{ and } b) \neq quality(a) + quality(b)$
Obvious: $quality(a) \sim df(a)$. Recall that $T = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw}$ ...
Multi-Key Statistics in P2P
Motivation: estimated_quality(a and b) = quality(a) + quality(b) = df_a + df_b != df_(a and b)
Impossible (infeasible) to consider all term pairs, triplets, quadruples, ...
Query driven: analyze query logs at directory peers, plus data-driven verification:
P[Anna|Kournikova] = ..., P[Andy|Roddick] = ..., P[Berlin|Marathon] = ...
No additional messages + shorter lists + highly accurate
Additional statistics often not needed
Whole process can be easily integrated into peer-level P2P IR
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
Single-term vs. multi-term P2P document indexing
Single-term indexing: term 1 → posting list 1, term 2 → posting list 2, ..., term M-1 → posting list M-1, term M → posting list M, spread over PEER 1 ... PEER N; small vocabulary, long posting lists
Multi-term indexing (multi-term keys): PEER 1 holds key 11 → posting list 11, key 12 → posting list 12, ..., key 1i → posting list 1i; ...; PEER N holds key N1 → posting list N1, key N2 → posting list N2, ..., key Nj → posting list Nj; large vocabulary, short posting lists
make use of highly discriminative keys
limit influence of overly long index lists
consider term pairs (triplets, ...) for shorter lists
efficient query processing
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
Literature
Overlap Awareness:
Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu: Identifying redundant search engines in a very large scale metasearch engine context. WIDM 2006: 51-58
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Improving collection selection with overlap awareness in P2P search engines. SIGIR 2005: 67-74
Thomas Hernandez, Subbarao Kambhampati: Improving text collection selection with coverage and overlap statistics. WWW (Special interest tracks and posters) 2005: 1128-1129
Sketches:
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
Andrei Broder, Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
Literature Multi-key statistics:
Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
Outline
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
For the IR people ...
Why top-k?
Cannot take a look at all matching documents. E.g., Google provides millions of documents about Britney Spears
Requires ranking (scoring): in text retrieval, for instance, tf*idf (+ of course PageRank if you wish)
Remember part one: local query execution at each peer (peer-index model) AND truly distributed top-k processing in the full document index.
For the DB guys ...
Table with schema (id, attribute, value)
SELECT id, aggr(value)
FROM table
GROUP BY id
ORDER BY aggr(value) DESC
LIMIT k
For the networking guys ...
Network Monitoring: find clients that cause high network traffic.
Four local traffic tables:

IP            Traffic
192.168.1.7   31 kB
192.168.1.3   23 kB
192.168.1.4   12 kB

IP            Traffic
192.168.1.8   81 kB
192.168.1.3   33 kB
192.168.1.1   12 kB

IP            Traffic
192.168.1.4   53 kB
192.168.1.3   21 kB
192.168.1.1    9 kB

IP            Traffic
192.168.1.1   29 kB
192.168.1.4   28 kB
192.168.1.5   12 kB
Computational Model
m lists with (itemId, score)-pairs sorted by score descending. One list per attribute (e.g. term)
Aggregation function aggr()
Monotonicity is important: for all items a, b:
$$\forall i: score_i(a) \le score_i(b) \;\Rightarrow\; aggr(a) \le aggr(b)$$
with $score_i(x)$ denoting the score of item x in list i
Goal: return the top-k items w.r.t. their aggregated (overall) scores
How to process this?
Most popular: family of threshold algorithms (Fagin, 1999; Nepal/Ramakrishna, 1999; Güntzer/Balke/Kießling, 2001)
Basic ideas:
keep an upper and a lower score bound for each document
lowerbound (or worstscore) = sum of scores we have seen so far, assuming 0 for unseen dimensions
upperbound (or bestscore) = lowerbound + highest possible value for unseen dimensions
know what we've got already; know what to expect
stop if no further step can improve the current (i.e. final) ranking
Fagin's NRA
NRA(q, L):
  top-k := ∅; candidates := ∅; min-k := 0;
  scan all lists L_i (i = 1..m) in parallel:
    consider item d at position pos_i in L_i;
    E(d) := E(d) ∪ {i};
    high_i := s_i(q_i, d);
    worstscore(d) := aggr{s_j(q_j, d) | j ∈ E(d)};
    bestscore(d) := aggr{aggr{s_j(q_j, d) | j ∈ E(d)}, aggr{high_j | j ∉ E(d)}};
    if worstscore(d) > min-k then
      remove argmin_d'{worstscore(d') | d' ∈ top-k} from top-k;
      add d to top-k;
      min-k := min{worstscore(d') | d' ∈ top-k};
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d};
    threshold := max{bestscore(d') | d' ∈ candidates};
    if threshold ≤ min-k then exit;
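A runnable sketch of the no-random-access idea, simplified in two ways that are assumptions of this example: the aggregation function is fixed to summation, and the top-k and candidate sets are recomputed from scratch after every round instead of being maintained incrementally. Every index list is assumed to be non-empty and sorted by descending score.

```python
def nra(index_lists, k):
    """Scan all lists round-robin with sorted access only.

    index_lists: list of [(item, score), ...], each sorted by score descending.
    Returns (top-k item ids, scan depth at which the scan stopped).
    """
    m = len(index_lists)
    seen = {}                                    # item -> {list index: score}
    high = [lst[0][1] for lst in index_lists]    # last score seen per list
    max_depth = max(len(lst) for lst in index_lists)
    worst = {}
    for depth in range(max_depth):
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0                    # exhausted list contributes nothing
        worst = {d: sum(s.values()) for d, s in seen.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        if len(ranked) < k:
            continue
        min_k = worst[ranked[k - 1]]
        # bestscore of every candidate outside the current top-k
        best_rest = [
            worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
            for d in ranked[k:]
        ]
        # a completely unseen item can reach at most sum(high)
        threshold = max(best_rest + [sum(high)])
        if threshold <= min_k:
            return ranked[:k], depth + 1
    ranked = sorted(worst, key=worst.get, reverse=True)
    return ranked[:k], max_depth


t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
t2 = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)]
t3 = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]
top, stop_depth = nra([t1, t2, t3], k=1)
```

For k = 1 the scan stops at depth 3 with d10 as the winner, because no other candidate and no unseen item can still exceed its worstscore of 2.1.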
Top-k Search (k = 1)
Data items: d1, …, dn, with s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3)
Index lists:
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, …
t3: d10 0.7, d78 0.5, d64 0.4, …, d99 0.2, d34 0.1

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0
STOP!
Outline
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
Evolution of a Candidate's Score
Observation: pruning is often overly conservative (deep scans, high memory for priority queue)
Approximate top-k: "What is the probability that d qualifies for the top-k?"
[Figure: score against scan depth, showing bestscore(d), worstscore(d) and min-k; d is dropped from the candidate queue]
Safe Thresholding vs. Probabilistic Guarantees
NRA based on invariant
$$\underbrace{\sum_{i \in E(d)} s_i(d)}_{worstscore(d)} \;\le\; s(d) \;\le\; \underbrace{\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i}_{bestscore(d)}$$
Relaxed into probabilistic threshold test
$$p(d) := P\Big[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} s_i(d) > \text{min-}k\Big]$$
Or equivalently, with $\delta(d) := \text{min-}k - \sum_{i \in E(d)} s_i(d)$:
$$p(d) := P\Big[\sum_{i \notin E(d)} s_i(d) > \delta(d)\Big]$$
[Figure: bestscore(d), worstscore(d), min-k and the gap δ(d)]
Expected Result Quality
Missing relevant items
Probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
For each candidate: p_miss ≤ ε
$$P[recall = r/k] = P[precision = r/k] = \binom{k}{r} (1 - p_{miss})^r \, p_{miss}^{(k-r)}$$
$$E[precision] = E[recall] = \sum_{r=0..k} P[precision = r/k] \cdot r/k = (1 - p_{miss}) \ge (1 - \varepsilon)$$
Outline
Introduction to P2P Systems
Distributed Hashtables & Range Queries
Peer-to-Peer IR (Query Routing, Result Merging)
Overlapping Sources / Multi-key Statistics
Top-k Query Processing
Probabilistic Pruning
Distributed Top-k
Going distributed
Key observations: network traffic is crucial; number of round trips is crucial
Straightforward application of TA/NRA? Expensive: huge number of round trips; even with batching: unpredictable performance
Where is the data?
Consider: network consumption; per-peer load; latency (query response time): network, I/O, processing
[Figure: query initiator P0; the index lists t1 (d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …), t2 (d64 0.8, d23 0.6, d10 0.6, …) and t3 (d10 0.7, d78 0.5, d64 0.4, …, d99 0.2, d34 0.1) are stored at peers P1, P2, P3 of a network that also contains P4 and P5]
Three Phase Uniform Threshold Algorithm (TPUT) [Cao and Wang, PODC 2004]
Exactly 3 phases:
1. fetch the k best entries (d, s_j) from each of P1 ... Pm and aggregate ($\sum_{j=1..m} s_j(d)$) at the query initiator
2. ask each of P1 ... Pm for all entries with s_j > min-k / m and aggregate the results at the query initiator. min-k is the score of the item currently at rank k.
3. fetch missing scores for all candidates by random lookups at P1 ... Pm
First distributed top-k algorithm with a fixed number of phases!
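The three phases can be simulated in a single process. This sketch makes several simplifying assumptions: every peer is modelled as a dict from item to score, missing items count as score 0, aggregation is summation, and the pruning of candidates by upper bounds that the original algorithm performs between phases 2 and 3 is omitted. The function name `tput` is hypothetical.

```python
def tput(peers, k):
    """peers: list of dicts item -> score (one dict per index list / peer).

    Returns the top-k (item, total score) pairs.
    """
    m = len(peers)

    # Phase 1: every peer reports its k best entries; partial sums at the initiator.
    partial = {}
    for scores in peers:
        best = sorted(scores.items(), key=lambda e: e[1], reverse=True)[:k]
        for item, score in best:
            partial[item] = partial.get(item, 0.0) + score
    min_k = sorted(partial.values(), reverse=True)[:k][-1]
    threshold = min_k / m

    # Phase 2: every peer reports all entries above the uniform threshold.
    candidates = set(partial)
    for scores in peers:
        candidates.update(item for item, score in scores.items() if score > threshold)

    # Phase 3: random lookups fetch the missing scores of all candidates.
    totals = {d: sum(scores.get(d, 0.0) for scores in peers) for d in candidates}
    return sorted(totals.items(), key=lambda e: e[1], reverse=True)[:k]


p1 = {"d78": 0.9, "d23": 0.8, "d10": 0.8, "d1": 0.7, "d88": 0.2}
p2 = {"d64": 0.8, "d23": 0.6, "d10": 0.6}
p3 = {"d10": 0.7, "d78": 0.5, "d64": 0.4, "d99": 0.2, "d34": 0.1}
top1 = tput([p1, p2, p3], k=1)
top3 = tput([p1, p2, p3], k=3)
```

For k = 1 the first phase yields min-k = 0.9 and therefore a threshold of 0.3; d10 is not the best entry of two of the three lists, yet it is collected in phase 2 and wins after phase 3.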
[Figure: TPUT message flow between coordinator peer P0 (current top-k, candidate set) and cohort peers Pi, Pj with their score-sorted index lists: top-k entries, candidates above the min-k / m threshold, retrieval of missing scores]
Analysis of TPUT
Theorem: TPUT is an exact algorithm, i.e. it identifies the true top-k items
Proof (sketch): TPUT cannot miss a true top-k item. Assume it misses one, i.e. the item is below min-k / m in all m lists. Then its overall score is < m · (min-k / m) = min-k, so it is not a true top-k item!
[Figure: state after phase 2 for list 1, list 2, list 3: every item not retrieved has a score < min-k]
Analysis of TPUT
If min-k / m is small, TPUT retrieves a lot of data in phase 2: high network traffic
Random accesses: high per-peer load
KLEE [VLDB '05]
Different philosophy: approximate answers
Efficiency: reduces (docId, score)-pair transfers; no random accesses at each peer
Two pillars: the HistogramBlooms structure; the Candidate List Filter structure
Additional Data Structures
Equi-width histogram + Bloom filter for each cell + average score per cell + upper/lower score
[Figure: histogram of #docs over score; every cell carries a Bloom filter bit vector]
Usage: during phase 1:
+ fetch top-k from each list
+ top-c cells
→ "increase" the min-k / m threshold
KLEE
[Figure: coordinator peer P0 (current top-k, candidate set) and cohort peers Pi, Pj, each with a score-sorted index list and a histogram of c cells with b-bit Bloom filters; top-k entries and candidates above the min-k / m threshold]
KLEE – Candidate Set Reduction
[Figure: coordinator peer P0 (current top-k, candidate set) sends the candidate filter matrix, a set of bit vectors, to cohort peers Pi, Pj; each cohort matches the entries of its index list above min-k / m against the filter]
KLEE – Candidate Retrieval
[Figure: cohort peers Pi, Pj return the candidates that match the candidate filter matrix to coordinator peer P0 (current top-k, candidate set); the scan of the index list ends at an early stopping point]
Literature
Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648
Part II – Social Search
Motivation
• People connected through a network
  • People create links to other people
  • Links can express friendship, recommendations, etc.
  • Different graph structures appear
• Sharing interests
  • Enables users to find others who share common interests
  • Similar users can provide relevant content
• Users and content spread at different sites
  • Distributed nature and continuously increasing size call for peer-to-peer approaches
Outline of the Second Part
• Link Analysis: The Web as a Graph
  • PageRank
  • Distributed Approaches: BlockRank, Local PageRank + ServerRank, Adaptive OPIC, JXP
• Identifying common interests – Semantic Overlay Networks
  • Crespo and Garcia-Molina
  • pSearch
  • p2pDating
• Social Networks – A new paradigm
  • What people share
  • Social graphs
  • Links, tags, users analysis
Links are everywhere…
…connecting Web pages
[Figure: link graph among www.openp2p.com/..., www.searchtools.com, www.searchengines.com, www.searchengineguide.com, www.searchengineshowdown.com and searchenginewatch.com]
Links are everywhere…
…connecting people
Example of a Flickr friends network
Links are everywhere…
…connecting products
Link Analysis
• The set of nodes/pages (e.g., web pages, people, products, etc.) and the links connecting them define a graph
[Figure: link graph among www.openp2p.com/..., www.searchtools.com, www.searchengines.com, www.searchengineguide.com, www.searchengineshowdown.com and searchenginewatch.com]
Link Analysis
• At the end we have something like this…
• Lots of useful information can be obtained from the analysis of such graphs
Adjacency Matrix
• Matrix representation of graphs
• Given a graph G with n nodes, its adjacency matrix A is n×n and
  a_ij = 1, if there is a link from node i to node j
  a_ij = 0, otherwise
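The definition translates directly into code. The following is a minimal sketch; the function name `adjacency_matrix` and the three-node example graph are made up for illustration and do not come from the slides.

```python
def adjacency_matrix(n, links):
    """Return the n x n adjacency matrix for a list of directed links (i, j)."""
    a = [[0] * n for _ in range(n)]
    for i, j in links:
        a[i][j] = 1  # a_ij = 1 if there is a link from node i to node j
    return a

# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 2 -> 0
A = adjacency_matrix(3, [(0, 1), (0, 2), (2, 0)])
out_degree = [sum(row) for row in A]        # row sums: number of outlinks
in_degree = [sum(col) for col in zip(*A)]   # column sums: number of inlinks
```

Row sums give the outdegree of each node and column sums the indegree, which is what link analysis methods such as PageRank build on.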
PageRank – Exploring the Wisdom of Crowds
• Measures the relative importance of pages on the graph
• Importance of a page depends on the importance of the pages that point to it
• Random Surfer Model: once in a page, the surfer chooses to follow one of the outlinks with prob. α, or to jump to a random page with prob. (1 − α)
• PR: probability of being at a certain page after a sufficient number of jumps
S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.
PageRank – Formal Definition:
$$PR(q) = \epsilon \times \sum_{p \mid p \to q} \frac{PR(p)}{out(p)} + (1 - \epsilon) \times \frac{1}{N}$$
N → Total number of pages;
PR(p) → PageRank of page p;
out(p) → Outdegree of p
ε → Probability of following an outlink (a random jump occurs with probability 1 − ε)
• Can be computed using the power iteration method
• In practice, more efficient versions can be used
• Google is believed to use it on the Web graph, combined with other metrics, to rank its search results
PageRank – Matrix Notation
$$A = \epsilon P^{T} + (1 - \epsilon) E$$
A → Matrix containing the transition probabilities, where P_ij = 1/out(i) if there is a link from i to j, 0 otherwise; E is the random jumps matrix
$$x^{(k)} = A^{k} x^{(0)}$$
x^(k) → Probability distribution vector at time k; x^(0) is the starting vector
$$PageRank = \lim_{k \to \infty} x^{(k)}$$
PageRank → Stationary distribution of the Markov chain described by A, i.e., the principal eigenvector of A
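The power iteration behind the matrix notation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tutorial's code: E is taken to be the uniform matrix (all entries 1/N), ε is read as the probability of following a link, pages without outlinks spread their score uniformly, and the function name `pagerank` and the example graph are hypothetical.

```python
def pagerank(links, n, eps=0.85, iterations=100):
    """Power iteration x(k+1) = A x(k) with A = eps * P^T + (1 - eps) * E.

    links: list of directed edges (i, j) over nodes 0..n-1.
    eps:   probability of following an outlink (assumed reading of epsilon).
    """
    out = [0] * n
    for i, _ in links:
        out[i] += 1
    x = [1.0 / n] * n  # starting vector x(0): uniform distribution
    for _ in range(iterations):
        # Assumption: score of pages without outlinks is spread uniformly.
        dangling = sum(x[i] for i in range(n) if out[i] == 0)
        nxt = [(1.0 - eps) / n + eps * dangling / n] * n
        for i, j in links:
            nxt[j] += eps * x[i] / out[i]  # P_ij = 1 / out(i)
        x = nxt
    return x

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; node 3 is isolated.
scores = pagerank([(0, 1), (1, 2), (2, 0), (0, 2)], n=4)
```

The vector stays a probability distribution in every iteration, so the scores can be compared directly across pages. Production implementations typically stop on a convergence threshold instead of a fixed number of iterations.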
Going Distributed
• PageRank in principle needs the whole graph at one place
• Shortcomings:
  • Not scalable for huge graphs, like the Web
  • Slow update: computing PageRank on such a huge graph can take weeks
  • Not suitable for different network architectures (e.g., P2P)
• Distributed approaches, where the graph is partitioned, are clearly needed
• Some distributed approaches (more details on the next slides):
  • Local PageRank + ServerRank (Wang et al.)
  • BlockRank (Kamvar et al.)
  • JXP (Parreira et al.)
The “Block Structure”
• Most links are between web pages inside the same host
[Figure: adjacency matrix whose nonzero entries cluster into blocks, labeled “Pages from Host A” and “Pages from Host B”]
• Block structure can be exploited for speeding up and/or distributing the PR computation
BlockRank
• PageRank in three steps:
  1. Computes “local PageRanks” of pages for each host, by considering only intra-host links
  2. Computes the importance of each host, using the local PR values and the inter-host links
  3. Combines the previous values to create the starting vector for the standard PR algorithm
• Speeds up computation
• Step 1 can be parallelized
• Still needs the whole matrix for step 3
S. Kamvar, T. Haveliwala, C. Manning & G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.
Going Distributed… Local PR + ServerRank
• Similar to BlockRank
  • Local PR: PR computed inside each server using intra-server links
  • ServerRank: PR computed on the server graph using inter-server links
• Server graph does not need to be materialized; computation is done by exchanging messages among servers
• Local PR and ServerRank are combined to approximate the true PR of a page
• Values can be further refined by using Local PR info in the ServerRank computation and vice versa
• Server partition can be a limitation…
Y. Wang & D. J. DeWitt. Computing pagerank in a distributed internet search system. In VLDB, 2004.
Partition at “peer level”
• In P2P networks, server partition is not suitable
[Figure: global graph split among Peer A, Peer B and Peer C]
Partition at “peer level”
• Every peer crawls Web fragments at its discretion
  • Peers have only local (incomplete) information
  • Pages might link to or be linked from pages at other peers
  • Overlaps between peers’ graphs may occur
  • Peers are a priori unaware of other peers’ contents
[Figure: local graphs of Peer A, Peer B and Peer C]
Adaptive OPIC
• OPIC: Online Page Importance Computation
• Computes the importance of a page on-line, with few resources
• Algorithm:
  • Pages initially receive some cash
  • Pages are randomly visited
  • When a page is visited, its cash is distributed among the pages it points to
  • The importance of a given page is computed from the cash history of that page
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In WWW, 2003.
Adaptive OPIC – Example
• Small Web of 3 pages: Alice, Bob, George
• Alice has all the cash to start (importance is independent of the initial state)
• Cash-game history:
  • Alice received 600 (200+400): 40%
  • Bob received 600 (200+100+300): 40%
  • George received 300 (200+100): 20%
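The cash game is easy to simulate. The sketch below is only an illustration of the basic (non-adaptive) algorithm: the link structure among Alice, Bob and George is an assumption (the slide does not list the links), the cash starts evenly split rather than all at Alice, every page is assumed to have at least one outlink, and the function name `opic` is hypothetical.

```python
import random

def opic(out_links, steps=20000, seed=0):
    """Simulate the OPIC cash game on a graph {page: [pages it points to]}."""
    rng = random.Random(seed)
    pages = list(out_links)
    cash = {p: 1.0 / len(pages) for p in pages}  # pages initially receive some cash
    history = {p: 0.0 for p in pages}            # cash recorded at each visit
    for _ in range(steps):
        p = rng.choice(pages)                    # pages are randomly visited
        history[p] += cash[p]                    # remember the cash the page held
        share = cash[p] / len(out_links[p])
        cash[p] = 0.0
        for q in out_links[p]:                   # distribute cash to the successors
            cash[q] += share
    total = sum(history.values())
    return {p: history[p] / total for p in pages}  # importance estimate

# Assumed link structure (not given on the slide).
graph = {"Alice": ["Bob", "George"], "Bob": ["Alice"], "George": ["Bob"]}
importance = opic(graph)
```

For this assumed graph the estimates approach 0.4, 0.4 and 0.2 for Alice, Bob and George, which is the stationary distribution of a random walk on these links. The adaptive variant would additionally restrict the history to a recent time window.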
Adaptive OPIC
• No particular graph partition
• No need to store the link matrix
• Adapts to changes in the web graph by considering only the recent part of the cash history for each page
  • Time window: [now-T, now]
• High number of messages exchanged
• Does not handle the case where the same page is stored at more than one place
The JXP Algorithm
• Decentralized algorithm for computing global authority scores of pages in a P2P network
  • Runs locally at every peer
  • No coordinator, asynchronous
• Combines local PageRank computations + meetings between peers
• JXP scores converge to the true global PageRank scores
Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel and Gerhard Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network. The VLDB Journal, 2007.
The JXP Algorithm
• “World Node”:
  • Special node attached to the local graph at every peer
  • Compact representation of all other pages in the network
• “Special features”:
  • All links from local pages to external pages point to the World Node
  • Links from external pages that point to local pages (discovered during meetings) are represented at the World Node
  • Score and outdegree of these external pages are stored; World Node outgoing links are weighted to reflect the score mass given by the original link
  • Self-loop link to represent transitions among external pages
[Figure: local graph with the World Node W attached]
The JXP Algorithm
• Initialization step:
  • Local graph is extended by adding the world node
  • PageRank is computed in the extended graph → JXP scores
• Main algorithm (for every Pi in the network):
  • Select Pj to meet
  • Update world node
    • Add edges for pages in Pj that point to pages in Pi
    • If an edge already exists at the world node, the score of the source page is updated by taking the highest of both scores
  • Compute PageRank → JXP scores
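The "update world node" step can be sketched as a small merge routine. This is a hedged illustration only: the data layout (a dictionary keyed by edge, holding the source page's score and outdegree), the function name `update_world_node`, and all scores are assumptions, and the subsequent PageRank recomputation on the extended graph is left out.

```python
def update_world_node(world_edges, local_pages, links, scores, outdegrees):
    """Merge what peer Pi learns in a meeting with Pj into Pi's world node.

    world_edges: {(src, dst): (score, outdegree)} for external pages src
                 that point to local pages dst.
    links, scores, outdegrees: information received from Pj.
    """
    for src, dst in links:
        if src in local_pages or dst not in local_pages:
            continue  # keep only edges from external pages into the local graph
        score = scores[src]
        if (src, dst) in world_edges:
            # Edge already exists: take the highest of both scores.
            score = max(score, world_edges[(src, dst)][0])
        world_edges[(src, dst)] = (score, outdegrees[src])
    return world_edges

# Hypothetical meeting: a peer holding pages A..E learns about links from another peer.
local_pages = {"A", "B", "C", "D", "E"}
world = {("G", "C"): (0.10, 2), ("J", "E"): (0.05, 1)}
links_received = [("F", "A"), ("F", "E"), ("K", "E"), ("G", "C"), ("F", "G")]
scores_received = {"F": 0.20, "K": 0.02, "G": 0.15}
out_received = {"F": 3, "K": 1, "G": 2}
world = update_world_node(world, local_pages, links_received, scores_received, out_received)
```

After the call the world node holds the edges G → C, J → E, F → A, F → E and K → E; the link F → G is ignored because G is not a local page.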
The JXP Algorithm
[Figure: a meeting between Peer X and Peer Y.
 Peer X before the meeting: local pages A, B, C, D, E and world node W; links A → F, E → G; W node entries G → C, J → E.
 Peer Y: local pages E, F, G and world node W; links G → C, F → A, E → B; W node entries K → E, L → G.
 Subgraph relevant to Peer X: page F with links F → A, F → E, F → G; W node entry K → E.
 Peer X after the meeting: W node entries G → C, J → E, F → A, F → E, K → E.]
Theorem: “In a fair series of JXP meetings, the JXP scores of all nodes converge to the true global PR scores”
Locating parts of the Graph
• “Finding peers that share common interests”
• Many applications can benefit from it
• Distributed PR
  • In principle, peers need to send content only to the peers that contain their successors
  • Random messages guarantee that those peers will eventually be reached, but part of the messages will be “wasted”
[Figure: a meeting between Peer X and Peer Z.
 Peer X: local pages A, B, C, D, E and world node W; links A → F, E → G; W node entries G → C, J → E.
 Peer Z: local pages L, M, N and world node W; links N → S, M → R; W node entry P → M.
 The subgraph relevant to Peer X is empty, and Peer X is unchanged after the meeting.]
WASTED MEETING! We want to avoid it!
Locating parts of the Graph
• Query answering
  • Ideal: forward the query only to peers that are more likely to provide good answers to it
  • Query flooding is very expensive
  • Hash-based queries are not suitable for approximate queries
Locating parts of the Graph
• Locating “relevant” peers
  • Increase performance
  • Reduce traffic load
• Idea: group peers according to the semantics of their content and place them into different overlay networks
Semantic Overlay Networks
• Partition the P2P network into several thematic networks
• Peers with similar or beneficial/complementary content are “clustered” together
  • Queries for some content are forwarded only to peers with such content
  • Flooding in smaller networks with a smaller TTL (or more results with the same TTL)
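To make the TTL argument concrete, here is a minimal sketch of TTL-limited flooding that counts messages. It is a simplification, not the Gnutella protocol: peers forward the query to all their neighbors (including the sender), duplicates are simply not forwarded again, and the function name `flood` and the example overlay are hypothetical.

```python
from collections import deque

def flood(neighbors, start, ttl):
    """Return the set of peers reached and the number of messages sent."""
    reached = {start}
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, t = queue.popleft()
        if t == 0:
            continue                      # TTL exhausted: do not forward further
        for nb in neighbors[peer]:
            messages += 1                 # every forwarded copy is a message
            if nb not in reached:
                reached.add(nb)
                queue.append((nb, t - 1))
    return reached, messages

# Hypothetical overlay: a chain of four peers 0 - 1 - 2 - 3.
overlay = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reached, messages = flood(overlay, start=0, ttl=2)
```

Running the same routine on a small thematic overlay instead of the full network shows the trade-off stated above: the same TTL covers a larger fraction of the relevant peers, or a smaller TTL suffices for the same coverage.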
Overlay Networks: Random vs. Semantic
Random
• Peers connect to a small set of random peers
• Queries are flooded through the network
• Peers with unrelated content receive the query
• Low performance: high number of messages
• Low recall if only few peers are contacted
Semantic
• Peers connect to peers with related content → cluster of peers
• Peers identify the query’s topic and forward it only to the set of peers on that topic
• Messages to peers with unrelated content are avoided
• Better performance: smaller number of messages
• High recall by asking only few peers
When creating SONs…
• Two main things to consider
  • Node partitioning
  • Clustering criteria
• Node partitioning: when does a peer belong to SON A?
  • When it contains a doc of type A
  • When it contains more than x docs of type A
  • Fewer peers per SON → more results sooner
  • Fewer SONs per peer → fewer connections
• Clustering criteria: clustering must provide
  • Load balance
    • Each category has a similar number of nodes
    • Each node belongs to a small number of categories
  • An easy and accurate way to classify a document
Crespo and Garcia-Molina
• Uses a classification hierarchy to form the overlay networks
• Documents and queries are classified into one or more concepts
• Queries are forwarded to peers in the super/sub concepts
A. Crespo and H. Garcia-Molina. Semantic Overlay Networks for P2P Systems. Technical report, Stanford University, January 2003.
Crespo and Garcia-Molina
• Reported results show a significant improvement in the number of messages
  • Music file sharing scenario: to get half the documents that match a query:
    • SONs: 461 msgs
    • Gnutella: 1731 msgs
• SON links are “logical”: two peers that are connected in a SON can actually be many hops away from each other
• The requirement that the hierarchy and the classification algorithm are shared among all nodes might be a problem
pSearch
• Semantic overlay on top of Content-Addressable Networks (CANs)
• Latent Semantic Indexing (LSI) is used to generate a semantic vector for each document
  • Semantic vectors are used as keys to store document indices in the CAN
  • Indices close in semantics are stored close in the overlay
• Two types of operations
  • Publish document indices
  • Process queries
Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.
pSearch Key Idea
[Figure: a doc and a query placed as points in the semantic space, which is partitioned into zones A to I]
Background: Content-Addressable Network
[Figure: Cartesian space partitioned into zones A, B, C, D, E]
• Partition Cartesian space into zones
• Each zone is assigned to a computer
• Neighboring zones are routing neighbors
• An object key is a point in the space
• Object lookup is done through routing
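A toy version of CAN lookup illustrates how routing between neighboring zones reaches the zone that owns a key. This is a strong simplification and only a sketch: real CANs have irregular zones that split and merge as nodes join and leave, whereas here the unit square is cut into a fixed k × k grid, and the function names `zone_of` and `route` are hypothetical.

```python
def zone_of(point, k):
    """Zone (col, row) of a k x k grid over the unit square that contains the point."""
    x, y = point
    return (min(int(x * k), k - 1), min(int(y * k), k - 1))

def route(start_zone, key, k):
    """Greedy routing: hop to a neighboring zone that is closer to the key's zone."""
    target = zone_of(key, k)
    cur = start_zone
    path = [cur]
    while cur != target:
        cx, cy = cur
        if cx != target[0]:
            cx += 1 if target[0] > cx else -1   # move along the x axis first
        else:
            cy += 1 if target[1] > cy else -1   # then along the y axis
        cur = (cx, cy)
        path.append(cur)
    return path

# The object key (0.9, 0.5) is a point in the space; the lookup starts at zone (0, 0).
path = route((0, 0), key=(0.9, 0.5), k=3)
```

Each hop only needs knowledge of the adjacent zones, which is why every node keeps just its routing neighbors.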
Background: Vector Space Model
• Term vectors represent documents and queries
  • Elements correspond to the importance of a term in the document or query
• Statistical computation of vector elements
  • Term frequency * inverse document frequency
• Ranking of retrieved documents
  • Similarity between document vector and query vector
Background: Vector Space Model
A: “books on computer networks”
B: “network routing in P2P networks”
Q: “P2P network”

vocabulary   Va     Vb     Vq
computer     0.5    0      0
network      0.5    0.5    0.5
P2P          0      0.25   0.5
routing      0      0.25   0

Similarity to Q: A = 0.25, B = 0.375
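The two similarity values in the example are the dot products of the query vector with each document vector. A short sketch reproduces them; the helper name `dot` is made up for illustration, and the vectors are copied from the example.

```python
vocabulary = ["computer", "network", "P2P", "routing"]
Va = [0.5, 0.5, 0.0, 0.0]    # A: "books on computer networks"
Vb = [0.0, 0.5, 0.25, 0.25]  # B: "network routing in P2P networks"
Vq = [0.0, 0.5, 0.5, 0.0]    # Q: "P2P network"

def dot(u, v):
    """Dot product of two term vectors over the same vocabulary."""
    return sum(a * b for a, b in zip(u, v))

sim_a = dot(Vq, Va)  # 0.5 * 0.5 = 0.25
sim_b = dot(Vq, Vb)  # 0.5 * 0.5 + 0.5 * 0.25 = 0.375

ranking = sorted([("A", sim_a), ("B", sim_b)], key=lambda t: t[1], reverse=True)
```

Document B ranks above A because it matches both query terms, while A only matches "network".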
Background: Latent Semantic Indexing
• The dimension of the document vectors has to match the dimension of the CAN
• Latent Semantic Indexing uses Singular Value Decomposition (SVD)
  • Maps a high-dimensional term vector to a low-dimensional semantic vector
  • Elements correspond to the importance of an abstract concept in the document/query
• Also helps to overcome the synonym problem (e.g., a user looks for “car” and doesn’t find a document about “automobile”)
Background: Latent Semantic Indexing
[Figure: SVD maps the term vectors Va, Vb, … of the documents to semantic vectors V’a, V’b, …]
• SVD: singular value decomposition
  • Reduce dimensionality
  • Suppress noise
  • Discover word semantics
    • Car <-> Automobile
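For reference, the mapping from term vectors to semantic vectors can be written as a truncated SVD. This is the standard LSI formulation rather than something spelled out on the slides; the symbols A (term-document matrix whose columns are the term vectors), k (target dimension, here the dimension of the CAN) and the fold-in formula for individual vectors are assumptions introduced for illustration.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
V'_a = \Sigma_k^{-1} U_k^{T} V_a, \qquad
V'_q = \Sigma_k^{-1} U_k^{T} V_q
```

Only the k largest singular values are kept, which reduces dimensionality and suppresses noise; documents and queries are projected with the same matrices so that they remain comparable in the semantic space.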
pSearch Basic Algorithm: Steps
1. Receive a new document A: generate a semantic vector Va, store the key in the index
2. Receive a new query Q: generate a semantic vector Vq, route the query in the overlay
3. The query is flooded to nodes within a radius r, where r is determined by a similarity threshold or by the number of wanted documents
4. All receiving nodes do a local search and report references to the best matching documents
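Steps 3 and 4 amount to a search restricted to a region around the query point. The sketch below collapses all receiving nodes into one index and uses Euclidean distance as the closeness measure; both choices, the function name `local_search`, and the example vectors are assumptions made for illustration, not pSearch's actual implementation.

```python
import math

def local_search(index, vq, r):
    """Return ids of documents whose semantic vector lies within radius r of vq."""
    hits = []
    for doc_id, v in index.items():
        d = math.dist(v, vq)       # distance in the semantic space
        if d <= r:
            hits.append((d, doc_id))
    return [doc_id for _, doc_id in sorted(hits)]  # best matches first

# Hypothetical two-dimensional semantic vectors.
index = {"doc1": (0.2, 0.8), "doc2": (0.9, 0.1), "doc3": (0.4, 0.7)}
result = local_search(index, vq=(0.25, 0.75), r=0.2)
```

A larger radius returns more documents at the cost of contacting more nodes, which is the trade-off behind choosing r from a similarity threshold or a wanted number of results.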
pSearch Illustration
[Figure: a query and a doc placed in the semantic space, with the search region for the query; the numbers 1 to 4 label parts of the figure]
p2pDating
• Start with a randomly connected network
• Peers meet other peers they do not know (“blind dates”)
  • If a peer “likes” another, it will remember it as a “friend”
  • A remembers B → abstract link A → B
  • Directed links preserve peers’ autonomy
• SONs dynamically evolve from the meeting process
J. X. Parreira et al. p2pDating: Real Life Inspired Semantic Overlay Networks for Web Search. Information Processing & Management [43], 643-664
p2pDating
• Finding new friends
  • Random meetings (blind dates)
  • Meet friends of friends
• If A and B are friends, it is very likely that B’s friends are friends of A as well.
[Figure: peer A, its friend B, and B’s friends]
Defining Good Friends
• Criteria for defining a good friend: a combination of different measures
  • History: credits for good behavior in the past
    • Response time, query result precision, etc.
  • Collection similarity
  • Collection overlap
    • Different ways of estimating the overlap between two collections
  • Number of links between peers
  • Etc.
• Peers might have more than one list of friends
  • E.g., according to different criteria
Going Social…
• Before:
  • Only few content producers (e.g., companies, universities)
  • Analysis was done using the content itself plus a few implicit recommendations (links)
  • Very little information about the content consumers (mainly through query logs)
• Nowadays:
  • New technologies to facilitate content sharing
  • Content consumers are now also content producers and content describers (e.g., explicit recommendations, tags, etc.)
  • More and more crowd wisdom that can be harvested
Social Networks
• A social structure made of nodes (which are generally individuals or organizations) that are tied by one or more specific types of relations, such as values, visions, ideas, friends, conflict, web links, etc.
• Social networks have been studied for over a century
Social Network Services
• Enable the creation of online social networks for communities of people who share interests and activities, or who are interested in exploring the interests and activities of others
• Online communities offer an easy way for users to publish and share their content.
Social Networking Growth
• Several social networking sites have experienced dramatic growth during the past year.

Total Unique Visitors (Mio.)
Social Networking Site   Jun-06   Jun-07   % Change
MySpace                   66.41   114.15         72
Facebook                  14.08    52.17        270
Hi5                       18.10    28.17         56
Friendster                14.92    24.68         65
Orkut                     13.59    24.12         78
Bebo                       6.69    18.20        172
Tagged                     1.51    13.17        774

Worldwide Growth of Selected Social Networking Sites. June 2007 vs. June 2006, Users Age 15+, Source: comScore
What people share…
Social Networks
• Besides sharing content, a user can…
  • …describe documents using tags
  • …maintain a list of friends
  • …make comments on other users’ content, exchange opinions, discover users with a similar profile
• In contrast to the Web graph, in social graphs users are part of the model
Social Content Graph
Sihem Amer-Yahia, Michael Benedikt, Philip Bohannon: Challenges in Searching Online Communities. IEEE Data Eng. Bull. 30(2): 23-31 (2007)
Social Graphs
• Other models are also possible
  • Directed vs. undirected edges
  • Etc.
[Figure: graph connecting users, tags and docs]
• Standard IR techniques for Web retrieval need to be adapted to work on social networks; a lot of current research is dedicated to this area
Social Networks
• The Wisdom of Crowds: Beyond PR
  • Spectral analysis of various graphs
    • E.g., SocialPageRank, FolkRank
  • Tag semantic analysis
    • Discovering semantics from tag co-occurrence
    • E.g., SocialSimRank
  • Distributed view
    • Exploiting social relations to enhance search
    • E.g., PeerSpective
Link Analysis in Social Networks
• SocialPageRank:
  • High-quality web pages are usually popularly annotated, and popular web pages, up-to-date web users and hot social annotations can mutually enhance each other.
  • Let M_UT, M_TD, M_DU be the matrices corresponding to the relations Users×Tags, Tags×Docs, Docs×Users
  • Compute iteratively:
$$r'_D = M_{DU} \times r_U \qquad r'_T = M_{TD} \times r_D \qquad r'_U = M_{UT} \times r_T$$
[Figure: cycle connecting Documents, Users and Tags, with labels a, b, c]
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
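The iterative computation can be sketched as repeated matrix-vector products. This is only one possible reading of the updates on the slide (tag scores from document scores via M_TD, user scores from tag scores via M_UT, document scores from user scores via M_DU) and not the full algorithm from the cited paper; the normalization after each step, the function names and the toy matrices are assumptions.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) with vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    """Scale a non-negative vector so that its entries sum to 1."""
    s = sum(v)
    return [x / s for x in v]

def social_pagerank(m_ut, m_td, m_du, iterations=50):
    n_docs = len(m_du)
    r_d = [1.0 / n_docs] * n_docs                # start from uniform doc scores
    for _ in range(iterations):
        r_t = normalize(matvec(m_td, r_d))       # tags are scored from docs
        r_u = normalize(matvec(m_ut, r_t))       # users are scored from tags
        r_d = normalize(matvec(m_du, r_u))       # docs are scored from users
    return r_d, r_t, r_u

# Toy association counts for 2 users, 2 tags and 3 docs (made up for illustration).
m_ut = [[2, 1], [1, 1]]          # users x tags
m_td = [[2, 1, 0], [1, 0, 1]]    # tags x docs
m_du = [[2, 1], [1, 0], [0, 1]]  # docs x users
r_d, r_t, r_u = social_pagerank(m_ut, m_td, m_du)
```

In this toy example the first document is annotated by both users and therefore ends up with the highest score.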
Link Analysis in Social Networks
• FolkRank
  • Define graph G as the union of the graphs Users×Tags, Tags×Docs, Docs×Users
  • Assume each user has a personal preference vector p
  • Compute iteratively:
$$r_D = \alpha \, r_D + \beta \, M_G \times r_D + \gamma \, p$$
  • FolkRank vector of docs is:
$$r_D^{\gamma > 0} - r_D^{\gamma = 0}$$
Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme: Information Retrieval in Folksonomies: Search and Ranking. ESWC 2006: 411-426
Tag Similarity
• SocialSimRank
  • Idea: Similar annotations (tags) are usually assigned to similar web pages by users with common interests.
  • sim(t1, t2) ~ aggr {sim(d1, d2) | (t1, d1), (t2, d2) ∈ Tagging}
  • sim(d1, d2) ~ aggr {sim(t1, t2) | (t1, d1), (t2, d2) ∈ Tagging}
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
Exploring friendship connections
• PeerSpective: users can query their friends’ viewed pages
  • HTTP proxies on users’ computers index all browsed content
  • When a Google search is performed, the query is also sent to the other proxies in parallel
Alan Mislove, Krishna P. Gummadi, and Peter Druschel. Exploiting Social Networks for Internet Search. HotNets, 2006.
Social Networks
• New paradigm of publishing and searching content
• Rich data
  • Different link structures
  • User input for free!
• Relatively recent topic: lots of research opportunities
• The list of works mentioned is by no means complete; there is still a lot to do
• Since we are talking about Web 2.0…
  http://p2pinformationsearch.blogspot.com/