
Peer-to-Peer Information Search

Sebastian Michel

Ecole Polytechnique Fédérale Lausanne

Lausanne - Switzerland

Josiane Xavier Parreira

Max-Planck Institute for Informatics

Saarbrücken - Germany

Peer-to-Peer Information Search - SBBD 2007 Tutorial

Outline of Part 1
- Introduction to P2P Systems
- Distributed Hashtables & Range Queries
- Peer-to-Peer IR (Query Routing, Result Merging)
- Overlapping Sources / Multi-key Statistics
- Top-k Query Processing
- Probabilistic Pruning
- Distributed Top-k

P2P Systems

- Known from Napster and others
- Sharing of mostly illegal content (mp3, movies): P2P = Pirate-to-Pirate??
- New kind of network organization; no client/server anymore
- Basic ideas:
  - Each peer connects to a few other peers
  - All peers together form powerful networks
- Potential benefits:
  - No single point of failure
  - Load is spread across multiple peers (resilient to failures and dynamics)

Peer: “one that is of equal standing with another”

(source: Merriam-Webster Online Dictionary )

Napster

- Developed in 1998
- First P2P file-sharing system
- Central server (index)
- Client software sends information about users' contents to the server
- Users send queries to the server
- Server responds with the IPs of users that store matching files
- Peer-to-peer file sharing!

[Figure: peers publish file statistics to the central server; file download between peers]

Gnutella
- Protocol for distributed file sharing
- Started in 2000; in 2005: 1.81 million computers connected*
- Unstructured network
- Truly decentralized
- Uses message flooding during query execution
- Later: version with super nodes and query routing

* http://www.slyck.com/news.php?story=814

Gnutella Style
[Figure: the query "Paris Hilton?" is flooded through the network; hops are labeled TTL 2, TTL 1, TTL 0]
- Pros: no complex statistical bookkeeping
- Cons:
  - a lot of network traffic
  - some peers might not be reachable (TTL)

BitTorrent
- Idea: load sharing through file splitting
- A lot of (legal) software distributors offer software through BitTorrent
- Download information in a small .torrent file
- One tracker node per file (specified in the .torrent file)
- Incentives: "tit-for-tat". Each peer remembers collaborative peers; different priorities

[Figure: a file split into segments 1 to 5; the client requests a random peer list from the tracker node and requests segments from peers]

Literature
- Book: Peer-to-Peer: Harnessing the Power of Disruptive Technologies by Andy Oram. O'Reilly Media, Inc.

Overlay Networks
- On top of existing networks
- Different ways to build an overlay network:
  - structured
  - unstructured
  - hybrid

Self* Properties (Promises)
- Self-Organizing: evolves, grows, ... without being guided/managed
- Self-Optimizing
- Self-Configuring
- Self-Healing:
  - Self-Restoration
  - Self-Diagnostics
- Self-Protecting


Distributed Hash Tables

Hash-Table: given a key, return the bucket id. Based on a hash function (like SHA-1)

Now: Distributed. For a given key, return the id of the peer currently responsible for the key.

Challenge: Purely distributed protocols that cope with node failures, departures, arrivals.

No central manager.

Chord
- uses an m-bit identifier space ordered in a mod-2^m circle, the Chord ring;
- maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1;
- uses consistent hashing: an object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to, or follows, id on the Chord ring.

Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p.

[Figure: Chord ring with peers p32, p38 and keys k30, k38]

Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
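The successor rule can be sketched in a few lines of Python. This is only an illustration, not Chord's actual implementation: the helper names `chord_id` and `successor`, the choice m = 6, and the peer ids are assumptions made for the example.

```python
import hashlib
from bisect import bisect_left

M = 6  # number of identifier bits (assumption); the ring has 2**M positions


def chord_id(name, m=M):
    """Map a name (file name, IP address, ...) to an m-bit identifier via SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(ident, peers):
    """Return the first peer id that is equal to or follows ident on the ring.

    peers must be a sorted list of peer identifiers.
    """
    idx = bisect_left(peers, ident)
    return peers[idx % len(peers)]  # wrap around past the largest id


# Hypothetical ring; key 30 is placed on peer 32, key 38 on peer 38.
peers = [1, 8, 21, 32, 38, 42, 51, 56]
print(successor(30, peers), successor(38, peers), successor(60, peers))
```

The modulo on the index handles the wrap-around case that the inequality k ≤ p alone does not cover: a key larger than every peer id is assigned to the peer with the smallest id.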

Chord: Finger Tables
- Peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance.
- The finger table of peer n has one entry for each of n + 1, n + 2, n + 4, n + 8, n + 16, n + 32.

[Figure: Chord ring with peers p8, p21, p32, p38, p42, p51; finger tables of p8, p42 and p51 with entries for +1, +2, +4, +8, +16, +32; Lookup(54)]
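Finger-table routing can be sketched as follows. The ring below (ten peers, m = 6) is an assumed example and may differ from the ring in the figure; `finger_table`, `between` and `lookup` are illustrative names, and a real Chord node would only know its own fingers instead of the global peer list.

```python
M = 6                      # identifier bits (assumption); ring size 2**M = 64
SIZE = 2 ** M
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # hypothetical ring


def successor(ident, peers=PEERS):
    """First peer whose id is equal to or follows ident on the ring."""
    for p in sorted(peers):
        if p >= ident:
            return p
    return min(peers)       # wrap around


def finger_table(n, peers=PEERS):
    """Entry i points to succ(n + 2**i): logarithmically increasing distance."""
    return [successor((n + 2 ** i) % SIZE, peers) for i in range(M)]


def between(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), or (a, b] if requested."""
    dx, db = (x - a) % SIZE, (b - a) % SIZE
    return 0 < dx < db or (inclusive_right and dx == db and dx != 0)


def lookup(start, key, peers=PEERS):
    """Route a lookup for key starting at peer start.

    Returns (responsible peer, list of peers visited).
    """
    node, path = start, [start]
    while True:
        if key == node:
            return node, path
        succ = successor((node + 1) % SIZE, peers)
        if succ == node or between(key, node, succ, inclusive_right=True):
            return succ, path
        # Forward to the farthest finger that still precedes the key.
        node = next((f for f in reversed(finger_table(node, peers))
                     if between(f, node, key)), succ)
        path.append(node)


print(finger_table(8))   # [14, 14, 14, 21, 32, 42]
print(lookup(8, 54))     # (56, [8, 42, 51])
```

On this ring the lookup for key 54 starting at p8 visits p42 and p51 before it reaches the responsible peer p56; each hop at least halves the remaining distance on the ring.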

Node Joins in Chord
[Figure: new peer p42 joins the ring: lookup(42), p42 sets its succ pointer, keys are moved, p38 updates its succ pointer]

init_finger_tables()
successor = node.find_successor()
predecessor = successor.predecessor
predecessor.successor = new

And others ...
- P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
- CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. 161-172
- Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
- Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz. Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference, June 2004.

Range queries
- A range query [v1, v2] searches for those peers which store data with key value k ∈ [v1, v2].
- DHTs efficiently support only exact-match queries.
- The naïve approach to processing range queries in DHTs is to query each value of the range individually.
- It is HIGHLY EXPENSIVE!

DHTs and Range Queries
- Order-preserving hash function: $f(k) = \frac{k - min}{max - min} \cdot 2^m$
  - usually leads to skewed distributions
- There are two main solutions to cope with load imbalances, i.e., to perform load balancing:
  - transferring load, or
  - replicating data

DHT and Range Queries (2)
- Existing approaches to deal with range queries:
  - Locality-preserving hashing
    - OP-Chord: Triantafillou et al. (2003)
    - Skip Graphs: Aspnes et al. (2004)
  - Hashing ranges of values instead of each value individually
    - CAN-based: Andrzejak et al. (2002), Sahin et al. (2004)
- Another problem in that context: access load imbalances
  - One possible solution: transferring "hot data" to deal with those load imbalances
  - However, data transfer does not solve access load imbalances under skewed access (query) distributions

HotRoD: replicating hot arcs (Theoni Pitoura et al., EDBT 2006)
- A peer is "hot" (or overloaded) when its load exceeds the upper limit of its resource capacity.
- An arc of peers is "hot" when at least one of its peers is hot.
- Replicate ranges of values.

Efficient Load Balancing
[Figure: Lorenz curves for access load distribution (r = 200, θ = 0.8). x-axis: cumulative percentage of peers, 0% to 100%. Curves: HotRoD, OP-Chord, Chord, and the line of uniformity. Marked points: (97%; 90%), (97%; 40%), (97%; 73%)]

Building a P2P Search Engine (Peer-to-Peer Information Retrieval)

"Distributed Google"

P2P approach is best suited:
- large number of peers
- exploit mostly idle resources
- intellectual input of the user community
- scalable and self-organizing

Information Retrieval Basics
[Figure: a document and its terms, each with its # of terms (term frequency)]

Information Retrieval Basics (2)
- Index lists with (DocId: tf*idf) entries, sorted by score
- B+ tree on terms
- Top-k query processing: find the k documents with the highest total score
- Query execution: usually using some kind of threshold algorithm, e.g., Fagin's algorithm TA or a variant without random accesses:
  - sequential scans over the index lists (round-robin)
  - (random accesses to fetch missing scores)
  - aggregate scores
  - stop when the threshold is reached

[Figure: B+ tree on terms pointing to index lists of (DocId: score) entries, e.g., d17: 0.3, d44: 0.4, ..., d52: 0.1]

Going distributed: Index Organization
- Peer index: every peer has its own collection (full documents); distributed index = index of peer descriptions
- Document index: index lists of (DocId: score) entries spread over Peer 1, Peer 2, Peer 3

[Figure: document index with index lists distributed across Peer 1, Peer 2 and Peer 3; peer index over Peer 1 and Peer 2]

(Full) Document Index
- Straightforward from the centralized document index
- Each peer is responsible for storing the index list for a subset of terms.
- Query routing: DHT lookups
- Query execution: distributed top-k [TPUT '04, KLEE '05]

[Figure: Chord ring with peers p1, p32, p38, p56]

Peer Index
- Each peer has its own local index (e.g., created by web crawls)
- Peers publish compact per-term descriptions about their index
- Query routing:
  1. DHT lookups
  2. Retrieve metadata
  3. Find most promising peers
- Query execution: send the complete query and merge the incoming results

Distributed Directory

Term   List of Peers
a      P1 P6 P4
b      P5 P3 P1 P6 ...

P2P Search with Minerva
- Based on a scalable, churn-resilient DHT with O(log n) key lookup
- Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc.
- Maintain semantic/social/statistical overlay network (SON)
- Exploit community behavior (bookmarks, links, tags, clicks, etc.)

[Figure: query peer P0 with local index X0, bookmarks B0, and peer ranking & statistics; the directory holds peer lists per term and per URL, e.g., term g: 13, 11, 45, ...; term a: 17, 11, 92, ...; term f: 43, 65, 92, ...; term c: 13, 92, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...]

Two major Problems
- Query Routing (aka database/collection/peer selection): task of finding "high quality" peers
- Result Merging: task of merging the obtained results into the final ranking

Overview articles:
- J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
- Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)

Query Routing
- Given a query Q = {term1, term2, ..., termN}: select the most promising peers
- Based on:
  - per-term per-peer statistics: document frequency, vocabulary size
  - plus normalization issues like collection frequency, avg vocabulary size
- Most popular: CORI, GlOSS, Decision Theoretic Framework (DTF)

Apply document ranking to resource ranking:

$$T(p_i, t_j) = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw}$$

$$I(t_j) = \frac{\log\left(\frac{C + 0.5}{cf}\right)}{\log(C + 1.0)}$$

$$s(p_i, t_j) = b + (1 - b) \cdot T(p_i, t_j) \cdot I(t_j)$$

C = # peers, df = document frequency, cf = collection frequency, cw = # distinct words per peer

[Figure: inference network connecting the resources p1, p2, ..., pj-1, pj to the terms t1, t2, t3, ..., tk, yielding S(p1, Q)]
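A minimal sketch of CORI-style peer scoring with the statistics listed above (df, cf, cw, C). Several points are assumptions for illustration: the default belief b = 0.4, reading cf as the number of peers that contain the term, summing the per-term scores over the query, and the names `cori_term_score` and `rank_peers`.

```python
import math


def cori_term_score(df, cw, avg_cw, cf, num_peers, b=0.4):
    """Score of one peer for one term.

    df:        document frequency of the term at the peer
    cw:        number of distinct words at the peer
    avg_cw:    average cw over all peers
    cf:        number of peers containing the term (must be >= 1)
    num_peers: C, the total number of peers
    b:         default belief (0.4 is an assumed default)
    """
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_peers + 0.5) / cf) / math.log(num_peers + 1.0)
    return b + (1 - b) * t * i


def rank_peers(query, stats, cf, avg_cw, num_peers):
    """Rank peers by the sum of their per-term scores (summation is an assumption).

    stats: {peer: {"cw": int, "df": {term: int}}}
    cf:    {term: int}
    """
    scores = {}
    for peer, s in stats.items():
        scores[peer] = sum(
            cori_term_score(s["df"].get(t, 0), s["cw"], avg_cw, cf[t], num_peers)
            for t in query)
    return sorted(scores, key=scores.get, reverse=True)
```

A peer without any document for a term still receives the default belief b for that term, so the ranking is driven by the T and I components.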

Literature
- J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
- Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
- CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
- GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
- Decision Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)

Result Merging
Problem: incomparable scores
- Different corpus statistics
  - the df component used in tf*idf scoring functions is not globally known
  - user with a lot of high-quality documents for term a → high df
  - non-expert user with some bad documents for term a → low df
- Different scoring functions
  - completely different functions
  - different parameters in the same function

Result Merging Approaches
- Score normalization by using global statistics
  - computation of global statistics is difficult (not obvious)
  - solution using gossip
- Score re-computation with the query initiator's local statistics
  - requires re-ranking and knowledge about document contents
- Score re-computation using query routing scores
  - routing score is available anyway

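$$C' = \frac{C - C_{min}}{C_{max} - C_{min}}, \qquad D' = \frac{D - D_{min}}{D_{max} - D_{min}}, \qquad D'' = \frac{D' + 0.4 \cdot D' \cdot C'}{1.4}$$

with D the score of a document and C the routing score of the peer that returned it.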

Global DF Estimation

- gdf (global document frequency) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible
- Hash sketches [Flajolet/Martin 1985]: duplicate-insensitive cardinality estimator for multisets
  - hash each multiset element x onto an m-bit bitvector and remember the least-significant 1 bit
  - rough intuition: the least-significant bit is set by half of the documents, the second bit by ¼ of the documents, ...
  - theory says: the most significant bit set is an estimator of log(n); n = # documents
  - higher accuracy: average multiple iid sketches

Global DF Estimation
- Hash sketches of different peers are collected at the directory peer
- Distributivity is free: $\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$, where $\rho$ denotes the position of the least-significant 1 bit
- gdf estimation algorithm:
  - each peer p posts a hash sketch for each (discriminative) term t to the directory
  - the directory peer for term t forms the union of the incoming hash sketches
  - when a peer needs to know gdf(t), it simply asks the directory peer for t
  - sliding-window techniques for dynamic adjustment

Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
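A minimal hash-sketch illustration in Python, under stated assumptions: a single 32-bit sketch stored as an integer bitmap, SHA-1 as the hash function, and the Flajolet-Martin correction factor 0.77351. The function names are hypothetical. A single sketch gives only a coarse estimate; as noted above, averaging several independent sketches improves accuracy.

```python
import hashlib

PHI = 0.77351   # correction factor from Flajolet and Martin's analysis
BITS = 32       # sketch length in bits (assumption)


def _hash(x, seed=0):
    """Hash an element to a BITS-bit integer (SHA-1 based, for illustration)."""
    data = f"{seed}:{x}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:BITS // 8], "big")


def rho(y):
    """Position of the least-significant 1 bit of y (BITS if y == 0)."""
    return BITS if y == 0 else (y & -y).bit_length() - 1


def sketch(items, seed=0):
    """Build the bitmap: duplicates set the same bit and do not change it."""
    bitmap = 0
    for x in items:
        bitmap |= 1 << rho(_hash(x, seed))
    return bitmap


def union(*sketches):
    """Sketch of the union of the underlying sets: a bitwise OR."""
    out = 0
    for s in sketches:
        out |= s
    return out


def estimate(bitmap):
    """Estimate the number of distinct elements from the lowest unset bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI
```

A directory peer would OR the sketches it receives for a term; because the OR of two sketches equals the sketch of the merged document sets, documents held by several peers are counted only once.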

Autonomous Peers: Overlapping Sources
[Figure: collections of the selected peers vs. #peers: {A}, {A,B}, {A,B,C}, {A,..,D} for 1 to 4 peers; overlap-aware routing strategy: {A}, {A,E}]

How? Enrich published statistics with overlap estimators.
- Interested in NOVELTY and QUALITY
- Iterative greedy selection process:
  - select the first peer based on quality
  - select the next peer by quality * novelty
- Suitable synopses for overlap estimation:
  - Bloom filter [Bloom 1970]
  - hash sketches [Flajolet & Martin 1985]
  - min-wise independent permutations [Broder 97]
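The greedy quality * novelty selection can be sketched as below. For readability the sketch computes novelty from the peers' exact document sets; in a P2P setting these sets are not available and novelty would be estimated from the published synopses. The novelty definition (fraction of a peer's documents not yet covered) and all names are assumptions for illustration.

```python
def select_peers(collections, quality, n):
    """Greedily pick up to n peers by quality * novelty.

    collections: {peer: set of document ids}  (exact sets, for illustration)
    quality:     {peer: quality score}
    """
    covered, chosen = set(), []
    remaining = set(collections)
    while remaining and len(chosen) < n:
        def benefit(peer):
            docs = collections[peer]
            # Fraction of this peer's documents that are not yet covered.
            novelty = len(docs - covered) / len(docs) if docs else 0.0
            return quality[peer] * novelty
        best = max(sorted(remaining), key=benefit)
        chosen.append(best)
        covered |= collections[best]
        remaining.remove(best)
    return chosen


collections = {"p1": {1, 2, 3, 4}, "p2": {1, 2, 3}, "p3": {5, 6}}
quality = {"p1": 4, "p2": 3, "p3": 2}
print(select_peers(collections, quality, 2))   # ['p1', 'p3']
```

For the first pick nothing is covered yet, so novelty is 1 and the choice is by quality alone. A ranking by quality only would pick p1 and p2, although p2 contributes no new documents.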

Min-Wise Independent Permutations [Broder 97]
- MIPs are an unbiased estimator of overlap: $P[\min\{h(x) \mid x \in A\} = \min\{h(y) \mid y \in B\}] = |A \cap B| / |A \cup B|$
- Compute N random permutations of the set of ids; the MIPs vector consists of the minima of the permutations.

set of ids:               17  21   3  12  24   8
h1(x) = 7x + 3 mod 51:    20  48  24  36  18   8   → min 8
h2(x) = 5x + 6 mod 51:    40   9  21  15  24  46   → min 9
...
hN(x) = 3x + 9 mod 51:     9  21  18  45  30  33   → min 9

Comparing MIPs(set1) with MIPs(set2): estimated overlap = 2/6
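The example above can be reproduced with a short Python sketch. The linear hash functions and the modulus 51 are taken from the slide; the function names and the second id set are assumptions for illustration.

```python
def mips(ids, perms, modulus=51):
    """MIPs vector: for each permutation h(x) = (a*x + b) mod modulus, keep the minimum."""
    return [min((a * x + b) % modulus for x in ids) for a, b in perms]


def resemblance(mips1, mips2):
    """Estimated |A ∩ B| / |A ∪ B|: fraction of positions where the minima agree."""
    matches = sum(1 for u, v in zip(mips1, mips2) if u == v)
    return matches / len(mips1)


PERMS = [(7, 3), (5, 6), (3, 9)]          # h1, h2, hN from the example
set1 = [17, 21, 3, 12, 24, 8]
set2 = [17, 8, 40, 2]                     # hypothetical second set

print(mips(set1, PERMS))                  # [8, 9, 9]
print(resemblance(mips(set1, PERMS), mips(set2, PERMS)))
```

Only the fixed-length MIPs vectors need to be published; the estimate becomes more reliable as the number of permutations N grows.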

Bloom Filter [Bloom 1970]
- bit array of size m
- k hash functions h_i: docId_space → {1,..,m}
- insert n docs by hashing the ids and setting the corresponding bits
- a document is in the Bloom filter if the corresponding bits are set
- probability of false positives (pfp)
- tradeoff: accuracy vs. efficiency

[Figure: bit array with positions 1 to 16, four of them set to 1]
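A minimal Bloom filter sketch in Python. The k hash functions are simulated by salting SHA-1 with the function index, which is an assumption for illustration, and positions run from 0 to m-1 rather than 1 to m. The false-positive formula is the standard approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, doc_id):
        """Bit positions of doc_id under the k (simulated) hash functions."""
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{doc_id}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits[pos] = True

    def __contains__(self, doc_id):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(doc_id))


def false_positive_probability(m, k, n):
    """Approximate pfp after inserting n ids into m bits with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1024, k=3)
for doc in range(50):
    bf.add(doc)
print(17 in bf, false_positive_probability(1024, 3, 50))
```

Increasing m lowers the false-positive probability at the cost of a larger synopsis, which is the accuracy vs. efficiency tradeoff named above.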

Multi-Key Statistics
- Solves an interesting problem: a peer with a lot of documents on American football and lots of documents about pop music has not a single document about American music
- This cannot be predicted using per-term statistics: quality(a and b) != quality(a) + quality(b)
- Obvious: recall that quality(a) ∼ df(a), e.g., df / (df + 50 + 150 * cw / avg_cw) ....

Multi-Key Statistics in P2P
- Motivation: estimated_quality(a and b) = quality(a) + quality(b) = df_a + df_b != df_(a and b)
- Impossible (infeasible) to consider all term pairs, triplets, quadruples, ...
- Query driven: analyze query logs at the directory peers
- Plus data-driven verification: P[Anna|Kournikova] = ......, P[Andy|Roddick] = ..., P[Berlin|Marathon] = ...
- No additional messages + shorter lists + highly accurate
- The whole process can be easily integrated into peer-level P2P IR

Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181

Single-term vs. multi-term P2P document indexing

Single-term indexing: long posting lists
PEER 1: term 1 → posting list 1; term 2 → posting list 2; ...
PEER N: ...; term M-1 → posting list M-1; term M → posting list M

Multi-term indexing (multi-term keys): short posting lists
PEER 1: key 11 → posting list 11; key 12 → posting list 12; ...; key 1i → posting list 1i
PEER N: key N1 → posting list N1; key N2 → posting list N2; ...; key Nj → posting list Nj

- make use of highly discriminative keys
- limit influence of overly long index lists
- consider term pairs (triplets ...) for shorter lists
- efficient query processing

Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686

Literature

Overlap Awareness:
- Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu: Identifying redundant search engines in a very large scale metasearch engine context. WIDM 2006: 51-58
- Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Improving collection selection with overlap awareness in P2P search engines. SIGIR 2005: 67-74
- Thomas Hernandez, Subbarao Kambhampati: Improving text collection selection with coverage and overlap statistics. WWW (Special interest tracks and posters) 2005: 1128-1129

Sketches:
- Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
- Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
- Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.

Multi-key statistics:
- Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
- Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
- Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181

For the IR people ....
- Why top-k? Cannot take a look at all matching documents. E.g., Google provides millions of documents about Britney Spears.
- Requires ranking (scoring):
  - in text retrieval for instance
  - + of course PageRank if you wish
- Remember part one: local query execution at each peer (peer-index model) AND truly distributed top-k processing in the full document index.

For the DB guys ...
Table with schema (id, attribute, value)

SELECT id, aggr(value)
FROM table
GROUP BY id
ORDER BY aggr(value) DESC
LIMIT k

For the networking guys ...
Network monitoring: find clients that cause high network traffic.

IP            Bytes in kB
192.168.1.7   31
192.168.1.3   23
192.168.1.4   12

IP            Bytes in kB
192.168.1.8   81
192.168.1.3   33
192.168.1.1   12

IP            Bytes in kB
192.168.1.4   53
192.168.1.3   21
192.168.1.1   9

IP            Bytes in kB
192.168.1.1   29
192.168.1.4   28
192.168.1.5   12
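As a point of reference, the centralized answer for the four traffic tables can be computed by summing per IP. The sketch below assumes sum as the aggregation function and that all tables are available in one place, which is exactly what a distributed top-k algorithm tries to avoid; the function name is hypothetical.

```python
from collections import Counter

# The four traffic tables from the example (kB per client IP).
TABLES = [
    {"192.168.1.7": 31, "192.168.1.3": 23, "192.168.1.4": 12},
    {"192.168.1.8": 81, "192.168.1.3": 33, "192.168.1.1": 12},
    {"192.168.1.4": 53, "192.168.1.3": 21, "192.168.1.1": 9},
    {"192.168.1.1": 29, "192.168.1.4": 28, "192.168.1.5": 12},
]


def top_k_clients(tables, k):
    """Aggregate the traffic per IP over all tables and return the k largest."""
    total = Counter()
    for table in tables:
        for ip, kb in table.items():
            total[ip] += kb
    return total.most_common(k)


print(top_k_clients(TABLES, 3))
# [('192.168.1.4', 93), ('192.168.1.8', 81), ('192.168.1.3', 77)]
```

Note that 192.168.1.4 is not the top entry in every table, and 192.168.1.3 never ranks first anywhere, yet both are in the overall top 3: local top entries alone do not determine the global result.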

Computational Model
- m lists with (itemId, score) pairs, sorted by score descending; one list per attribute (e.g., term)
- Aggregation function aggr()
- Monotonicity is important: for all items a, b:
  $\forall i: score_i(a) \le score_i(b) \Rightarrow aggr(a) \le aggr(b)$
  with $score_i(x)$ denoting the score of item x in list i
- Goal: return the top-k items w.r.t. their aggregated (overall) scores

How to process this?
- Most popular: family of threshold algorithms
  - Fagin, 1999
  - Nepal/Ramakrishna, 1999
  - Güntzer/Balke/Kießling, 2001
- Basic ideas:
  - keep an upper and a lower score bound for each document
  - lowerbound (or worstscore) = sum of the scores we have seen so far, assuming 0 for unseen dimensions
  - upperbound (or bestscore) = lowerbound + highest possible value for unseen dimensions
  - know what we've got already; know what to expect
  - stop if no further step can improve the current (i.e., final) ranking

Fagin's NRA

NRA(q, L):
  top-k := ∅; candidates := ∅; min-k := 0;
  scan all lists Li (i = 1..m) in parallel:
    consider item d at position pos_i in Li;
    E(d) := E(d) ∪ {i};
    high_i := s_i(q_i, d);
    worstscore(d) := aggr{s_ν(q_ν, d) | ν ∈ E(d)};
    bestscore(d) := aggr{aggr{s_ν(q_ν, d) | ν ∈ E(d)}, aggr{high_ν | ν ∉ E(d)}};
    if worstscore(d) > min-k then
      remove argmin_d'{worstscore(d') | d' ∈ top-k} from top-k;
      add d to top-k;
      min-k := min{worstscore(d') | d' ∈ top-k};
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d};
    threshold := max{bestscore(d') | d' ∈ candidates};
    if threshold ≤ min-k then exit;
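A compact Python sketch of the NRA idea, under these assumptions: aggr is the sum, items missing from a list score 0 there, and all lists are scanned in lockstep. It recomputes all bounds after every round instead of maintaining priority queues, so it illustrates the stopping rule rather than an efficient implementation; the name `nra` and the data layout are chosen for the example.

```python
def nra(lists, k):
    """Top-k with sorted accesses only (aggr = sum).

    lists: one list per query term, each [(item, score), ...] sorted by
           descending score.
    Returns (top_k, depth) with top_k = [(item, worstscore), ...]; the
    worstscore of a returned item may still be below its final score.
    """
    m = len(lists)
    seen = {}                                   # item -> {list index: score}
    top, worst, depth = [], {}, 0
    max_depth = max(len(lst) for lst in lists)
    while depth < max_depth:
        high = [0.0] * m                        # last score seen per list
        for i, lst in enumerate(lists):
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
                high[i] = score
        depth += 1
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            rivals = [best[d] for d in seen if d not in top]
            # sum(high) bounds the score of items not seen in any list yet.
            threshold = max(rivals + [sum(high)])
            if threshold <= min_k:
                break                           # nothing can enter the top-k
    return [(d, worst[d]) for d in top], depth


lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)],
]
print(nra(lists, 1))
```

For k = 1 the scan stops at depth 3 with d10 as the top item: its worstscore of about 2.1 is then at least as high as the bestscore of every other candidate.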

Query: q = (t1, t2, t3); data items: d1, …, dn; scores s(t1, d1) = 0.7, …, s(tm, d1) = 0.2

Index lists (sorted by score):
t1: d78 0.9, d23 0.8, d10 0.8, …
t2: d64 0.8, d23 0.6, d10 0.6, …
t3: d10 0.7, d78 0.5, d64 0.4, …

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0

STOP!

Top-k Search
- Observation: pruning is often overly conservative (deep scans, high memory for priority queue)
- Approximate top-k: "What is the probability that d qualifies for the top-k?"

[Figure: Evolution of a candidate's score: bestscore(d) and worstscore(d) plotted against the scan depth; annotation: drop d from the candidate queue]

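Safe Thresholding vs. Probabilistic Guarantees

NRA is based on the invariant

$$worstscore(d) = \sum_{i \in E(d)} s_i(d) \;\le\; s(d) \;\le\; \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i = bestscore(d)$$

Relaxed into a probabilistic threshold test

$$p(d) := P\Big[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \text{min-}k\Big]$$

Or equivalently, with $\delta(d) := \text{min-}k - \sum_{i \in E(d)} s_i(d)$:

$$p(d) := P\Big[\sum_{i \notin E(d)} S_i > \delta(d)\Big]$$

where $S_i$ is a random variable for the unknown score of d in list i.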

Expected Result Quality
- Missing relevant items: the probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
- For each candidate: p_miss ≤ ε
- $P[recall = r/k] = P[precision = r/k] = \binom{k}{r} (1 - p_{miss})^r \, p_{miss}^{(k-r)}$
- $E[precision] = E[recall] = \sum_{r=0..k} P[precision = r/k] \cdot r/k = (1 - p_{miss})$

Going distributed
- Key observations:
  - network traffic is crucial
  - number of round trips is crucial
- Straightforward application of TA/NRA?
  - expensive: huge number of round trips
  - even with batching: unpredictable performance
- Consider:
  - network consumption
  - per-peer load
  - latency (query response time): network, I/O, processing

[Figure: Where is the data? The index lists for t1, t2 and t3 from the example above]

Three Phase Uniform Threshold Algorithm (TPUT) [Cao and Wang, PODC 2004]

Exactly 3 phases:

1. Fetch the k best entries (d, sj) from each of P1 ... Pm and aggregate (∑j=1..m sj(d)) at the query initiator.
2. Ask each of P1 ... Pm for all entries with sj > min-k / m and aggregate the results at the query initiator; min-k is the score of the item currently at rank k.
3. Fetch missing scores for all candidates by random lookups at P1 ... Pm.

First distributed top-k algorithm with a fixed number of phases!
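The three phases can be simulated in a single process as follows. This is a simplified sketch under stated assumptions: aggr is the sum, items missing from a list score 0, the per-peer dictionaries stand in for remote random lookups, and the candidate pruning that the original algorithm performs after phase 2 is omitted. The function name `tput` is chosen for the example.

```python
def tput(lists, k):
    """Three-phase top-k over m peers (aggr = sum).

    lists[j] is peer j's index list, [(item, score), ...] sorted by
    descending score. Assumes phase 1 yields at least k distinct items.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]        # stands in for random lookups

    # Phase 1: top-k entries of every peer, aggregated at the initiator.
    partial = {}
    for lst in lists:
        for item, score in lst[:k]:
            partial[item] = partial.get(item, 0.0) + score
    min_k = sorted(partial.values(), reverse=True)[k - 1]
    threshold = min_k / m

    # Phase 2: every peer reports all entries scoring above min-k / m.
    candidates = set(partial)
    for lst in lists:
        for item, score in lst:
            if score <= threshold:
                break
            candidates.add(item)

    # Phase 3: fetch the missing scores of all candidates and rank them.
    totals = {d: sum(table.get(d, 0.0) for table in lookup) for d in candidates}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)],
]
print(tput(lists, 1))
```

An item that never becomes a candidate scores at most min-k / m in each of the m lists, so its total cannot exceed min-k; this is why the fixed threshold is safe.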

[Figure: TPUT: the coordinator peer P0 holds the current top-k and the candidate set; cohort peers Pi and Pj scan their index lists down to min-k / m; then the missing scores are retrieved]

Analysis of TPUT
- Theorem: TPUT is an exact algorithm, i.e., it identifies the true top-k items.
- Proof (sketch): TPUT cannot miss a true top-k item. Assume it misses one, i.e., the item is below min-k / m in all lists. Then its overall score is < min-k, so it is not a true top-k item!
- If min-k / m is small, TPUT retrieves a lot of data in Phase 2:
  - high network traffic
  - random accesses
  - high per-peer load

[Figure: state after phase 2 for list 1, list 2, list 3: an item below min-k / m in every list has a score < min-k]

KLEE [VLDB '05]
- Different philosophy: approximate answers
- Efficiency:
  - reduces (docId, score)-pair transfers
  - no random accesses at each peer
- Two pillars:
  - the HistogramBlooms structure
  - the Candidate List Filter structure

Additional Data Structures
- Equi-width histogram
  + Bloom filter for each cell
  + average score per cell
  + upper/lower score
- Usage during Phase 1:
  + fetch top-k from each list
  + top-c cells
  → "increase" the min-k / m threshold

KLEE
[Figure: cohort peers Pi and Pj each hold an index list and a histogram (c cells, b bits); the coordinator peer P0 holds the current top-k and the candidate set; threshold min-k / m]

KLEE – Candidate Set Reduction
[Figure: cohort peers Pi and Pj send bit vectors for the entries above min-k / m; the coordinator peer P0 combines them in the candidate filter matrix]

KLEE – Candidate Retrieval
[Figure: the coordinator peer P0 returns the candidate filter matrix; cohort peers Pi and Pj scan their index lists up to an early stopping point]

Literature
- Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
- Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
- Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
- Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
- Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
- Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
- Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
- Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
- Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648

Part II – Social Search

Motivation
- People connected through a network
  - People create links to other people
  - Links can express friendship, recommendations, etc.
  - Different graph structures appear
- Sharing interests
  - Enables users to find others who share common interests
  - Similar users can provide relevant content
- Users and content are spread across different sites
  - Distributed nature and continuously increasing size call for peer-to-peer approaches

Outline of the Second Part
- Link Analysis: The Web as a Graph
  - PageRank
  - Distributed approaches: BlockRank, Local PageRank + ServerRank, Adaptive OPIC, JXP
- Identifying common interests – Semantic Overlay Networks
  - Crespo and Garcia-Molina
  - pSearch
  - p2pDating
- Social Networks – A new paradigm
  - What people share
  - Social graphs
  - Links, tags, users analysis

Links are everywhere… connecting Web pages
[Figure: web pages linking to each other: www.openp2p.com/..., www.searchtools.com, www.searchengines.com, www.searchengineguide.com, www.searchengineshowdown.com, searchenginewatch.com]

Links are everywhere… connecting people
[Figure: example of a Flickr friends network]

Links are everywhere… connecting products

Link Analysis
- The set of nodes/pages (e.g., web pages, people, products, etc.) and the links connecting them define a graph
- At the end we have something like this…
- Lots of useful information can be obtained from the analysis of such graphs

Adjacency Matrix
- Matrix representation of graphs
- Given a graph G with n nodes, its adjacency matrix A is n×n with
  - aij = 1, if there is a link from node i to node j
  - aij = 0, otherwise

PageRank – Exploring the Wisdom of Crowds
- Measures the relative importance of pages on the graph
- The importance of a page depends on the importance of the pages that point to it
- Random Surfer Model: once on a page, the surfer chooses to follow one of the outlinks with prob. α, or to jump to a random page with prob. (1 - α)
- PR: probability of being at a certain page after a sufficient number of jumps

S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.

PageRank – Formal Definition:

$$PR(q) = \varepsilon \cdot \frac{1}{N} + (1 - \varepsilon) \cdot \sum_{p \,\mid\, p \to q} \frac{PR(p)}{out(p)}$$

N → Total number of pages;
PR(p) → PageRank of page p;
out(p) → Outdegree of p;
ε → Random jump probability (1 - α in the random surfer model)

- Can be computed using the power iteration method
- In practice more efficient versions can be used
- Google is believed to use it on the Web graph, combined with other metrics, to rank their search results

PageRank – Matrix Notation

$$A = (1 - \varepsilon) \cdot P^T + \varepsilon \cdot E$$

A → Matrix containing the transition probabilities, where Pij = 1/out(i), if there is a link from i to j, 0 otherwise; E is the random jumps matrix

$$x^{(k)} = A^k x^{(0)}$$

$x^{(k)}$ → Probability distribution vector at time k; $x^{(0)}$ is the starting vector

$$PageRank = \lim_{k \to \infty} x^{(k)}$$

PageRank → Stationary distribution of the Markov chain described by A, i.e., the principal eigenvector of A
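A minimal power-iteration sketch in plain Python. Assumptions for illustration: a random jump probability of 0.15, a fixed number of iterations instead of a convergence test, and pages without outlinks spreading their score uniformly over all pages (the slides do not say how such pages are treated).

```python
def pagerank(links, eps=0.15, iterations=100):
    """Power iteration for PageRank.

    links: {page: [pages it links to]}
    eps:   random jump probability (0.15 is an assumed value)
    """
    nodes = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}            # uniform starting vector
    for _ in range(iterations):
        new = {q: eps / n for q in nodes}       # random jump share
        for p in nodes:
            outs = links.get(p, [])
            if outs:
                share = (1 - eps) * pr[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Page without outlinks: spread uniformly (assumption).
                for q in nodes:
                    new[q] += (1 - eps) * pr[p] / n
        pr = new
    return pr


print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```

In this three-page example c is linked from both a and b and ends up with the highest score, a receives all of c's score mass, and b is reached only through random jumps.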

Going Distributed
- PageRank in principle needs the whole graph in one place
- Shortcomings:
  - Not scalable for huge graphs, like the Web
  - Slow update: computing PageRank on such a huge graph can take weeks
  - Not suitable for different network architectures (e.g., P2P)
- Distributed approaches, where the graph is partitioned, are clearly needed
- Some distributed approaches (more details on the next slides):
  - Local PageRank + ServerRank (Wang et al.)
  - BlockRank (Kamvar et al.)
  - JXP (Parreira et al.)

The "Block Structure"
- Most links are among web pages inside the same host
- The block structure can be exploited for speeding up and/or distributing the PR computation

[Figure: adjacency matrix whose 1-entries cluster into blocks for the pages from Host A and the pages from Host B]

BlockRank
- PageRank in three steps:
  1. Computes "local PageRanks" of pages for each host, by considering only intra-host links
  2. Computes the importance of each host, using the local PR values and the inter-host links
  3. Combines the previous values to create the starting vector for the standard PR algorithm
- Speeds up computation
- Step 1 can be parallelized
- Still needs the whole matrix for step 3

S. Kamvar, T. Haveliwala, C. Manning & G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.

Going Distributed… Local PR + ServerRank
- Similar to BlockRank
  - Local PR: PR computed inside each server using intra-server links
  - ServerRank: PR computed on the server graph using inter-server links
- The server graph does not need to be materialized; computation is done by exchanging messages among servers
- Local PR and ServerRank are combined to approximate the true PR of a page
- Values can be further refined by using Local PR info in the ServerRank computation and vice versa
- Server partition can be a limitation…

Y. Wang & D. J. DeWitt. Computing pagerank in a distributed internet search system. In VLDB, 2004.

Partition at "peer level"
- In P2P networks, server partition is not suitable
- Every peer crawls Web fragments at its discretion
- Peers have only local (incomplete) information
- Pages might link to, or be linked from, pages at other peers
- Overlaps between peers' graphs may occur
- Peers are a priori unaware of other peers' contents

[Figure: the global graph and the fragments held by Peer A, Peer B and Peer C]

Adaptive OPIC
- OPIC: Online Page Importance Computation
- Computes the importance of a page on-line, with few resources
- Algorithm:
  - Pages initially receive some cash
  - Pages are randomly visited
  - When a page is visited, its cash is distributed between the pages it points to
  - The importance of a given page is computed using the history of cash of that page

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In WWW, 2003.

Adaptive OPIC: Example
- Small Web of 3 pages (Alice, Bob, George)
- Alice has all the cash to start (importance is independent of the initial state)
- Cash-game history:
  - Alice received 600 (200+400): 40%
  - Bob received 600 (200+100+300): 40%
  - George received 300 (200+100): 20%

Adaptive OPIC
- No particular graph partition
- No need to store the link matrix
- Adapts to changes on the web graph by considering only the recent part of the cash history for each page; time window: [now-T, now]
- High number of messages exchanged
- Does not handle the case where the same page is stored at more than one place

The JXP Algorithm Decentralized algorithm for computing global authority

scores of pages in a P2P Network Runs locally at every peer No coordinator, asynchronous Combines Local PageRank computations + Meetings

between peers JXP scores converge to the true global PageRank

scores

Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel and Gerhard Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network. The VLDB Journal, 2007.

The JXP Algorithm

“World Node”:

Special node attached to the local graph at every peer

Compact representation of all other pages in the network

“Special features”:

All links from local pages to external pages point to World Node

Links from external pages that point to local pages (discovered during meetings) are represented at the World Node

Score and outdegree of these external pages are stored; World Node outgoing links are weighted to reflect score mass given by original link

Self-loop link to represent transitions among external pages

The JXP Algorithm

Initialization step:

Local graph is extended by adding the world node

PageRank is computed in the extended graph → JXP Scores

Main algorithm (for every Pi in the network):

Select Pj to meet

Update world node:

Add edges for pages in Pj that point to pages in Pi

If an edge already exists at the world node, the score of the source page is updated by taking the highest of both scores

Compute PageRank → JXP scores

The JXP Algorithm

[Figure: a meeting between Peer X and Peer Y; edge labels recovered from the slides]

Peer X before the meeting, W node: G → C, J → E, A → F, E → G

Peer Y, W node: K → E, L → G, G → C, F → A, E → B

Subgraph relevant to Peer X, W node: K → E, F → E, F → G, F → A

Peer X after the meeting, W node: G → C, J → E, F → A, F → E, K → E, A → F, E → G
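The world-node update performed during a meeting can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `update_world_node`, the set of pages taken to be local to Peer X (A, C, E) and all scores are hypothetical. Each world-node entry stores the score and outdegree of the external source page, and an existing edge keeps the higher of the two scores.

```python
def update_world_node(world_node, local_pages, peer_links, peer_scores):
    """Merge into world_node the other peer's links that point to local pages.

    world_node maps (source, target) -> (score of source, outdegree of source).
    """
    for source, targets in peer_links.items():
        if source in local_pages:
            continue                      # only external pages are kept at the world node
        for target in targets:
            if target not in local_pages:
                continue                  # keep only edges that enter the local graph
            known = world_node.get((source, target))
            score = peer_scores[source]
            if known is None or score > known[0]:
                # New edge, or existing edge whose source now has a higher score.
                world_node[(source, target)] = (score, len(targets))
    return world_node

local_pages = {"A", "C", "E"}                                # assumed pages of Peer X
world_node = {("G", "C"): (0.10, 1), ("J", "E"): (0.05, 2)}  # hypothetical scores
peer_links = {"K": ["E"], "F": ["E", "G", "A"]}              # links offered by Peer Y
peer_scores = {"K": 0.08, "F": 0.12}                         # hypothetical scores
update_world_node(world_node, local_pages, peer_links, peer_scores)
print(sorted(world_node))
```

In this sketch the edge F → G is not added, because G is not a local page of Peer X, but it still counts toward the outdegree stored for F.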

Theorem: “In a fair series of JXP meetings, the JXP scores of all nodes converge to the true global PR scores”

Locating parts of the Graph

“Finding peers that share common interests”

Many applications can benefit from it

Distributed PR:

In principle, peers need to send content only to the peers that contain their successors

Random messages guarantee that those peers will eventually be reached, but part of the messages will be “wasted”

[Figure: Peer X meets Peer Z; edge labels recovered from the slides]

Peer Z, W node: P → M, N → S, M → R

Peer X, W node: G → C, J → E, A → F, E → G (unchanged by the meeting)

Wasted meeting! We want to avoid it.
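Whether a meeting would be wasted can be checked with a small sketch: a meeting helps a peer only if the other peer holds at least one link into its local pages. The function `meeting_is_useful` and the pages assumed to be local to Peer X (A, C, E) are hypothetical; the link lists reuse the edge labels from the figures.

```python
def meeting_is_useful(local_pages, peer_links):
    """Return True if the other peer holds a link from an external page to a local page."""
    return any(
        target in local_pages
        for source, targets in peer_links.items()
        if source not in local_pages
        for target in targets
    )

local_pages_x = {"A", "C", "E"}                           # assumed pages of Peer X
links_at_z = {"P": ["M"], "N": ["S"], "M": ["R"]}         # nothing points into Peer X
links_at_y = {"K": ["E"], "F": ["E", "G", "A"]}           # K and F point into Peer X

print(meeting_is_useful(local_pages_x, links_at_z))
print(meeting_is_useful(local_pages_x, links_at_y))
```

A peer cannot run this test without already knowing the other peer's links, which is why some way of locating promising peers before meeting them is needed.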

Locating parts of the Graph

Query answering

Ideal: Forward query only to peers that are more likely to provide good answers to it

Query flooding is very expensive

Hash-based queries are not suitable for approximate queries

Locating parts of the Graph

Locating “relevant” peers

Increase performance

Reduce traffic load

Idea: Group peers according to the semantics of their content and place them into different overlay networks

Semantic Overlay Networks

Partition the P2P network into several thematic networks

Peers with similar or beneficial/complementary content are “clustered” together

Queries for a content will be forwarded only to peers with such content

Flooding in smaller networks with smaller TTL (or more results with same)
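The routing idea can be sketched as a mapping from topics to the peers that joined the corresponding overlay. This is a minimal illustration: the class `SemanticOverlays`, the peer names and the topics are hypothetical, and how peers and queries are assigned to topics is outside the scope of the sketch.

```python
from collections import defaultdict

class SemanticOverlays:
    """Toy model of semantic overlay networks: one peer set per topic."""

    def __init__(self):
        self.overlays = defaultdict(set)  # topic -> peers in that overlay

    def join(self, peer, topics):
        for topic in topics:              # a peer may belong to several overlays
            self.overlays[topic].add(peer)

    def route(self, topic):
        # Forward the query only to peers in the matching overlay.
        return sorted(self.overlays.get(topic, ()))

sons = SemanticOverlays()
sons.join("peer1", ["jazz", "rock"])
sons.join("peer2", ["rock"])
sons.join("peer3", ["databases"])
targets = sons.route("rock")
print(targets)
```

Here a query on "rock" reaches two of the three peers, whereas flooding a random overlay could contact all of them, including the one with unrelated content.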

Overlay Networks: Random vs. Semantic

Random

Peers connect to a small set of random peers

Queries are flooded through the network

Peers with unrelated content receive query

Low performance: High number of messages

Low recall if only few peers are contacted

Semantic

Peers connect to peers with
