Text-Based Content Search and Retrieval in ad hoc P2P Communities


Page 1: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Text-Based Content Search and Retrieval in ad hoc P2P Communities

Francisco Matias Cuenca-Acuna
Thu D. Nguyen
http://www.panic-lab.rutgers.edu/

Page 2: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Motivation

It is hard to find information in current P2P infrastructures

• They are designed for name-based search
• They don’t have quality metrics
• They don’t rank results
• Most are optimized to find popular content

The current Internet search model has proven effective at locating data

• Intuitive term-based query model
• Quality metrics and ranking are critical factors in the success of Internet search engines
• They help users quickly pinpoint relevant documents in a vast repository

Page 3: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Goals & challenges

Empower P2P communities with search capabilities similar to those of Internet search engines

• No central servers
• Fault tolerance

Cannot employ the current model used by Internet search engines

• No central management and administration
• Resources are fragmented
• Peer behavior is uncontrolled

Page 4: Text-Based Content Search and Retrieval in ad hoc P2P Communities

[Figure: two peers, each with local files, XML snippets, a Bloom filter of keys [K1,..,Kn], and a local directory; the Bloom filters are exchanged by gossiping.]

Local Directory
Nickname   Status    IP   Keys
Alice      Online    …    [K1,..,Kn]
Bob        Offline   …    [K1,..,Kn]
Charles    Online    …    [K1,..,Kn]

Summary of PlanetP

Nodes maintain an index of their content

• Represented as Bloom filters

Indexes and directories are replicated everywhere

• Gossiping keeps peers synchronized
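To make the Bloom filter index concrete, here is a minimal sketch of how a peer might summarize its terms. The class name, the filter size, the number of hash functions, and the SHA-256 based hashing are illustrative assumptions, not details taken from the slides.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter summarizing the terms a peer holds (sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Both parameters are assumed values chosen for illustration.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one Python integer

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


# Hypothetical peer summarizing three terms from its local files.
alice = BloomFilter()
for word in ["gossip", "bloom", "filter"]:
    alice.add(word)

print("bloom" in alice)  # an added term always tests as present
```

A membership test can return a false positive but never a false negative, so a peer that really holds a term is never missed when other peers consult its filter.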

Page 5: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Content search in PlanetP

[Figure: a query issued by Diane passes through four steps: local lookup of the query keys against the Bloom filters [K1,..,Kn] in the local directory (peers Alice, Bob, Charles, Diane, Edward, Fred, Gary), rank nodes, contact candidates, and rank results (File1, File2, File3), ending at STOP.]

Page 6: Text-Based Content Search and Retrieval in ad hoc P2P Communities

The Vector Space model

Documents and queries are represented as k-dimensional vectors

• Words are weighted according to their relevance to the document
• Documents are weighted according to their words

The angle between a query and a document indicates their similarity

[Figure: a document vector and a query vector separated by an angle.]
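The angle-based similarity can be illustrated with the usual cosine measure. This is a sketch: representing vectors as term-to-weight dictionaries and the function name `cosine_similarity` are assumptions made for the example.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * query.get(term, 0.0) for term, weight in doc.items())
    doc_norm = math.sqrt(sum(w * w for w in doc.values()))
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (doc_norm * query_norm)


# Hypothetical weights: the smaller the angle, the closer the value is to 1.
document = {"p2p": 2.0, "search": 1.0}
query = {"search": 1.0}
print(cosine_similarity(document, query))
```

A value near 1 means the two vectors point in almost the same direction; a value of 0 means they share no weighted terms.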

Page 7: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Weight assignment (TFxIDF)

Idea

• Use per-document Term Frequency (TF) to weight words (W_{D,t})
• Use inverse global popularity (IDF) to find good discriminators among the query terms

Intuition

• TF indicates how related a document is to a particular concept
• Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents

W_{D,t} = f(Frequency of t in D)

IDF_t = f(No. documents / Frequency of t across documents)
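The slide leaves the function f unspecified. The sketch below assumes a common logarithmic damping for both weights and counts the "frequency of t across documents" as the number of documents containing t; both choices, like the function names, are assumptions for illustration.

```python
import math
from collections import Counter


def term_weights(doc_tokens):
    """W_{D,t}: damped term frequency, here 1 + log(count) (an assumed f)."""
    return {t: 1.0 + math.log(c) for t, c in Counter(doc_tokens).items()}


def idf(term, docs):
    """IDF_t: here log(1 + N / f_t), with f_t = documents containing term."""
    containing = sum(1 for tokens in docs if term in tokens)
    if containing == 0:
        return 0.0  # a term that appears nowhere cannot discriminate
    return math.log(1.0 + len(docs) / containing)


# Hypothetical three-document collection.
docs = [
    ["peer", "search", "search"],
    ["peer", "gossip"],
    ["peer", "bloom", "filter"],
]
print(idf("gossip", docs) > idf("peer", docs))  # rarer terms weigh more
```

Because "peer" occurs in every document it receives the lowest IDF, which matches the intuition that common words are poor discriminators.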

Page 8: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Node & document ranking in PlanetP

Unfortunately IDF is not suited for P2P

• It requires an appearance count for every word in the community

We introduce the use of the Inverse Peer Frequency (IPF)

IPF_t = f(No. peers / Peers with documents containing t)

• IPF can be computed with local information
• IPF is compatible across the community
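A sketch of how IPF might be computed from the local directory and used to rank peers. The logarithmic form of f, the rule of summing IPF over matching query terms, and the use of plain sets in place of Bloom filters are all assumptions made for this example.

```python
import math


def ipf(term, directory):
    """IPF_t = log(1 + N / N_t), with N_t = peers whose summary has the term."""
    holders = sum(1 for terms in directory.values() if term in terms)
    if holders == 0:
        return 0.0
    return math.log(1.0 + len(directory) / holders)


def rank_peers(query_terms, directory):
    """Order peers by the summed IPF of the query terms they appear to hold."""
    scores = {}
    for peer, terms in directory.items():
        score = sum(ipf(t, directory) for t in query_terms if t in terms)
        if score > 0.0:
            scores[peer] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical directory; plain sets stand in for the per-peer Bloom filters.
directory = {
    "Alice": {"peer", "search"},
    "Bob": {"peer", "gossip", "bloom"},
    "Charles": {"peer"},
}
print(rank_peers(["gossip", "peer"], directory))  # Bob ranks first
```

Everything the computation needs is already in the replicated directory, so no peer has to be contacted to rank the candidates.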

Page 9: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Stopping condition

Intuitive idea: stop as soon as k documents are retrieved

• Not good: a node might have a few highly ranked documents and many that have a low rank

We propose an adaptive approach:

• Contact nodes one by one and keep a list of the top k documents retrieved
• Stop contacting candidates when p nodes in a row fail to contribute to the top k
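The adaptive approach can be sketched as a loop over the ranked candidates. The function name, the `fetch` callback, and the (score, document id) result format are hypothetical; only the top-k list and the "p misses in a row" rule come from the slide.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in rank order; stop after p consecutive unhelpful peers.

    fetch(peer) is a hypothetical callback returning (score, doc_id) pairs.
    """
    top_k = []  # min-heap holding the best k (score, doc_id) pairs so far
    misses = 0
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top_k) < k:
                heapq.heappush(top_k, (score, doc_id))
                contributed = True
            elif score > top_k[0][0]:
                # Better than the weakest of the current top k: replace it.
                heapq.heapreplace(top_k, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top_k, reverse=True), contacted


# Hypothetical per-peer results, listed in the order the peers were ranked.
results = {
    "A": [(0.9, "a1"), (0.8, "a2")],
    "B": [(0.1, "b1")],
    "C": [(0.05, "c1")],
    "D": [(0.95, "d1")],
}
top, contacted = adaptive_search(["A", "B", "C", "D"], results.get, k=2, p=2)
print(top, contacted)
```

In this run peers B and C both fail to improve the top 2, so the search stops before contacting D. That shows the trade-off behind the heuristic: fewer peers are contacted, at the risk of missing a good document held by a low-ranked peer.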

Page 10: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Evaluation method

We use five well-known document collections

• Each collection comes with a set of queries and relevance judgments
• Here we present results for one (AP89)

We measure recall and precision

Trace   Queries   Documents   Number of words   Collection size (MB)
AP89    97        84678       129603            266.0
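For reference, the recall and precision measured here can be computed per query from the retrieved set and the relevance judgments. This is a sketch using the standard definitions; the function name and the document ids are hypothetical and not taken from AP89.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.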

Page 11: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Evaluation method

We use a simulator to test our algorithm

• With different file distributions
• Against a central search engine
• Quantifying the effect of not using an adaptive stopping condition

Page 12: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Results

[Figure: precision (0% to 100%) versus number of documents requested (0 to 300) for IDF, IPF Ad.W, IPF Ad.U and IPF First-k.]

[Figure: recall (0% to 100%) versus number of documents requested (0 to 300) for IDF, IPF Ad.W, IPF Ad.U and IPF First-k.]

Page 13: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Results cont.

[Figure: number of peers contacted (0 to 450) versus number of documents requested (0 to 300) for IDF W, IPF Ad.W, IPF Ad.U and IPF First-k.]

Page 14: Text-Based Content Search and Retrieval in ad hoc P2P Communities

More results

Adjusting the stop condition according to the community size and the number of results expected

• We provide a linear function to determine p

Recall as the community grows to 1000 (scalability)

Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF

• 80% on average

Page 15: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Conclusions

PlanetP matches TFxIDF’s performance using the TFxIPF approximation

• Gives P2P communities search capabilities as powerful as those of environments with centralized resources
• TFxIPF is applicable beyond PlanetP
• PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community

Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results

Page 16: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Related Work

Tapestry, Pastry, Chord and CAN

• Implement a distributed hash table for P2P environments
• Oriented towards name-based searches (for FS)
• They already store all the information needed to implement TFxIPF

Cori and Gloss

• Address the problem of indexing and searching distributed collections of documents
• They build a centralized index that has total knowledge of word usage so they don’t contact unnecessary nodes

Page 17: Text-Based Content Search and Retrieval in ad hoc P2P Communities

Questions?