Text-Based Content Search and Retrieval in ad hoc P2P Communities
![Page 1: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/1.jpg)
Text-Based Content Search and Retrieval in ad hoc P2P Communities
Francisco Matias Cuenca-Acuna
Thu D. Nguyen
http://www.panic-lab.rutgers.edu/
![Page 2: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/2.jpg)
Motivation
It is hard to find information in current P2P infrastructures
They are designed for name-based search
They don’t have quality metrics
They don’t rank results
Most are optimized to find popular content
The current Internet search model has proven effective for locating data
Intuitive term-based query model
Quality metrics and ranking are critical factors in the success of Internet search engines
• Help users quickly pinpoint relevant documents from a vast repository
![Page 3: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/3.jpg)
Goals & challenges
Empower P2P communities with search capabilities similar to Internet search engines
No central servers
Fault tolerance
Cannot employ the current model used by Internet search engines
No central management and administration
Resources are fragmented
Peer behavior is uncontrolled
![Page 4: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/4.jpg)
[Figure: two peers, each with local files (XML snippets) summarized as a Bloom filter of keys [K1,..,Kn] that is spread by gossiping. Each peer holds a Local Directory with columns Nickname, Status, IP, Keys; sample rows: Alice (Online), Bob (Offline), Charles (Online), each with keys [K1,..,Kn].]
Summary of PlanetP
Nodes maintain an index of their content
Represented as Bloom filters
Indexes and Directories are replicated everywhere
Gossiping keeps peers synchronized
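The idea of summarizing a node's content as a Bloom filter can be sketched as follows. This is a minimal illustration, not PlanetP's implementation: the class name `BloomFilter`, the helper `summarize`, the filter size, the number of hash functions, and the whitespace tokenizer are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size Bloom filter over string terms (sizes are illustrative)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def summarize(documents: Iterable[str]) -> BloomFilter:
    """Build one filter covering every term in a peer's local documents."""
    summary = BloomFilter()
    for text in documents:
        for term in text.lower().split():
            summary.add(term)
    return summary
```

A peer would gossip only the compact filter, so other peers can test `"term" in summary` locally without holding the documents themselves.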
![Page 5: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/5.jpg)
Content search in PlanetP
[Figure: Diane issues a query. Stages shown: Query, Local lookup (against the Nickname and Keys columns of the Local Directory, with entries for Alice, Bob, Charles, Diane, Edward, Fred, Gary), Rank nodes (Bob, Diane, Fred), Contact candidates, Rank results (File1, File2, File3), STOP.]
![Page 6: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/6.jpg)
The Vector Space model
Documents and queries are represented as k-dimensional vectors
Words are weighted according to their relevance for the document
Documents are weighted according to their words
The angle between a query and a document indicates their similarity
[Figure: Document and Query vectors]
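As a small illustration of the angle-based similarity, the sketch below computes the cosine between a query vector and a document vector. Representing each vector as a sparse dictionary from term to weight is an assumption of this example, as is the function name `cosine_similarity`.

```python
import math


def cosine_similarity(query: dict, document: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * document.get(term, 0.0) for term, w in query.items())
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in document.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (q_norm * d_norm)
```

A value near 1 means the two vectors point in nearly the same direction (small angle); a value of 0 means they share no weighted terms.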
![Page 7: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/7.jpg)
Weight assignment (TFxIDF)
Idea
Use per-document Term Frequency (TF) to weight words ($W_{D,t}$)
Use inverse global popularity (IDF) to find good discriminators among the query terms
Intuition
TF indicates how related a document is to a particular concept
Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents
$W_{D,t} = f(\text{frequency of } t \text{ in } D)$
$\mathrm{IDF}_t = f(\text{No. documents} / \text{frequency of } t \text{ across documents})$
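The slide leaves the function f unspecified. The sketch below picks one common logarithmic form purely as an assumption, with W = 1 + log(count) and IDF = log(1 + N / frequency); the function names `tf_weight` and `idf` are hypothetical.

```python
import math
from collections import Counter


def tf_weight(term: str, doc_terms: list) -> float:
    """W_{D,t}: grows with the frequency of `term` in one document (assumed log form)."""
    count = Counter(doc_terms)[term]
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(term: str, collection: list) -> float:
    """IDF_t: large when `term` is rare across the collection (assumed log form).

    `collection` is a list of documents, each a list of terms.
    """
    frequency = sum(doc.count(term) for doc in collection)
    if frequency == 0:
        return 0.0
    return math.log(1.0 + len(collection) / frequency)
```

With this choice, a term that appears in every document many times gets a low IDF, while a rare term gets a high one, which matches the "good discriminator" intuition.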
![Page 8: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/8.jpg)
Node & document ranking in PlanetP
Unfortunately IDF is not suited for P2P
Requires an appearance count for every word in the community
We introduce the use of the Inverse Peer Frequency
$\mathrm{IPF}_t = f(\text{No. peers} / \text{peers with documents containing } t)$
IPF can be computed with local information
IPF is compatible across the community
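Because every peer holds a summary of every other peer's terms, IPF can be computed without contacting anyone. The sketch below assumes a logarithmic f, IPF = log(1 + N / N_t), and assumes a node's rank is the sum of the IPF values of the query terms its summary contains; neither choice is stated on the slide. Plain Python sets stand in for the per-peer summaries, and `ipf` and `rank_peers` are hypothetical names.

```python
import math


def ipf(term: str, directory: dict) -> float:
    """Inverse Peer Frequency from a local directory {nickname: term summary}."""
    peers_with_term = sum(1 for summary in directory.values() if term in summary)
    if peers_with_term == 0:
        return 0.0
    return math.log(1.0 + len(directory) / peers_with_term)


def rank_peers(query_terms: list, directory: dict) -> list:
    """Order candidate peers by the summed IPF of the query terms they hold."""
    scores = {}
    for nickname, summary in directory.items():
        score = sum(ipf(t, directory) for t in query_terms if t in summary)
        if score > 0.0:
            scores[nickname] = score  # peers matching no term are not candidates
    # Highest score first; ties broken by nickname for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

For example, with `{"Alice": {"p2p", "search"}, "Bob": {"search"}, "Charles": {"music"}}` and the query `["p2p", "search"]`, Alice ranks ahead of Bob and Charles is not a candidate.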
![Page 9: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/9.jpg)
Stopping condition
Intuitive idea: Stop as soon as k documents are retrieved
Not good
A node might have few highly ranked documents and many that have a low rank
We propose an adaptive approach:
Contact nodes one by one and keep a list of the top k documents retrieved
Stop contacting candidates when p nodes in a row fail to contribute to the top k
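The adaptive stopping rule can be sketched as below. The function name `adaptive_search` and the `fetch` callback (which stands in for contacting a peer and receiving scored documents) are assumptions of this example, and k is assumed to be at least 1.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in rank order until p in a row add nothing to the top k.

    ranked_peers: peer identifiers, best candidate first.
    fetch(peer):  returns a list of (score, doc_id) pairs from that peer.
    Returns (top documents sorted best first, number of peers contacted).
    """
    top_k = []     # min-heap of (score, doc_id); top_k[0] is the weakest kept result
    misses = 0     # consecutive peers that contributed nothing
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top_k) < k:
                heapq.heappush(top_k, (score, doc_id))
                contributed = True
            elif score > top_k[0][0]:
                # Better than the weakest kept document: swap it in.
                heapq.heapreplace(top_k, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top_k, reverse=True), contacted
```

Unlike the first-k rule, this keeps going past the first k hits, so a later peer with a few highly ranked documents can still displace weaker ones already collected.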
![Page 10: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/10.jpg)
Evaluation method
We use five well-known document collections
Each collection comes with a set of queries and relevance judgments
Here we present results for one (AP89)
We measure recall and precision

| Trace | Queries | Documents | Number of words | Collection size (MBs) |
|-------|---------|-----------|-----------------|-----------------------|
| AP89  | 97      | 84678     | 129603          | 266.0                 |
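For reference, recall and precision for a single query can be computed from the retrieved documents and the relevance judgments as in the sketch below, which uses the standard definitions. The function name `precision_recall` and the document identifiers in the usage note are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For instance, retrieving `["d1", "d2", "d3", "d4"]` when the judged-relevant set is `["d1", "d3", "d9"]` gives a precision of 0.5 and a recall of 2/3.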
![Page 11: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/11.jpg)
Evaluation method
We use a simulator to test our algorithm
Different file distributions
Against a central search engine
Quantifying the effect of not using an adaptive stopping condition
![Page 12: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/12.jpg)
Results
[Figure: Precision (0% to 100%) vs. No. documents requested (0 to 300). Series: IDF, IPF Ad.W, IPF Ad.U, IPF First-k]
[Figure: Recall (0% to 100%) vs. No. documents requested (0 to 300). Series: IDF, IPF Ad.W, IPF Ad.U, IPF First-k]
![Page 13: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/13.jpg)
Results cont.
[Figure: No. peers contacted (0 to 450) vs. No. documents requested (0 to 300). Series: IDF W, IPF Ad.W, IPF Ad.U, IPF First-k]
![Page 14: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/14.jpg)
More results
Adjusting the stop condition according to the community size and number of results expected
We provide a linear function to determine p
Recall as the community grows to 1000 (scalability)
Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF
80% on average
![Page 15: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/15.jpg)
Conclusions
PlanetP matches TFxIDF’s performance using the TFxIPF approximation
Gives P2P communities search capabilities as powerful as environments with centralized resources
TFxIPF is applicable beyond PlanetP
PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community
Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results
![Page 16: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/16.jpg)
Related Work
Tapestry, Pastry, Chord and CAN
Implement a distributed hash table for P2P environments
Oriented towards name-based searches (for FS)
They already store all the information needed to implement TFxIPF
Cori and Gloss
Address the problem of indexing and searching distributed collections of documents
They build a centralized index that has total knowledge of word usage so they don’t contact unnecessary nodes
![Page 17: Text-Based Content Search and Retrieval in ad hoc P2P Communities](https://reader035.fdocuments.net/reader035/viewer/2022072017/5681348d550346895d9b755f/html5/thumbnails/17.jpg)
Questions?