Branch-and-bound nearest neighbor searching over unbalanced trie-structured overlays
Master’s Thesis Presentation
Technical University of Crete
4.2.2013
Author: Michail Argyriou
Supervisor: Asst. Prof. Vasilis Samoladas
P2P Evolution
[Timeline figure, 1998 to 2002: P2P systems move from centralized to semi-distributed to fully distributed designs; DHTs appear in 2001.]
2
Distributed Hash Table (DHT)
3
DHT Frameworks Evolution
2003: PGrid
• Rectangular queries support
• Peers only on leaves
• High-dimensional queries support with space-filling curves
2006: VBI
• Height-balanced search tree limitation
2008: GRaSP
• No height-balanced search tree limitation
• Abstract types of data and queries
• Data: point, rectangular
• Queries: point, 3-sided, n-d rectangular
4
Nearest neighbor search
5
Given a distributed data set, how can we find the k data items most similar to a query?
“k-Nearest Neighbor Search”
6
Applications
• GIS
• Distributed Databases
• Statistical Classification
• Recommendation Systems
• Cluster analysis
• Similarity Scores
7
Related Work
1. Naïve algorithm: Central peer collects data and performs k-NN searching
2. k-NN search algorithm over CAN
3. Distributed quadtree-based index: each quadtree block is uniquely identified by its centroid, which is mapped to Chord; a k-NN search algorithm runs on top of this index
8
Contents
GRaSP
k-NN
Evaluation
Conclusions
9
GRaSP
10
GRaSP Building the trie ...
Hierarchical space partition:
1. Peer p joins
2. Finds a bootstrapping peer q
3. Space region s(q) splits into s(q0) and s(q1)
11
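The join steps above can be sketched in a few lines. This is a minimal illustration, not the GRaSP implementation: the `Peer` class, the 1-D interval used as a region, and the choice that the existing peer keeps the `0` child while the newcomer takes the `1` child are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    peer_id: str  # binary trie path; "" is the root (hypothetical encoding)
    lo: float     # region modelled as the 1-D interval [lo, hi) for illustration
    hi: float


def join(new_name, q, split_at=None):
    """New peer joins through bootstrap peer q: s(q) splits into s(q0) and s(q1)."""
    if split_at is None:
        split_at = (q.lo + q.hi) / 2  # default: cut the region in the middle
    if not q.lo < split_at < q.hi:
        raise ValueError("split point must lie strictly inside s(q)")
    # Assumption: the newcomer takes the upper half (child 1) ...
    p = Peer(new_name, q.peer_id + "1", split_at, q.hi)
    # ... and q keeps the lower half (child 0).
    q.peer_id += "0"
    q.hi = split_at
    return p
```

Starting from a single peer that owns the whole space, each call to `join` deepens the trie by one level at the bootstrap peer, which is how an unbalanced trie can arise when joins concentrate on the same region.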
GRaSP Space Partition
[Figure panels: a region before splitting, shown once for the volume-balanced partition and once for the data-balanced partition.]
13
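The slides do not define the two partition policies, so the following sketch rests on an assumed reading: a volume-balanced split cuts a region at its geometric midpoint, while a data-balanced split cuts it at the median of the stored data. All function names are hypothetical, and the region is a 1-D interval for brevity.

```python
from statistics import median


def volume_balanced_split(lo, hi):
    """Assumed policy: both halves get equal volume (here, equal length)."""
    return (lo + hi) / 2


def data_balanced_split(lo, hi, points):
    """Assumed policy: both halves get roughly equal numbers of data points."""
    return median(points) if points else (lo + hi) / 2


def side_counts(points, cut):
    """Number of points that fall left of the cut and at or right of it."""
    left = sum(1 for x in points if x < cut)
    return left, len(points) - left
```

With skewed data such as `[0.0625, 0.125, 0.1875, 0.25, 0.75]` on `[0, 1)`, the midpoint cut leaves four points on one side and one on the other, whereas the median cut leaves two and three.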
GRaSP Space Partition for a 3-sided query
14
GRaSP Space Partition for a 3-sided query
15
GRaSP Space Partition for a 3-sided query
16
GRaSP Data Insertion
We insert a key k into all peers who own regions that contain k
17
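A small sketch of the insertion rule, under assumptions: regions are half-open 1-D intervals, a key is a closed interval `(a, b)` with `a == b` for a point, and "contain" is read as "overlap" for non-point keys. The `owners` and `insert` helpers are hypothetical names.

```python
def owners(regions, key):
    """Return the ids of all peers whose region [lo, hi) overlaps key = (a, b)."""
    a, b = key
    return sorted(pid for pid, (lo, hi) in regions.items() if a < hi and b >= lo)


def insert(stores, regions, key):
    """Store the key at every owning peer; a wide key may land on several peers."""
    for pid in owners(regions, key):
        stores.setdefault(pid, []).append(key)
    return stores
```

A point key ends up on exactly one peer, while a rectangular key that straddles a region boundary is replicated on each peer it touches.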
GRaSP Routing Tables
Each peer knows a peer in each complementary subtrie ...
Example: the complementary subtries of peer 0100 are 1, 00, 011, and 0101.
18
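The complementary subtries of a peer can be computed directly from its binary identifier: for every prefix of the ID, flip the last bit. The function name below is hypothetical; the rule itself follows from the example for peer 0100 on the slide.

```python
def complementary_subtries(peer_id):
    """For each bit position, keep the prefix before it and flip that bit."""
    flip = {"0": "1", "1": "0"}
    return [peer_id[:i] + flip[bit] for i, bit in enumerate(peer_id)]


print(complementary_subtries("0100"))
```

A peer at depth d therefore keeps d routing entries, one per complementary subtrie.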
GRaSP Routing
“In order to route a message from peer p to peer q, the message is forwarded from p to a neighbor peer r included in a known subtrie closer to peer q. From r it is recursively forwarded to q.”
19
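The routing rule can be simulated in a single process. This is a sketch under assumptions: peers are the leaves of a complete binary trie, identified by bit strings, and each routing entry simply points to the lexicographically smallest peer in the complementary subtrie (a real overlay would pick any known peer). `build_tables` and `route` are hypothetical helpers.

```python
def complementary_subtries(peer_id):
    flip = {"0": "1", "1": "0"}
    return [peer_id[:i] + flip[bit] for i, bit in enumerate(peer_id)]


def build_tables(peers):
    """One contact per complementary subtrie (assumed choice: smallest id)."""
    tables = {}
    for p in peers:
        tables[p] = {}
        for sub in complementary_subtries(p):
            members = sorted(x for x in peers if x.startswith(sub))
            if members:
                tables[p][sub] = members[0]
    return tables


def route(tables, src, dst):
    """Forward hop by hop into the known subtrie that contains dst."""
    path = [src]
    cur = src
    while cur != dst:
        sub = next(s for s in tables[cur] if dst.startswith(s))
        cur = tables[cur][sub]
        path.append(cur)
    return path


peers = ["00", "0100", "0101", "011", "1"]
print(route(build_tables(peers), "1", "0101"))
```

Every hop lands on a peer that shares a strictly longer prefix with the destination, so the loop terminates.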
Contents
GRaSP
k-NN
Evaluation
Conclusions
20
Searching Algorithm Branch-and-bound algorithm
Fringe: a priority queue PQ of candidate peers that may hold an answer better than the k-th answer found so far
1. Branch Step: expand PQ
2. Bound Step: prune PQ
21
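The branch and bound steps can be illustrated with a single-process, best-first sketch. It is not the distributed protocol from the thesis: here the fringe is seeded with every peer up front, whereas the real algorithm discovers candidates through routing tables. Regions are 1-D intervals, and the names `interval_dist` and `knn` are hypothetical.

```python
import heapq


def interval_dist(q, lo, hi):
    """Distance from point q to the interval [lo, hi]; 0 if q lies inside."""
    return max(lo - q, 0.0, q - hi)


def knn(q, k, peers):
    """peers: dict peer_id -> ((lo, hi), [points]). Returns (answers, visited)."""
    # Fringe: candidate peers ordered by a lower bound on their distance to q.
    fringe = [(interval_dist(q, lo, hi), pid) for pid, ((lo, hi), _) in peers.items()]
    heapq.heapify(fringe)
    best = []      # max-heap of the k best answers, stored as (-distance, point)
    visited = []
    while fringe:
        bound, pid = heapq.heappop(fringe)
        # Bound step: no remaining candidate can beat the k-th answer so far.
        if len(best) == k and bound >= -best[0][0]:
            break
        visited.append(pid)
        # Simplified branch step: examine the data held by the visited peer.
        for x in peers[pid][1]:
            d = abs(x - q)
            if len(best) < k:
                heapq.heappush(best, (-d, x))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, x))
    answers = sorted((x for _, x in best), key=lambda x: abs(x - q))
    return answers, visited
```

Visiting peers in order of their lower bound means the search can stop as soon as the closest unvisited region is farther away than the current k-th answer.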
Searching Algorithm Parallel Searching vs Iterative Searching
Parallel Searching requires huge message state!
Iterative Searching prunes larger regions of the data space!
22
Searching Algorithm
23
Searching Algorithm Branch-and-bound algorithm
1? d(q,s(1)) < d(q,a)
00? d(q,s(00)) > d(q,a)
011? d(q,s(011)) > d(q,a)
0101? d(q,s(0101)) < d(q,a)
24
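The comparison on this slide needs a distance from the query point q to a region s. A minimal sketch follows, assuming axis-aligned rectangular regions and the Euclidean metric (the framework itself is stated for general metric spaces); `mindist` and `may_hold_better` are hypothetical names, and `kth_dist` stands for d(q,a).

```python
import math


def mindist(q, rect):
    """Smallest Euclidean distance from point q to rect = (lows, highs)."""
    lows, highs = rect
    return math.sqrt(
        sum(max(lo - x, 0.0, x - hi) ** 2 for x, lo, hi in zip(q, lows, highs))
    )


def may_hold_better(q, rect, kth_dist):
    """A subtrie is worth visiting only if its region is closer than d(q,a)."""
    return mindist(q, rect) < kth_dist
```

A subtrie whose region fails this test cannot contain a point closer than the current answer, so it can be dropped from the fringe without being contacted.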
Latency Complexity Theorem
Latency = |T| · O(log n)
Support Set T:
25
Latency Complexity Theorem Proof
Peers visited: the peers in T, that is, |T| peers
Find peer in the complementary subtrie: O(log n)
26
Contents
GRaSP
k-NN
Evaluation
Conclusions
27
Performance Evaluation Taking into account number of dimensions
Low Medium High
28
Performance Evaluation Metrics
• Data Fairness Index
• Latency
• Max Throughput
• Fringe Size (mean, max)
29
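The slides do not say how the Data Fairness Index is computed. One common choice for such a metric is Jain's fairness index, (Σx)² / (n · Σx²), and the sketch below assumes that definition; it may differ from the one used in the thesis. The function name is hypothetical.

```python
def jain_fairness(loads):
    """Jain's fairness index of per-peer loads: 1.0 is perfectly even, 1/n is worst."""
    if not loads:
        raise ValueError("need at least one peer")
    total = sum(loads)
    squares = sum(x * x for x in loads)
    if squares == 0:
        return 1.0  # every peer holds nothing: treated here as perfectly fair
    return (total * total) / (len(loads) * squares)
```

Four peers holding five items each score 1.0, while one peer holding everything among four scores 0.25.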
Low Medium High
Low dimensions
30
Low dimensions Workloads
Datasets
• Greece, data-balanced partition, k=1/10/100
• Greece, volume-balanced partition, k=1
Querysets
• Synthetic queries
• For a network of n peers, we issued n/3 queries
31
Low dimensions Which space partition is the best?
Volume-balanced
Data-balanced
32
Low dimensions Which space partition is the best?
[Plot: Data FI vs Space Partition; Greece; panels: volume-balanced partition, data-balanced partition]
33
Low dimensions Which space partition is the best?
[Plot: Latency vs Space Partition; Greece, k=1; panels: volume-balanced partition, data-balanced partition]
34
Low dimensions Which space partition is the best?
[Plot: Fringe Size vs Space Partition; Greece, k=1; panels: volume-balanced partition, data-balanced partition]
35
Low dimensions Which space partition is the best?
[Plot: Max Throughput vs Space Partition; Greece, k=1; panels: volume-balanced partition, data-balanced partition]
36
Low dimensions Which space partition is the best?
Volume-balanced
Data-balanced
37
Low dimensions k ?
38
Low dimensions How is the size of the fringe affected?
[Plot: Fringe Size vs k; Greece, data-balanced partition; panels: k=1, k=10, k=100]
39
Low dimensions How is the latency affected?
[Plot: Latency vs k; Greece, data-balanced partition; panels: k=1, k=10, k=100]
40
Low dimensions How is the Max. Throughput affected?
[Plot: Max Throughput vs k; Greece, data-balanced partition; panels: k=1, k=10, k=100]
41
Low dimensions … efficient routing!
42
Low Medium High
Medium dimensions
43
Medium dimensions Workloads
Datasets
• Uniform, volume-balanced partition, k=1
• ColorMoments, data-balanced partition, k=1
Querysets
• Synthetic queries
• For a network of n peers, we issued n/3 queries
44
Medium dimensions How is the size of the fringe affected?
45
Medium dimensions How is the size of the fringe affected?
[Plot: ColorMoments, data-balanced, k=1]
46
Medium dimensions How is the size of the fringe affected?
[Plots: Max. Fringe Size and Mean Fringe Size; Uniform, volume-balanced, k=1]
47
Medium dimensions Data Fairness Index
48
Medium dimensions Data Fairness Index
[Plot: ColorMoments, data-balanced, k=1]
49
Medium dimensions Data Fairness Index
[Plot: Uniform, volume-balanced, k=1]
50
Medium dimensions Latency
51
Medium dimensions Latency
[Plot: ColorMoments, data-balanced, k=1]
52
Medium dimensions Latency
[Plot: Uniform, volume-balanced, k=1]
53
Medium dimensions Latency
Latency is high but close to the optimum!
54
Medium dimensions Max. Throughput
55
Medium dimensions Max. Throughput
[Plot: ColorMoments, data-balanced, k=1]
56
Medium dimensions Max. Throughput
[Plot: Uniform, volume-balanced, k=1]
57
Medium dimensions … not efficient routing but near optimum!
It's still good enough for practical applications!
58
Low Medium High
High dimensions
59
High dimensions Curse of dimensionality
“When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.”
60
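A short worked calculation shows how quickly sparsity sets in. The sketch assumes the unit hypercube is divided into a grid with a fixed number of cells per axis; both helper names are hypothetical.

```python
def cells_in_grid(per_axis, dims):
    """Number of grid cells when each of `dims` axes is cut into `per_axis` parts."""
    return per_axis ** dims


def max_occupied_fraction(n_points, per_axis, dims):
    """Upper bound on the fraction of cells that can contain at least one point."""
    return min(1.0, n_points / cells_in_grid(per_axis, dims))


for d in (2, 5, 10):
    print(d, cells_in_grid(10, d), max_occupied_fraction(1_000_000, 10, d))
```

With ten cells per axis, a million points can cover every cell in two or five dimensions, but in ten dimensions they can touch at most one cell in ten thousand, so most of the space near any query is empty.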
Contents
GRaSP
k-NN
Evaluation
Conclusions
61
Conclusions
[Architecture diagram, labels as they appear on the slide: API; Searching (k-NN); Trie Data Ins/Rem; Space Partition; Query Types Data Types; Metric Space]
62
Future Work
• Approximate k-NN searching for high dimensions
• Redundancy
63
THANK YOU
QUESTIONS ?
64