Top-k String Similarity Search
Chiao-Meng Huang
Guanghao Peng
Liwen Hu
Qing Hu
Motivation
Top-k String Similarity Search
• Given a collection of strings and a query string, return the k strings in the collection that are closest to the query by edit distance.
• Example:
▫ Search "shout" with k = 5
▫ Result: scout, shoot, short, shot, spout
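The problem statement above can be illustrated with a brute-force baseline: compute the edit distance from the query to every string and keep the k smallest. This is only a minimal sketch for reference, not the method of this talk. The function names and the small word list are assumptions made for illustration, and ties are broken alphabetically by choice.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def top_k_bruteforce(strings, query, k):
    """Return the k strings closest to query (ties broken alphabetically)."""
    return heapq.nsmallest(k, strings,
                           key=lambda s: (edit_distance(s, query), s))


# Hypothetical word list built around the slide's example.
words = ["scout", "shoot", "short", "shot", "spout", "shouting", "table"]
print(top_k_bruteforce(words, "shout", 5))
# ['scout', 'shoot', 'short', 'shot', 'spout']
```

Every string costs one full dynamic program here, which is the cost that the index-based methods in the following slides try to avoid.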
Related Works
• Search by q-gram (Z. Yang et al.)
▫ Preprocess the string collection into inverted lists of q-grams
▫ Given a query string, compute its q-gram frequencies, then retrieve the top-k results based on the shared q-grams and a chosen distance metric
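The inverted-list idea above can be sketched as follows. This is a simplified illustration, not the algorithm of the cited work: candidates are ranked only by how many distinct query q-grams they share, and the function names, the choice q = 2, and the lack of padding are all assumptions.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int = 2):
    """All overlapping substrings of length q (no padding, by assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q: int = 2):
    """Map each q-gram to the set of ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index


def top_k_by_shared_qgrams(strings, index, query, k, q: int = 2):
    """Rank strings by the number of distinct query q-grams they contain."""
    shared = Counter()
    for g in set(qgrams(query, q)):
        for sid in index.get(g, ()):
            shared[sid] += 1
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], strings[kv[0]]))
    return [(strings[sid], n) for sid, n in ranked[:k]]


words = ["scout", "shoot", "short", "shot", "table"]
index = build_inverted_index(words)
print(top_k_by_shared_qgrams(words, index, "shout", 3))
# [('scout', 2), ('shoot', 2), ('short', 2)]
```

Only strings that share at least one q-gram with the query are ever touched, so "table" is never scored in this example.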
Related Works
• Search with threshold (Z. Zhang et al.)
▫ Order the dictionary by string length, then alphabetically
▫ Similar strings tend to be close together in this ordered dictionary
▫ However, some similar strings may still be scattered across different positions
▫ Divide the query string into n-grams and search for them in a high-dimensional space
Ex: database -> "da", "at", "ta", "ab", …
Related Works
• Similarity join (J. Wang et al.)
▫ Given two sets of strings, find the pairs of similar strings with one string taken from each set
Ex: Given {kobe, ebay…} and {bag, koby}, return <kobe, koby>
▫ Top-k search is a special case of similarity join in which one of the input sets contains only one string
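A naive nested-loop version of the join above makes the definition concrete. It is a sketch only: the threshold tau is an assumed parameter (the slide does not state one), the function names are hypothetical, and only the strings actually named in the slide's example are used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_join(left, right, tau):
    """All pairs (a, b), a from left and b from right, with ED(a, b) <= tau."""
    return [(a, b) for a in left for b in right if edit_distance(a, b) <= tau]


print(similarity_join(["kobe", "ebay"], ["bag", "koby"], 1))
# [('kobe', 'koby')]

# With a single-element left set, the join degenerates to a similarity search.
print(similarity_join(["shout"], ["scout", "shot", "table"], 1))
# [('shout', 'scout'), ('shout', 'shot')]
```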
Related Works
• Top-k similarity search by trie (J. Wang et al.)
▫ Construct a trie for the input set
▫ Search the trie in order of increasing edit distance
▫ Definition: pivot entry <n, j, nc>
Node nc is a child of node n
ED(nc, q[1, j+1]) != ED(n, q[1, j])
Trie-based
• Given query q=“srajit”
• E0
▫ <n0, 0, n21>
▫ <n1, 1, n2>
▫ <n1, 1, n6>
▫ <n1, 1, n11>
Trie-based
• After substitution (increase j and go down one level)
▫ <n0, 0, n21> to <n21, 1, n22>
▫ <n1, 1, n2> to <n2, 2, n3>
▫ …
Trie-based
• After insertion (go down one level)
▫ <n0, 0, n21> to <n21, 0, n22>
▫ <n1, 1, n11> to <n11, 1, n12> and <n11, 1, n16>
Node n16 matches the rest of the query ("rajit"), so "surajit" is added to the result
Trie-based
• After deletion (increase j)
▫ <n0, 0, n21> to <n0, 1, n21>
▫ <n1, 1, n2> to <n1, 2, n2>
▫ …
Trie-based
• Apply substitution, insertion, and deletion to the entries in E0 to extend it to E1 (strings with ED = 1 are found on the fly)
• Repeat the extension from Ei to Ei+1 until k results are found
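The idea of searching a trie by increasing edit distance can be sketched in a simplified form. Note the substitution: instead of maintaining pivot entries and extending Ei to Ei+1 incrementally, this sketch re-traverses the trie for each threshold d = 0, 1, 2, … and carries one dynamic-programming row per node, pruning subtrees whose row already exceeds d. All class and function names are assumptions for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dictionary string ends at this node


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search_within(root, query, d):
    """All (distance, word) pairs in the trie with edit distance <= d."""
    results = []
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.word is not None and row[-1] <= d:
            results.append((row[-1], node.word))
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a query char
                                   row[j] + 1,           # skip a trie char
                                   row[j - 1] + cost))   # match or substitute
            if min(new_row) <= d:  # otherwise no descendant can be within d
                stack.append((child, new_row))
    return results


def top_k_trie(strings, query, k):
    """Raise the threshold d until at least k strings are found."""
    root = build_trie(strings)
    limit = min(k, len(set(strings)))
    d = 0
    while True:
        found = search_within(root, query, d)
        if len(found) >= limit:
            return sorted(found)[:k]
        d += 1


print(top_k_trie(["surajit", "table"], "srajit", 1))
# [(1, 'surajit')]
```

The pivot-entry formulation in the slides avoids the repeated traversal by resuming from the entries left over at the previous distance.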
Trie-based
• A more advanced version uses a range variable to cover several pivot entries at once
▫ <n1, 1, n2>, <n1, 1, n6>, <n1, 1, n11> can be shortened to <1, 5, j, d>
▫ Strings with ids 1 to 5 are pivot entries at depth d, matched against the substring of the query starting at index j
Our Method
• Inspired by the trie-based approach
• Similar strings are still scattered around the trie
▫ symmetry and asymmetry
▫ shout and scout
• Solution: apply clustering so that similar strings are grouped under one center and removed from the main trie
Clustering
Function cluster(S){
    map<string, vector<string>> clusters;
    while(S.length > 0){
        s ← randomly select a string from S
        T ← find strings within one edit-distance of s in S (including s)
        clusters[s] = T;
        erase T strings in S
    }
    return clusters;
}
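The clustering pseudocode above could look like this in runnable form. It is a minimal sketch under assumptions: the helper within_one_edit and the optional rng parameter (added so runs can be reproduced) are not part of the slides, and the quadratic scan is kept deliberately simple.

```python
import random


def within_one_edit(a: str, b: str) -> bool:
    """True if the edit distance between a and b is at most 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]          # exactly one inserted character in b


def cluster(strings, rng=None):
    """Pick a random center, absorb every string within one edit, repeat."""
    rng = rng or random.Random()
    remaining = set(strings)
    clusters = {}
    while remaining:
        center = rng.choice(sorted(remaining))
        members = [s for s in remaining if within_one_edit(center, s)]
        clusters[center] = sorted(members)  # the center is one of its members
        remaining.difference_update(members)
    return clusters


words = ["shout", "scout", "shoot", "shot", "table", "cable", "zebra"]
for center, members in cluster(words, random.Random(0)).items():
    print(center, members)
```

Each round scans all remaining strings, so the total cost grows quadratically with the collection size in the worst case.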
Clustered Top-k Search
Function search(clusters, query, k){
    construct primary trie Trie from centers of clusters
    construct secondary tries sTrie[i] from cluster i
    R = {};
    ActiveCenters = {};
    d = 0;
    while(R.size < k)
        if(d == 0)
            ActiveCenters ← find initial pivot entry(Trie, query)
        else
            ActiveCenters ← ActiveCenters ∪ expand pivot entry(Trie, query)
        end if
        for each center string i in ActiveCenters
            R = R ∪ find strings within edit-distance d in sTrie[i] with query
        end for
        d++;
    end while
    return R;
}
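The control flow of the clustered search above can be sketched without tries. Several things here are assumptions rather than the authors' implementation: plain edit-distance scans stand in for the primary and secondary tries, a center counts as active at distance d when it lies within d + 1 of the query (which follows from the triangle inequality if every member is within one edit of its center), and the example clusters are hand-built from the words on the following slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def clustered_top_k(clusters, query, k):
    """clusters maps a center to its members (each within 1 edit of it)."""
    total = len({m for members in clusters.values() for m in members})
    # Stand-in for the primary trie: distance from the query to each center.
    center_dist = {c: edit_distance(c, query) for c in clusters}
    member_dist = {}
    results = set()
    d = 0
    while len(results) < min(k, total):
        # A member at distance <= d implies its center is at distance <= d + 1.
        active = [c for c, dist in center_dist.items() if dist <= d + 1]
        for c in active:
            for m in clusters[c]:  # stand-in for searching sTrie[c]
                if m not in member_dist:
                    member_dist[m] = edit_distance(m, query)
                if member_dist[m] <= d:
                    results.add(m)
        d += 1
    return sorted(results, key=lambda m: (member_dist[m], m))[:k]


# Hypothetical clusters assembled by hand for illustration.
clusters = {
    "shoot": ["shoot", "shout", "shoots"],
    "shouter": ["shouter", "shoute", "shouters"],
    "shorter": ["shorter", "shoter"],
}
print(clustered_top_k(clusters, "shout", 3))
# ['shout', 'shoot', 'shoute']
```

In this run the "shorter" cluster is never scanned, because its center is at distance 3 from the query and the search stops at d = 1.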
Clustered Top-k Search
Query: shout
Distance   Active Centers   shoot   shouter   shorter
0 shout
1 shoot shoot shoute
2 shouter shoots shouter shoter
3 shorter Shouters shorter
4 shortier
Evaluation
• Dataset A
▫ Around 100,000 common English words
• Dataset B
▫ Around 200,000 words
▫ Dataset A plus words with an additional suffix (e.g. dog, dogs)
• Dataset C
▫ Around 200,000 words
▫ Dataset A plus words with an additional prefix (e.g. top, atop)
• Queries
▫ 100 words randomly selected from the dataset
CPU Time
[Figure: CPU Time on Dataset A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 45; series: Range, Cluster]
[Figure: CPU Time on Dataset B (suffix). x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 45; series: DP, Range]
CPU Time
[Figure: CPU Time on Dataset A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 45; series: Range, Cluster]
[Figure: CPU Time on Dataset C (prefix). x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 80; series: Range, Cluster]
Discussion
• For larger k, our method outperforms the previous method
• Adding words with an extra suffix does not affect the performance of the previous method
• However, adding words with an extra prefix decreases the performance, because such words are scattered across different positions in the trie
Entries
[Figure: # of Entries on A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: # of Entries, 0 to 250000; series: Cluster, Range]
Time to Expand
[Figure: Average Time to Expand Pivot Entries. x-axis: xth entry (0 to 10); y-axis: CPU Time (s), 0 to 3.5; series: Range, Cluster]
[Figure: Average Time to Expand Pivot Entries (Cluster). x-axis: xth entry (0 to 10); y-axis: CPU Time (s), 0 to 2; series: Primary, Secondary]
Scalability Study
[Figure: CPU Time with Different Dataset Size. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 40; series: 12500, 25000, 50000, 100000]
Clustering Study
[Figure: CPU Time with Different # of Cluster Centers. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 50; series: 56335, 61347, 70957, 71036]
Challenge and Future Work
• Dataset
▫ If the dataset is too big, it does not fit in main memory
▫ If the dataset is too small, the search has to reach large edit distances to find k results and becomes very slow
• Clustering
▫ Clustering the data takes a lot of time
▫ The resulting clusters are highly skewed: many of them contain only one string
Task Breakdown
• Chiao-Meng Huang
▫ Implemented range-based top-k string similarity search
▫ Implemented our proposed method
• Guanghao Peng
▫ Paper survey (search by threshold)
▫ Drafting paper
▫ Parsing and preparing dataset
• Liwen Hu
▫ Paper survey (search by q-gram)
▫ Drafting and finalizing our paper
▫ Implemented baseline edit-distance metric (including dynamic programming, progressive, and pivot-entry-based top-k search)
• Qing Hu
▫ Paper survey (similarity join)
▫ Drafting paper
▫ Parsing and preparing dataset