Top-k String Similarity Search

Chiao-Meng Huang

Guanghao Peng

Liwen Hu

Qing Hu

Motivation

Top-k String Similarity Search

• Given a collection of strings and a query string, return the k strings in the collection with the smallest edit distance to the query.

• Ex:

▫ Search “shout” with k = 5

▫ Result: scout, shoot, short, shot, spout
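The problem statement can be made concrete with a brute-force baseline. This is a minimal sketch, not a method from the slides: it assumes plain Levenshtein distance and compares the query against every string. The names `edit_distance` and `top_k_bruteforce` are illustrative.

```python
import heapq

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def top_k_bruteforce(strings, query, k):
    """Return the k closest strings as (distance, string) pairs."""
    scored = ((edit_distance(s, query), s) for s in strings)
    return heapq.nsmallest(k, scored)

words = ["scout", "shoot", "short", "shot", "spout", "shorter", "database"]
print(top_k_bruteforce(words, "shout", 5))
```

On this small word list the five answers are the strings from the example above, each at edit distance 1. The cost is one full dynamic program per string, which is what the index-based methods in the following slides try to avoid.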

Related Works

• Search by q-gram (Z. Yang et al.)

▫ Preprocess the string collection into inverted lists of q-grams

▫ Given a query string, compute its q-gram frequencies, then retrieve the top-k results based on the shared q-grams and a distance metric
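The inverted-list idea can be sketched as follows. This is an illustration under assumptions, not the cited algorithm: it uses unpadded 2-grams and ranks candidates by the number of distinct query q-grams they share, a stand-in for the distance metric that the slide leaves unspecified. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def qgrams(s: str, q: int = 2):
    """All overlapping substrings of length q (no padding, an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index

def top_k_by_shared_qgrams(strings, index, query, k, q=2):
    """Rank strings by how many distinct query q-grams they contain."""
    votes = Counter()
    for g in set(qgrams(query, q)):
        for sid in index.get(g, ()):
            votes[sid] += 1
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], strings[kv[0]]))
    return [(strings[sid], n) for sid, n in ranked[:k]]

strings = ["scout", "shoot", "database", "shot"]
index = build_inverted_lists(strings)
print(top_k_by_shared_qgrams(strings, index, "shout", 2))
```

Strings that share no q-gram with the query ("database" here) are never touched, which is the point of the inverted lists.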

Related Works

• Search with threshold (Z. Zhang et al.)

▫ Order the dictionary by string length, then alphabetically

▫ Similar strings tend to be close together in this ordered dictionary

▫ Some similar strings may still be scattered across distant positions

▫ Divide the query string into n-grams and search for them in a high-dimensional space

Ex: database -> “da”, “at”, “ta”, “ab”, …
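The ordering idea, and its weakness, can be seen on a toy dictionary. The sketch below is an assumption-laden illustration rather than the cited method: `window_candidates` is a hypothetical helper that looks only at the entries next to where the query would sit in the ordered dictionary.

```python
import bisect

def length_then_alpha(strings):
    """Sort by length first, then alphabetically."""
    return sorted(strings, key=lambda s: (len(s), s))

def window_candidates(ordered, query, width):
    """Entries within `width` positions of the query's insertion point."""
    keys = [(len(s), s) for s in ordered]
    pos = bisect.bisect_left(keys, (len(query), query))
    return ordered[max(0, pos - width):pos + width]

ordered = length_then_alpha(["shout", "scout", "shot", "zone", "apple"])
print(ordered)                                 # shot, zone, apple, scout, shout
print(window_candidates(ordered, "shoot", 1))  # scout, shout
```

For the query "shoot", the neighbours "scout" and "shout" are adjacent, but "shot" (also at edit distance 1) sits at the other end of the list because it has a different length.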

Related Works

• Similarity join (J. Wang et al.)

▫ Given two sets of strings, find the pairs of similar strings with one string from each set

Ex: Given {kobe, ebay…}, {bag, koby}, returns <kobe, koby>

▫ Top-k search is a special case of similarity join in which one of the input sets contains only one string
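A naive version of the join makes the special-case relationship explicit. This sketch assumes a threshold-based join (pairs within edit distance `tau`) and checks all pairs. It is not the cited algorithm, and the sets below contain only the strings spelled out in the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity_join(left, right, tau):
    """All pairs (a, b) with a in left, b in right, and ED(a, b) <= tau."""
    return [(a, b) for a in left for b in right
            if edit_distance(a, b) <= tau]

print(similarity_join(["kobe", "ebay"], ["bag", "koby"], tau=1))

# Similarity search as a join whose left side is a single query string.
print(similarity_join(["shout"], ["scout", "shot", "database"], tau=1))
```

With a single-element left set the join degenerates into a search over the right set, which is the observation made on the slide.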

Related Works

• Top-k similarity search by trie (J. Wang et al.)

▫ Construct a trie over the input set

▫ Search the trie in order of increasing edit distance

▫ Definition:

Pivot entry <n, j, nc>

Node nc is a child of node n

ED(nc, q[1, j+1]) != ED(n, q[1, j])

Trie-based

• Given query q=“srajit”

• E0

▫ <n0, 0, n21>

▫ <n1, 1, n2>

▫ <n1, 1, n6>

▫ <n1, 1, n11>

Trie-based

• After substitution (increase j and go down one level)

▫ <n0, 0, n21>

▫ to <n21, 1, n22>

▫ <n1, 1, n2>

▫ to <n2, 2, n3>

▫ …

Trie-based

• After insertion (go down one level, j unchanged)

▫ <n0, 0, n21>

▫ to <n21, 0, n22>

▫ <n1, 1, n11>

▫ to <n11, 1, n12>

▫ and <n11, 1, n16>

Node n16 matches the rest of the query (“rajit”), so “surajit” is added to the result

Trie-based

• After deletion (increase j, stay at the same node)

▫ <n0, 0, n21>

▫ to <n0, 1, n21>

▫ <n1, 1, n2>

▫ to <n1, 2, n2>

▫ …

Trie-based

• Apply substitution, insertion, and deletion to the entries in E0 to extend it to E1 (strings with ED = 1 are found on the fly)

• Repeat the extension from Ei to Ei+1 until k results are found
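The "search by increasing edit distance" loop can be sketched with a simplified trie traversal. This is not the pivot-entry algorithm from the slides: instead of extending Ei to Ei+1 incrementally, it restarts a pruned depth-first search from the root for each distance d, carrying one dynamic-programming row per trie node. All names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dataset string ends at this node

def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root

def within_distance(root, query, d):
    """All (distance, word) pairs in the trie with edit distance <= d."""
    found = []

    def visit(node, row):
        # row[j] = edit distance between this node's prefix and query[:j]
        if node.word is not None and row[-1] <= d:
            found.append((row[-1], node.word))
        for ch, child in node.children.items():
            nxt = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                nxt.append(min(nxt[j - 1] + 1, row[j] + 1,
                               row[j - 1] + (qc != ch)))
            if min(nxt) <= d:  # prune subtrees that cannot get within d
                visit(child, nxt)

    visit(root, list(range(len(query) + 1)))
    return found

def top_k_trie(strings, query, k):
    """Grow the distance bound d until at least k strings are found."""
    root = build_trie(strings)
    limit = max([len(query)] + [len(s) for s in strings])
    d = 0
    while True:
        hits = sorted(within_distance(root, query, d))
        if len(hits) >= k or d >= limit:
            return hits[:k]
        d += 1

print(top_k_trie(["surajit", "database", "shout"], "srajit", 1))
```

The pivot-entry method avoids the repeated work done here by remembering where the search at distance i stopped and resuming from those entries at distance i+1.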

Trie-based

• A more advanced version uses a range variable to cover several pivot entries at once

▫ <n1, 1, n2>, <n1, 1, n6>, <n1, 1, n11> can be shortened to <1, 5, j, d>:

▫ Strings with ids 1 to 5 are pivot entries at depth d, matched against the substring of the query starting at index j

Our Method

• Inspired by the trie-based approach

• Similar strings are still scattered around the trie

▫ symmetry and asymmetry

▫ shout and scout

• Solution: apply clustering to remove similar strings

Clustering

Function cluster(S){
    map<string, vector<string>> clusters;
    while(S.length > 0){
        s ← randomly select a string from S
        T ← find strings with one edit-distance with s from S
        clusters[s] = T;
        erase s and the strings of T from S
    }
    return clusters;
}
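A runnable reading of the clustering pseudocode might look like the sketch below. It rests on assumptions: the random choice is seeded for reproducibility, duplicates are removed first, and membership is tested with a hypothetical linear-time helper `within_one_edit` rather than a full edit-distance computation. The quadratic scan is for illustration only.

```python
import random

def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insert, delete, or substitute."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]          # one extra character in b

def cluster(strings, seed=0):
    """Greedy clustering: pick a random center, absorb its 1-edit neighbours."""
    rng = random.Random(seed)
    remaining = sorted(set(strings))
    clusters = {}
    while remaining:
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [t for t in remaining if within_one_edit(center, t)]
        clusters[center] = members
        absorbed = set(members)
        remaining = [t for t in remaining if t not in absorbed]
    return clusters

print(cluster(["shout", "scout", "shoot", "shot", "database"]))
```

Which strings end up as centers depends on the random picks, but every string lands in exactly one cluster, and a string with no 1-edit neighbour (such as "database" here) always forms a singleton cluster.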

Clustered Top-k Search

Function search(clusters, query, k){
    construct primary trie Trie from centers of clusters
    construct secondary tries sTrie[i] from cluster i
    R = {};
    ActiveCenters = {};
    d = 0;
    while(R.size < k)
        if(d == 0)
            ActiveCenters ← find initial pivot entry(Trie, query)
        else
            ActiveCenters ← ActiveCenters ∪ expand pivot entry(Trie, query)
        end if
        for each center string i in ActiveCenters
            R = R ∪ find strings within edit-distance d in sTrie[i] with query
        end for
        d++;
    end while
    return R;
}

Clustered Top-k Search

Query: shout

Distance  Active Centers  shoot    shouter   shorter
0                         shout
1         shoot           shoot    shoute
2         shouter         shoots   shouter   shoter
3         shorter                  shouters  shorter
4                                            shortier
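The two-level search can be imitated without tries to show the control flow. This sketch makes two assumptions: plain scans replace the primary and secondary tries, and a cluster becomes active at step d once its center is within distance d + 1 of the query. That bound follows from the triangle inequality, since members lie within one edit of their center. The bound is this sketch's choice and may differ from the activation rule used in the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def clustered_top_k(clusters, query, k):
    """clusters maps each center to its members (within 1 edit of it)."""
    center_dist = {c: edit_distance(c, query) for c in clusters}
    total = sum(1 + len(members) for members in clusters.values())
    results, d = [], 0
    while len(results) < min(k, total):
        # A string at distance d from the query belongs to a cluster
        # whose center is at distance <= d + 1, so scan only those.
        active = [c for c, dist in center_dist.items() if dist <= d + 1]
        for c in active:
            for s in [c] + clusters[c]:
                if edit_distance(s, query) == d:
                    results.append((d, s))
        d += 1
    return sorted(results)[:k]

clusters = {
    "shoot": ["shout", "shoots"],
    "shouter": ["shoute", "shouters"],
    "shorter": ["shoter", "shortier"],
}
print(clustered_top_k(clusters, "shout", 3))
```

With the three clusters from the example, the first three answers are "shout" at distance 0 followed by "shoot" and "shoute" at distance 1. The cluster centered on "shorter" is never scanned.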

Evaluation

• Dataset A

▫ Around 100,000 common English words

• Dataset B

▫ Around 200,000 words

▫ Dataset A plus words with an added suffix (e.g. dog, dogs)

• Dataset C

▫ Around 200,000 words

▫ Dataset A plus words with an added prefix (e.g. top, atop)

• Queries

▫ 100 words randomly selected from the dataset

CPU Time

[Figure: CPU Time on Dataset A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 45; series: Range, Cluster]

[Figure: CPU Time on Dataset B (suffix). x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 45; series: DP, Range]

CPU Time

[Figure: CPU Time on Dataset A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 45; series: Range, Cluster]

[Figure: CPU Time on Dataset C (prefix). x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 80; series: Range, Cluster]

Discussion

• At higher k, our method outperforms the previous method

• Adding suffixed words does not affect the performance of the previous method

• However, adding prefixed words decreases performance, because they are scattered across different positions in the trie

Entries

[Figure: # of Entries on A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: # of Entries, 0 to 250,000; series: Cluster, Range]

Time to Expand

[Figure: Average Time to Expand Pivot Entries. x-axis: xth entry (0 to 10); y-axis: CPU Time (s), 0 to 3.5; series: Range, Cluster]

[Figure: Average Time to Expand Pivot Entries (Cluster). x-axis: xth entry (0 to 10); y-axis: CPU Time (s), 0 to 2; series: Primary, Secondary]

Scalability Study

[Figure: CPU Time with Different Dataset Size. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 40; series (dataset size): 12500, 25000, 50000, 100000]

Clustering Study

[Figure: CPU Time with Different # of Cluster Centers. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s), 0 to 50; series (# of cluster centers): 56335, 61347, 70957, 71036]

Challenge and Future Work

• Dataset

▫ If the dataset is too large, it does not fit in main memory

▫ If the dataset is too small, the search has to reach results at large edit distances and becomes very slow

• Clustering

▫ Clustering the data takes a lot of time

▫ The resulting clusters are highly skewed: many of them contain only one string

Task Breakdown

• Chiao-Meng Huang

▫ Implemented range-based top-k string similarity search

▫ Implemented our proposed method

• Guanghao Peng

▫ Paper survey (search by threshold)

▫ Drafting paper

▫ Parsing and preparing dataset

• Liwen Hu

▫ Paper survey (search by q-gram)

▫ Drafting and finalizing our paper

▫ Implemented baseline edit-distance methods (including dynamic programming, progressive, and pivot-entry-based top-k search)

• Qing Hu

▫ Paper survey (similarity join)

▫ Drafting paper

▫ Parsing and preparing dataset
