Top-k Set Similarity Joins
Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang
Univ. of New South Wales, Australia
ICDE ’09
9 Feb 2011
Taewhi Lee
Based on Chuan Xiao’s presentation slides in ICDE ’09
Outline
– Introduction
– Problem Definition
– Existing Approaches
– Top-k Join Similarity Join Algorithms
– Experiments
Motivation: Data Cleaning
University | City | State | Postal Code
University of New South Wales | Sydney | NSW | 2052
University of Sydney | Sydney | NSW | 2006
University of Melbourne | Melbourne | Victoria | 3010
University of Queensland | Brisbane | Queensland | 4072
University of New South Vales | Sydney | NSW | 2052
More Applications: Near-duplicate Web page detection
Obama Has Busy Final Day Before Taking Office as Bush Says Farewells
– New York Times, Jan 19th, 2009
– iht.com, Jan 20, 2009
(Traditional) Set Similarity Join
Each record is tokenized into a set
Given a collection of records, the set similarity join problem is to find all pairs of records, <x,y>, such that sim(x,y) ≥ t
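Common similarity functions:
– jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$
– cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$
– dice: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t$
Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
– jaccard: 4/6 = 0.67
– cosine: 4/5 = 0.8
– dice: 8/10 = 0.8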
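The three similarity functions can be checked against the slide's example with a few lines of Python. This is a minimal sketch; the function names are illustrative, and records are assumed to be non-empty Python sets.

```python
import math

def jaccard(x, y):
    """|x ∩ y| / |x ∪ y|"""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """|x ∩ y| / sqrt(|x| * |y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    """2 * |x ∩ y| / (|x| + |y|)"""
    return 2 * len(x & y) / (len(x) + len(y))

# The example records from the slide.
x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), round(cosine(x, y), 2), round(dice(x, y), 2))
```

The two records share four tokens (B, C, D, E) out of six distinct tokens, which is where the 4/6, 4/5 and 8/10 values come from.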
What if t is unknown beforehand?
What If t is Unknown Beforehand?
Example – using jaccard similarity function
– w = {A, B, C, D, E}
– x = {A, B, C, E, F}
– y = {B, C, D, E, F}
– z = {B, C, F, G, H}
– If t = 0.7 → no results
– If t = 0.4 → <w,x>, <w,y>, <x,y>, <x,z>, <y,z> (too many results and long running time)
Return the top-k results ranked by their similarity values
– if k = 1 → <w,x>
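The example above can be reproduced by brute force. The sketch below scores every pair, then either filters by a threshold or keeps the k best. The helper names are hypothetical, and the tie-breaking rule (by record id) is an assumption: <w,x>, <w,y> and <x,y> all score 4/6 here, so which one counts as "top-1" depends on that rule.

```python
from itertools import combinations

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

def jaccard(x, y):
    return len(x & y) / len(x | y)

def all_pairs(records):
    """Score every pair of records (brute force, O(n^2) comparisons)."""
    return [(jaccard(records[a], records[b]), a, b)
            for a, b in combinations(sorted(records), 2)]

def threshold_join(records, t):
    """Traditional join: every pair with similarity >= t."""
    return [(a, b) for s, a, b in all_pairs(records) if s >= t]

def topk_join_naive(records, k):
    """Top-k join: the k best pairs; ties broken by record id (an assumption)."""
    ranked = sorted(all_pairs(records), key=lambda p: (-p[0], p[1], p[2]))
    return [(a, b) for s, a, b in ranked[:k]]

print(threshold_join(records, 0.7))
print(threshold_join(records, 0.4))
print(topk_join_naive(records, 1))
```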
Top-k Set Similarity Join
Return top-k pairs of records, ranked by similarity scores
Advantages over traditional similarity join
– No threshold needs to be specified
– Results are output progressively, which benefits interactive applications
– Produces the most meaningful results under limited resources/time constraints
Can be stopped at any time, but still guarantees sim(output results) ≥ sim(unseen pairs)
Straightforward Solution
Start from a certain t, repeat the following steps:
– answer traditional sim-join with t as threshold
– if # of results ≥ k, stop and output k results with highest sim
– else, decrease t
Example (jaccard, k = 2)
– w = {A, B, C, E}
– x = {A, B, C, E, F}
– y = {B, C, D, E, F}
– z = {B, C, F, G, H}
– t = 0.9 → no result
– t = 0.8 → <w,x>
– t = 0.7 → <w,x> (results don’t change!)
– t = 0.6 → <w,x>, <x,y>
Which thresholds shall we enumerate?
– 0.8, 0.6
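The loop described on this slide might look as follows. This is a sketch under stated assumptions: `sim_join` is a brute-force stand-in for a real threshold join, and the starting threshold (0.9) and step (0.1) are chosen only to mirror the example.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def sim_join(records, t):
    """Stand-in for a traditional threshold join (brute force here)."""
    out = []
    for a, b in combinations(sorted(records), 2):
        s = jaccard(records[a], records[b])
        if s >= t:
            out.append((s, a, b))
    return out

def topk_by_decreasing_threshold(records, k, start=0.9, step=0.1):
    """Re-run the join with a lower t until at least k results appear."""
    steps = 0
    while True:
        t = max(round(start - steps * step, 10), 0.0)
        results = sim_join(records, t)
        if len(results) >= k or t == 0.0:
            results.sort(key=lambda r: (-r[0], r[1], r[2]))
            return t, results[:k]
        steps += 1

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
t, top2 = topk_by_decreasing_threshold(records, k=2)
print(t, [(a, b) for s, a, b in top2])
```

Every iteration repeats the whole join, including the step from 0.8 to 0.7 where the result set does not change, which is the inefficiency the slide points out.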
Naïve and Index-Based Algorithms
Naïve Algorithm:
– Compare every pair of objects -> O(n²) time complexity
Index-based Algorithm [Sarawagi et al. SIGMOD04]:
– Record Set -> Index Construction -> Candidate Generation -> Verification -> Result Pairs
Inverted lists:
token | record_id
A | w x y
B | x z …
C | y z …
Candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …
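The index construction, candidate generation and verification stages can be sketched in one pass over the records. This is a simplified illustration of the pipeline on the slide, not the algorithm of Sarawagi et al.; the function name and the choice to index records incrementally are assumptions.

```python
from collections import defaultdict

def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_based_join(records, t):
    """Threshold join using inverted lists to generate candidates."""
    inverted = defaultdict(list)  # token -> ids of records indexed so far
    results = []
    for rid in sorted(records):
        tokens = records[rid]
        # Candidate generation: any indexed record sharing at least one token.
        candidates = set()
        for tok in tokens:
            candidates.update(inverted[tok])
        # Verification: compute the exact similarity of each candidate pair.
        for cid in sorted(candidates):
            s = jaccard(records[cid], tokens)
            if s >= t:
                results.append((cid, rid, s))
        # Index construction: add this record to the list of each of its tokens.
        for tok in tokens:
            inverted[tok].append(rid)
    return results

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
print(index_based_join(records, 0.6))
```

Pairs that share no token are never compared, but every token is indexed, so frequent tokens still produce many candidates.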
Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
Sort the tokens by a global ordering
– increasing order of document frequency
Only need to index the first few tokens (prefix) for each record
Example: jaccard t = 0.8 ⇒ |x ∩ y| ≥ 4 if |x| = |y| = 5
– x (sorted) = A B | E F G, prefix = A B
– y (sorted) = C D | E F G, prefix = C D
– the prefixes share no token, so the upper bound on the overlap is O(x,y) = 3 < 4!
Must share at least one token in prefix to be a candidate pair
– For jaccard, prefix length = |x| * (1 – t) + 1
– each t is associated with a prefix length
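A small sketch of the prefix filter on the slide's example. It assumes the token lists are already sorted by the global ordering, and it rounds the slide's prefix length formula down to an integer (the floor is an assumption, as is the small epsilon that guards against floating-point error). The function names are illustrative.

```python
import math

def prefix_length(size, t):
    """Jaccard prefix length: floor(|x| * (1 - t)) + 1, capped at |x|."""
    n = math.floor(size * (1 - t) + 1e-9) + 1
    return min(size, n)

def prefix(tokens, t):
    """First few tokens of a record already sorted by the global ordering."""
    return tokens[:prefix_length(len(tokens), t)]

def may_be_similar(x, y, t):
    """Prefix filter: a candidate pair must share a token in the prefixes."""
    return bool(set(prefix(x, t)) & set(prefix(y, t)))

x = ["A", "B", "E", "F", "G"]
y = ["C", "D", "E", "F", "G"]
print(prefix(x, 0.8), prefix(y, 0.8), may_be_similar(x, y, 0.8))
print(prefix(x, 0.4), prefix(y, 0.4), may_be_similar(x, y, 0.4))
```

At t = 0.8 both prefixes have length 2 and are disjoint, so the pair is pruned without computing its similarity. At t = 0.4 the prefixes grow to four tokens and share E, so the pair becomes a candidate.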
Necessary Thresholds
Each prefix is associated with a threshold
– the maximum possible similarity a record can achieve with other records
Example:
x = A B C
t = 1.0 0.8 0.6
[Figure: thresholds at each prefix position for three records of different sizes]
x: 1.0 0.75 0.5 0.25
y: 1.0 0.8 0.6 0.4 0.2
z: 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
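The threshold rows in the figure are consistent with inverting the jaccard prefix length formula: a prefix of length p in a record of size |x| corresponds to t = 1 – (p – 1) / |x|. The sketch below computes them this way; the inversion is my reading of the slides and the function name is hypothetical.

```python
from fractions import Fraction

def prefix_thresholds(size):
    """Threshold for each prefix length 1..size: t = 1 - (p - 1) / size."""
    # Fractions avoid floating-point drift before the final conversion.
    return [float(1 - Fraction(p - 1, size)) for p in range(1, size + 1)]

for size in (4, 5, 10):
    print(size, prefix_thresholds(size))
```

Only these values can change the result of a prefix-filter join on a record, which is why enumerating thresholds such as 0.7 in the earlier example was wasted work.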
Event-driven Model
Problem: repeated invocation of sim-join algorithm
– t is decreasing → run the sim-join algorithm in an incremental way
Prefix Event <x, A, t>
– Initialize the prefix length of each record as 1: <x, A, 1.0>
– For each prefix event:
  – Probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into the temp results
  – Insert x into A’s inverted list
  – Extend the prefix by one token
– Maintain prefix events with a max-heap on t
– Stop when t ≤ k-th temp result’s similarity
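The event-driven loop can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: records are given as token lists already sorted by the global ordering, the threshold of the event at 0-based prefix position p is taken to be 1 – p / |x|, verified pairs are simply memorized in a dict, and the k-th similarity is recomputed on each event for brevity.

```python
import heapq
from collections import defaultdict

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def topk_join(records, k):
    """records: id -> token list sorted by the global ordering (assumed)."""
    # heapq is a min-heap, so t is negated to pop the largest t first.
    events = [(-1.0, rid, 0) for rid in sorted(records)]
    heapq.heapify(events)
    inverted = defaultdict(list)  # token -> records whose prefix reached it
    verified = {}                 # (id, id) -> similarity (temp results)

    def kth_sim():
        if len(verified) < k:
            return 0.0
        return heapq.nlargest(k, verified.values())[-1]

    while events:
        neg_t, rid, pos = heapq.heappop(events)
        if -neg_t <= kth_sim():
            break  # no unseen pair can beat the current k-th result
        tok = records[rid][pos]
        # Probe the token's inverted list and verify new candidate pairs.
        for other in inverted[tok]:
            pair = tuple(sorted((rid, other)))
            if pair not in verified:
                verified[pair] = jaccard(records[rid], records[other])
        inverted[tok].append(rid)
        # Extend the prefix by one token and schedule the next event.
        if pos + 1 < len(records[rid]):
            next_t = 1 - (pos + 1) / len(records[rid])
            heapq.heappush(events, (-next_t, rid, pos + 1))

    ranked = sorted(verified.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

records = {
    "w": ["A", "B", "C", "E"],
    "x": ["A", "B", "C", "E", "F"],
    "y": ["B", "C", "D", "E", "F"],
    "z": ["B", "C", "F", "G", "H"],
}
print(topk_join(records, 2))
```

On these four records with k = 2, the loop stops when the next event has t = 0.6, which is not greater than the second-best verified similarity (4/6).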
Topk-join - Example (jaccard, k=2)
Records:
w = A B C E
x = A B C E F
y = B C D E F
z = B C F G H
Inverted list:
token | record_id
A | w x
B | y z x w
C | y z
Prefix events: <x, B, 0.8>, <y, C, 0.8>, <z, C, 0.8>, <w, B, 0.75>
Temporary results: (w,x) = 0.8, (y,z) = 0.43, (x,y) = 0.67
verified twice!
Stop: t = 0.6 ≤ 2nd temp result’s sim
Optimizations - Verification
In the above example, (w,x) and (y,z) have been verified twice
How to avoid repeated verification?
– Memorize all verified pairs with a hash table → too much memory consumption
– Check if this pair will be identified again when it is verified for the first time
– Keep only those that will be identified again before the algorithm stops
– Guarantee no pair will be verified twice
Example:
x = A B D E F
y = A C D E F
prefix thresholds: 1.0 0.8 0.6
if k-th temp result’s sim = 0.7, the pair won’t be identified again!
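One way to read the check above: when a pair is first verified, find the next token the two records share and compute the threshold at which a later prefix event would reach it. If that threshold is not above the current k-th similarity, the algorithm stops first and the pair need not be remembered. The sketch below is my interpretation, not the paper's exact procedure; it assumes both token lists follow the same global ordering, and the function names are hypothetical.

```python
def next_identification_threshold(x, y, i):
    """x, y: token lists in the global ordering; i: 0-based position in x of
    the shared token where the pair was first identified.
    Returns the threshold at which the pair would be found again, or None."""
    pos_in_y = {tok: q for q, tok in enumerate(y)}
    for p in range(i + 1, len(x)):
        q = pos_in_y.get(x[p])
        if q is not None:
            # Both prefix events must have fired, so the lower threshold wins.
            return min(1 - p / len(x), 1 - q / len(y))
    return None

def must_remember(x, y, i, kth_sim):
    """Keep the pair in the hash table only if it can be identified again."""
    t = next_identification_threshold(x, y, i)
    return t is not None and t > kth_sim

x = ["A", "B", "D", "E", "F"]
y = ["A", "C", "D", "E", "F"]
print(next_identification_threshold(x, y, 0))
print(must_remember(x, y, 0, kth_sim=0.7))
print(must_remember(x, y, 0, kth_sim=0.5))
```

In the slide's example the next shared token is D at the third position, with threshold 0.6, so with a k-th similarity of 0.7 the pair is not kept. Because the k-th similarity never decreases, a pair dropped this way cannot be re-verified later.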
Optimizations - Indexing
How to reduce inverted list size to save memory?
– t is decreasing → calculate the upper bound of similarity for future probings into inverted lists
– Don’t insert into inverted list if upper bound ≤ k-th temp result’s similarity
Example:
x = A C D E F
y = B C D E F
at token C (t = 0.8): max. similarity = 4/6 = 0.67
Experiment Settings
Algorithms
– topk-join
– pptopk: modified ppjoin [Xiao, et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85...
Measure
– Compare topk-join and pptopk (candidate size, running time)
– Output results progressively
Dataset
dataset | # of records | avg. record size
DBLP (author, title) | 855k | 14.0
TREC (author, title, abstract) | 348k | 130.1
TREC-3GRAM | 348k | 868.5
UNIREF-3GRAM (protein seq.) | 500k | 372.9
Experiment Results
Thank You!
Any questions or comments?
Related Work
Index-based approaches
– S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004
– C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008
Prefix-based approaches
– S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006
– R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007
– C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008
PartEnum
– A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006