Similarity join problem with Pass-Join-K using Hadoop
By Yu Haiyang
23/4/19 http://datamining.xmu.edu.cn 2/32
Outline
Background
The introduction of Pass-Join-K
Combining Pass-Join-K with Hadoop
Background
Similarity join: find all similar pairs from two sets.
Applications:
Data cleaning
Query relaxation
Spellchecking
Examples of similar pairs: “PO BOX 23, Main St.” and “P.O. Box 23, Main St”; “information” and “imformation”.
Background
How to define similarity?
Jaccard distance
Cosine distance
Edit distance
Background
Edit distance
The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another.
Baby → Body: substitution (a → o, b → d)
Bod → Body: insertion (y)
Background
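How does edit distance compare with the other two?
Accuracy: edit distance is sensitive to character order. The strings in {“abcdefg”, “gfedcba”} contain exactly the same characters, so a set-based measure over characters cannot tell them apart, while their edit distance is large.
Verification time: O(mn) for edit distance, versus O(m+n) for the other two.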
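The accuracy point can be checked with a small sketch. It assumes the Jaccard distance is taken over the set of characters of each string, which the slides do not specify; the function names are illustrative only.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance over character sets (an assumed tokenization)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


print(jaccard_distance("abcdefg", "gfedcba"))  # 0.0
print(edit_distance("abcdefg", "gfedcba"))     # 6
```

The set-based measure reports the reversed string as identical, while the edit distance reports six operations.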
Background
Find similar pairs
We have two string sets: one is {vldb, sigmod, ….}, the other is {pvldb, icde, …}.
First find candidate pairs, then verify them.
Candidates: {<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, ….}
Verification: <vldb,pvldb> Yes; <vldb,icde> No
Background
So we have to:
Find candidate pairs. There are O(N^2) of them if we do not prune any.
Verify these pairs. Each verification costs O(mn).
Introduction of Pass-Join-K
Some obvious pruning techniques
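Length-based: with threshold = 2, the pair <“ab”, “abcee”> is pruned because the lengths differ by 3.
Shift-based: for <“abcd”, “cdef”>, the common substring “cd” matches only after shifting one string by two positions:
a b c d
    c d e f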
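The length-based rule follows from the fact that each insertion or deletion changes the length by one, so the edit distance is at least the length difference. A minimal sketch, with an illustrative function name:

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Keep a pair only if the length difference does not exceed tau."""
    return abs(len(r) - len(s)) <= tau


print(passes_length_filter("ab", "abcee", 2))  # False
```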
Introduction of Pass-Join-K
Partition-based pruning technique
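Suppose the threshold tau = 2 and K = 2, and we have the pair <“abcdefghijk”, “abdefghk”>.
Segments of the first string: abc def ghi jk
The second string, aligned: ab def gh k
At most tau edit operations can touch at most tau of the tau+K segments, so a similar pair must share at least K segments. Here only “def” is shared, so the pair is pruned.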
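The counting argument can be tried directly on this pair. The sketch below is a simplification: it ignores segment positions and only checks whether each segment occurs anywhere in the other string, which is a weaker filter than a position-aware one but still never drops a truly similar pair. The function names are illustrative.

```python
def count_matching_segments(segments, s):
    """Number of segments that occur somewhere in s as a substring."""
    return sum(1 for seg in segments if seg in s)


def may_be_similar(segments, s, k):
    """A pair survives only if at least k segments are found in s."""
    return count_matching_segments(segments, s) >= k


segments = ["abc", "def", "ghi", "jk"]  # tau + K = 4 segments of "abcdefghijk"
print(count_matching_segments(segments, "abdefghk"))  # 1
print(may_be_similar(segments, "abdefghk", k=2))      # False
```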
Introduction of Pass-Join-K
Partition Scheme
We have seen that the longer the substrings are, the less likely they are to match.
So we break the string into tau+k parts of nearly equal length: each part has length ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1.
abc def ghi jk
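A minimal sketch of this even partition is given below. Placing the longer parts first is an assumption chosen to reproduce the "abc def ghi jk" example; the function name is illustrative, and strings shorter than tau+k (which would produce empty parts) are not handled.

```python
def partition(string: str, tau: int, k: int) -> list:
    """Split string into tau + k parts whose lengths differ by at most one."""
    parts = tau + k
    base, extra = divmod(len(string), parts)
    segments = []
    start = 0
    for i in range(parts):
        # The first `extra` parts get one extra character (assumed ordering).
        length = base + 1 if i < extra else base
        segments.append(string[start:start + length])
        start += length
    return segments


print(partition("abcdefghijk", tau=2, k=2))  # ['abc', 'def', 'ghi', 'jk']
```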
Introduction of Pass-Join-K
Partition Scheme
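r = “abcdefghijk”, s = “abdefghk”
Segments of r: abc def ghi jk
[Figure: inverted index L11 for strings of length 11, with segments 1 to 4 (abc, def, ghi, jk) each pointing to r; the substring “def” of s is matched.]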
Introduction of Pass-Join-K
Substring Selection
Here we suppose tau = 3 and k = 1.
Segments of r: abc def ghi jk
Characters of s: a b d e f g h k
[Animation frames omitted: they step through the substrings of s selected for matching, showing groupings such as “a b d e f gh k” and “abd efg hk”.]
Introduction of Pass-Join-K
Substring Selection
So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 (“Pass-Join-K: A Similarity Join Algorithm Based on Multi-Segment Matching”).
Introduction of Pass-Join-K
Verification
DP (dynamic programming)
• D(m,n) = min(D(m,n-1)+1, D(m-1,n)+1, D(m-1,n-1)+flag), where flag = 0 when s_m = r_n and flag = 1 otherwise; s and r are the two strings being compared.
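The recurrence can be turned into a threshold check as sketched below. The early exit (stopping once every entry of a row exceeds tau) is a common optimization added here as an assumption; the slides do not state it. The function name is illustrative.

```python
def within_threshold(r: str, s: str, tau: int) -> bool:
    """Return True if the edit distance between r and s is at most tau."""
    if abs(len(r) - len(s)) > tau:
        return False
    prev = list(range(len(s) + 1))  # distances from the empty prefix of r
    for i in range(1, len(r) + 1):
        cur = [i]
        for j in range(1, len(s) + 1):
            flag = 0 if r[i - 1] == s[j - 1] else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + flag))
        if min(cur) > tau:
            # Later rows can only grow, so the final distance exceeds tau.
            return False
        prev = cur
    return prev[-1] <= tau


print(within_threshold("abcdefghijk", "abdefghk", 3))  # True
print(within_threshold("abcdefghijk", "abdefghk", 2))  # False
```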
Introduction of Pass-Join-K
Verification
Here we suppose tau = 3 and k = 1.
abc def ghi jk
def e f g h k
Tau_left = 3
Tau_right = 3 - 3 = 0
Combining Pass-Join-K with Hadoop
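Inverted index tree in Hadoop
Index records for r = “abcdefghijk”, in the form (segment, segment number, string length, id):
(abc,1,11,r) (def,2,11,r) (ghi,3,11,r) (jk,4,11,r)
[Figure: inverted index L11, with segments abc, def, ghi, jk numbered 1 to 4, each pointing to r.]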
Combining Pass-Join-K with Hadoop
Substrings in Hadoop
Suppose tau = 3, k = 1, and s = “abdefghk”, so length(s) = 8. We have to generate records such as (a,1,5,s), (a,2,6,s), (a,3,7,s), (ab,1,8,s), …, (ab,1,11,s), …
Combining Pass-Join-K with Hadoop
Substrings in Hadoop
Suppose tau = 3, k = 1, and s = “abdefghk”, so length(s) = 8. We have to generate more than 2*tau*(tau+k)*m records, where m is the average number of substrings selected for each segment, such as (a,1,5,s), (a,1,6,s), (a,1,7,s), (ab,1,8,s), …, (ab,1,11,s), …
Combining Pass-Join-K with Hadoop
Data flows in hadoop
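One way the index records and substring records could meet in a MapReduce job is sketched below as an in-memory simulation in plain Python; no Hadoop API is used. Everything here is an assumption made for illustration: the function names, the key (substring, segment number, length), the longer-parts-first segment layout, and the simple substring selection that tries every start position within tau of the segment's position. Strings shorter than tau+k are skipped.

```python
from collections import defaultdict


def segments(length, parts):
    """(start, seg_len) of each part of a string of the given length."""
    base, extra = divmod(length, parts)
    out, start = [], 0
    for i in range(parts):
        seg_len = base + 1 if i < extra else base
        out.append((start, seg_len))
        start += seg_len
    return out


def map_index(rid, r, tau, k):
    """Mapper for set R: emit one index record per segment of r."""
    for num, (start, seg_len) in enumerate(segments(len(r), tau + k), start=1):
        yield (r[start:start + seg_len], num, len(r)), ("index", rid)


def map_probe(sid, s, tau, k):
    """Mapper for set S: emit substrings of s that could equal a segment."""
    for length in range(max(len(s) - tau, tau + k), len(s) + tau + 1):
        for num, (start, seg_len) in enumerate(segments(length, tau + k), start=1):
            lo = max(0, start - tau)
            hi = min(len(s) - seg_len, start + tau)
            for p in range(lo, hi + 1):
                yield (s[p:p + seg_len], num, length), ("probe", sid)


def reduce_join(records):
    """Shuffle by key, then pair index ids with probe ids under each key."""
    groups = defaultdict(lambda: {"index": set(), "probe": set()})
    for key, (kind, ident) in records:
        groups[key][kind].add(ident)
    matched = defaultdict(set)  # (rid, sid) -> matched segment numbers
    for (_, num, _), group in groups.items():
        for rid in group["index"]:
            for sid in group["probe"]:
                matched[(rid, sid)].add(num)
    return matched


def candidates(R, S, tau, k):
    """Pairs that share at least k segments; these still need verification."""
    records = []
    for rid, r in R.items():
        records.extend(map_index(rid, r, tau, k))
    for sid, s in S.items():
        records.extend(map_probe(sid, s, tau, k))
    matched = reduce_join(records)
    return {pair for pair, nums in matched.items() if len(nums) >= k}


print(candidates({"r": "abcdefghijk"}, {"s": "abdefghk"}, tau=3, k=1))  # {('r', 's')}
```

Each surviving pair would then be passed to the edit-distance verification step.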
Combining Pass-Join-K with Hadoop
How to improve the performance?
We know that as k increases, the number of pairs we need to verify decreases.
But as k increases, the number of records that must be transferred in the Mapper phase grows by a factor of more than (tau+k+1)/(tau+k).
Combining Pass-Join-K with Hadoop
Here we have two ways to improve our algorithm:
Find a dataset in which the number of candidate pairs is large enough, or make tau large enough.
Decrease the data generated in the Mapper phase.
Combining Pass-Join-K with Hadoop
Decrease the data flows
The inverted index record is formatted as (substring, segmentNumber, LengthInf, Id, flag).
• Each record’s length is length(substring) + 4*sizeof(int), and the substring can sometimes be very long.
• Hash(substring) -> integer; the record length then becomes 5*sizeof(int).
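A sketch of such a fixed-size record follows. It assumes 4-byte integers, an integer string id, and CRC32 as the hash; none of these choices come from the slides, and the meaning of `flag` is not stated there either. Two different substrings may share a hash value, which can only add candidate pairs for the verification step to reject.

```python
import struct
import zlib


def pack_record(substring, segment_number, length_inf, string_id, flag):
    """Pack a record as five 4-byte integers, hashing the substring."""
    h = zlib.crc32(substring.encode("utf-8"))  # unsigned 32-bit hash
    return struct.pack("<Iiiii", h, segment_number, length_inf, string_id, flag)


record = pack_record("abc", 1, 11, 42, 0)
print(len(record))  # 20
```

The packed size no longer depends on the substring length.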
Combining Pass-Join-K with Hadoop
Decrease the data flows
Each substring generates several similar records, such as (a,1,5,s), (a,1,6,s), (a,1,7,s), …
• Each substring generates tau+k similar records, so we combine them into one, for example (a,1,5,7,s). This reduces (tau+k)*4*sizeof(int) to 5*sizeof(int).
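The combining step can be sketched as a merge of consecutive lengths. The record layout (substring, segment, length, id) and the function name are assumptions for illustration.

```python
def combine(records):
    """Merge records that differ only in length into length ranges.

    Input:  (substring, segment, length, id)
    Output: (substring, segment, min_length, max_length, id)
    """
    by_key = {}
    for sub, seg, length, sid in records:
        by_key.setdefault((sub, seg, sid), set()).add(length)
    out = []
    for (sub, seg, sid), lengths in by_key.items():
        ordered = sorted(lengths)
        start = prev = ordered[0]
        for length in ordered[1:]:
            if length != prev + 1:
                # A gap closes the current range and starts a new one.
                out.append((sub, seg, start, prev, sid))
                start = length
            prev = length
        out.append((sub, seg, start, prev, sid))
    return out


print(combine([("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s")]))
# [('a', 1, 5, 7, 's')]
```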
Combining Pass-Join-K with Hadoop
Decrease the data flows
So by using the two steps above, we have reduced (length(substring)+4*sizeof(int))*(tau+k) to 5*sizeof(int).
Email: [email protected]