Post on 27-Jan-2017
Privacy Preserving Record Linkage with PPJoin
Ziad Sehili, Lars Kolb, Christian Borgs, Rainer Schnell, Erhard Rahm
Datenbanksysteme für Business, Technologie und Web (BTW), 2015
September 15, 2016
Presentation by Mateus Cruz
Introduction Preliminaries Proposal Experiments Conclusion
OUTLINE
1 Introduction
2 Preliminaries
3 Proposal
4 Experiments
5 Conclusion
OVERVIEW
- Find pairs of similar records
- Quadratic complexity
  - Scalability problems
- Adapt PPJoin [1] to encrypted data
  - Filtering reduces search space
- Parallelize to improve performance
  - GPUs
[1] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu: “Efficient Similarity Joins for Near Duplicate Detection”, WWW 2008
DATA REPRESENTATION
- Create Bloom filters using MD5 and SHA-1
  - Deterministic
  - Similarity preserving
  - Allows length filtering
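As a rough illustration of the encoding, the sketch below maps the bigrams of a string to bit positions. The slide only names MD5 and SHA-1; combining them as (MD5 + i · SHA-1) mod m for the i-th hash function is an assumption made here, and the names `bigrams` and `bloom_encode` are hypothetical. The defaults m = 1000 and k = 20 are taken from the experimental setup.

```python
import hashlib

def bigrams(text):
    """Split a string into its overlapping character bigrams (the tokens)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bloom_encode(text, m=1000, k=20):
    """Return the set of bit positions that are set to 1 for `text`.

    Assumption: the i-th hash value is (MD5 + i * SHA-1) mod m.
    """
    bits = set()
    for gram in bigrams(text):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits

a = bloom_encode("smith")
b = bloom_encode("smyth")
# Both names contain the bigram "sm", so its bit positions appear in both filters.
shared = a & b
```

Because the hash functions are deterministic, equal bigrams always set the same bits, which is what makes the overlap of two filters reflect the overlap of the underlying records.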
PPJOIN [2]
- Position Prefix Join
- Signature-based algorithm
- Filtering techniques
  - Length filter
  - Prefix filter
  - Position filter
[2] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu: “Efficient Similarity Joins for Near Duplicate Detection”, WWW 2008
LENGTH FILTER
“If two records are similar, the difference between their lengths cannot be large”
- Sort records by length
- Using Jaccard similarity: δ|s| ≤ |r| ≤ |s|/δ
  - |s|: length of s
  - δ: similarity threshold
- Group records according to their lengths
  - Prune pairs of records from different groups
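The length condition can be checked directly, as in the minimal sketch below. The function name `passes_length_filter` is hypothetical, and the threshold is stored as an exact fraction so that boundary cases are not affected by floating-point rounding.

```python
from fractions import Fraction

def passes_length_filter(len_r, len_s, delta):
    """True if delta*|s| <= |r| <= |s|/delta, i.e. the pair is kept."""
    return delta * len_s <= len_r <= len_s / delta

delta = Fraction(4, 5)  # threshold 0.8, kept exact
print(passes_length_filter(4, 6, delta))  # False: 4 < 0.8 * 6 = 4.8
print(passes_length_filter(5, 6, delta))  # True: 4.8 <= 5 <= 7.5
```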
PREFIX FILTER
“If two records are similar, they must share some tokens”
- Sort tokens in each record
  - Alphabetical order, IDF order, etc.
- Select the first p tokens
  - For Jaccard similarity (JS), p = ⌊(1 − δ)|s|⌋ + 1
- Prune pairs for which s_p ∩ r_p = ∅
  - s_p: prefix of s (containing the first p tokens)
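A small sketch of the prefix filter under these definitions follows. The helper names are hypothetical, alphabetical order stands in for whichever global token order is used, and the record `t` is made up for illustration. The threshold is an exact fraction because (1 − 0.8) · 5 evaluates to slightly less than 1 in floating point, which would shorten the prefix by one token.

```python
import math
from fractions import Fraction

def prefix(tokens, delta):
    """First p tokens of the sorted record, p = floor((1 - delta) * |s|) + 1."""
    ordered = sorted(tokens)  # alphabetical order; any fixed global order works
    p = math.floor((1 - delta) * len(ordered)) + 1
    return set(ordered[:p])

def passes_prefix_filter(r, s, delta):
    """True if the two prefixes share at least one token, i.e. the pair is kept."""
    return bool(prefix(r, delta) & prefix(s, delta))

delta = Fraction(4, 5)
r = {"B", "C", "D", "E", "F"}
s = {"A", "B", "C", "D", "F"}
t = {"C", "D", "E", "F", "G"}
print(passes_prefix_filter(r, s, delta))  # True: prefixes {B, C} and {A, B} share B
print(passes_prefix_filter(s, t, delta))  # False: {A, B} and {C, D} are disjoint
```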
POSITION FILTER
“If two records are similar, their maximal overlap cannot be smaller than the minimally needed overlap”
- Minimal overlap
  - α = ⌈t/(1 + t) · (|r| + |s|)⌉, where t is the similarity threshold (δ on the other slides)
- Divide each record into left and right parts
  - lp: tokens already seen
  - rp: unseen tokens
- Prune if |lp(r) ∩ lp(s)| + min(|rp(r)|, |rp(s)|) < α
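The pruning condition can be sketched as follows. The function names and the convention that positions `i` and `j` are zero-based indices of the token currently being compared are assumptions made for this example; the two records are the ones used in the extra slides.

```python
import math
from fractions import Fraction

def min_overlap(len_r, len_s, t):
    """alpha = ceil(t / (1 + t) * (|r| + |s|))."""
    return math.ceil(t / (1 + t) * (len_r + len_s))

def pruned_by_position(r, s, i, j, t):
    """r, s: token lists in the same global order; i, j: current positions.

    The left parts are the tokens seen so far (up to and including i and j),
    the right parts are the unseen remainders.
    """
    seen = len(set(r[:i + 1]) & set(s[:j + 1]))
    unseen = min(len(r) - i - 1, len(s) - j - 1)
    return seen + unseen < min_overlap(len(r), len(s), t)

t = Fraction(4, 5)
r = ["B", "C", "D", "E", "F"]
s = ["A", "B", "C", "D", "F"]
print(min_overlap(len(r), len(s), t))     # 5
print(pruned_by_position(r, s, 0, 1, t))  # True: at token B, 1 + 3 = 4 < 5
```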
PPJOIN PREPROCESSING
PPJOIN INDEX
- Pair (r1, r4) filtered by length filter
  - |r1| < δ·|r4| (4 < 0.8·6)
- Pair (r3, r2) filtered by position filter
P4JOIN
- “PPJoin for Encrypted Data” (P4Join)
- Records are Bloom filters (BFs) of fixed size
- Consider bit positions as tokens
- Length is the number of 1 bits
  - Called cardinality
- Does not need an inverted index
P4JOIN PREPROCESSING
- Length is the number of 1 bits (cardinality)
  - Prefixes with same lengths, but different sizes
P4JOIN PROCESSING
- High cost to maintain inverted index
- Original position filter reduces performance
- lmap
  - Lists relevant records based on length filter
P4JOIN LENGTH FILTER
- r1 does not satisfy the length filter
  - 7 < 0.8·11
P4JOIN PREFIX FILTER
- Check overlap by AND operation
  - Prune pair (r4, r2)
    - 000011011 AND 1111 = 000000000
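The AND check on bit arrays might look like the sketch below, which uses Python integers as bit sets. The helper names are hypothetical, and the two bit strings from the slide are assumed to be aligned at their first bit, so the shorter one is padded with zeros on the right.

```python
def to_bits(bit_string, width):
    """Parse a bit string, padding it on the right with zeros up to `width`.

    Assumption: the strings on the slide are aligned at their first bit.
    """
    return int(bit_string.ljust(width, "0"), 2)

def cardinality(bits):
    """Number of 1 bits, i.e. the record length used by P4Join."""
    return bin(bits).count("1")

def prefixes_overlap(prefix_r, prefix_s, width):
    """True if the two prefix bit arrays share at least one set bit."""
    return (to_bits(prefix_r, width) & to_bits(prefix_s, width)) != 0

print(prefixes_overlap("000011011", "1111", 9))  # False, so the pair is pruned
print(cardinality(to_bits("000011011", 9)))      # 4
```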
P4JOIN POSITION FILTER
Prune pair (r4,r3)
P4JOIN WITH GPUS
- Bit arrays of type long (64 bits)
- Divide R and S into partitions
  - To fit in the GPU’s memory
- Sort records in partitions
- Check if partitions have candidate pairs
  - Using length filter
  - If no candidates, do not even send to GPU
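One way to picture the partition-level check is the host-side sketch below. It is a simplification with hypothetical names: each record is represented only by its cardinality, and a pair of partitions is skipped when the smallest and largest cardinalities already rule out every pair under the length filter. How the original implementation performs this test is not shown on the slide.

```python
from fractions import Fraction

def partition(cardinalities, max_size=2000):
    """Sort the cardinalities and cut them into partitions of at most max_size."""
    ordered = sorted(cardinalities)
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]

def may_have_candidates(part_r, part_s, delta):
    """Conservative check on two sorted partitions of cardinalities.

    If this returns False, no pair (r, s) can satisfy
    delta*|s| <= |r| <= |s|/delta, so the partitions need not be compared.
    """
    min_r, max_r = part_r[0], part_r[-1]
    min_s, max_s = part_s[0], part_s[-1]
    return max_r >= delta * min_s and min_r <= max_s / delta

delta = Fraction(4, 5)
parts_r = partition([10, 11, 12, 40, 41, 42], max_size=3)
parts_s = partition([12, 13, 14], max_size=3)
for part_r in parts_r:
    print(part_r, may_have_candidates(part_r, parts_s[0], delta))
# [10, 11, 12] True
# [40, 41, 42] False
```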
P4JOIN GPU PROCESSING
- One kernel per record of Ri
  - Comparing with all records from Sj
  - Prune using length and prefix filters
  - Matches are saved in the global memory
SETUP
- Hardware
  - CPU 4-core 2.67 GHz, 4 GB memory
  - GPUs
    - NVIDIA GeForce GT 610 (1 GB memory)
    - NVIDIA GeForce GT 540M (1 GB memory)
- Parameters
  - Bigrams as tokens
  - Bit vector length: 1000
  - JS threshold: 0.8
  - Number of hash functions: k = 20
  - Partitions maximum size: 2000
CPU PERFORMANCE
- Most gains from length filter
- Large overhead for prefix filter
GPU PERFORMANCE
- Speedups of 20%
  - Compared to sequential CPU approach
SUMMARY
- Adaptation of PPJoin to PPRL
  - Records are encrypted bit arrays
- Parallelization using GPUs
- Bit arrays reduce effectiveness of filters
  - Due to overheads
Algorithms Detailed Filters
EXTRA SLIDES
P4JOIN ALGORITHM
POSITION FILTER
“If two records are similar, the upper bound of their JS cannot be smaller than the threshold δ”
- Compute prefixes s_p and r_p
- Calculate the upper bound of their JS (Θ):
  - Θ = (|s_p ∩ r_p| + min(|s| − |s_p|, |r| − |r_p|)) / (|s_p ∪ r_p| + max(|s| − |s_p|, |r| − |r_p|))
- Prune the pair if Θ < δ
- Example [a]: r = {B, C, D, E, F}, s = {A, B, C, D, F}, δ = 0.8
  - Θ = (1 + 3)/(3 + 3) = 4/6 ≈ 0.67 → prune pair (r, s)
[a] Example from Jiang et al.: “String similarity joins: An experimental evaluation”, VLDB 2014
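The worked example can be reproduced with a few lines of Python. This is only a sketch: the function name `js_upper_bound` is hypothetical, tokens are single characters sorted alphabetically, and the prefix length is assumed to follow p = ⌊(1 − δ)|s|⌋ + 1 from the prefix filter slide. Exact fractions are used so that the prefix length is not affected by floating-point rounding.

```python
import math
from fractions import Fraction

def js_upper_bound(r, s, delta):
    """Theta: upper bound of the Jaccard similarity, computed from the prefixes."""
    r, s = sorted(r), sorted(s)
    rp = set(r[:math.floor((1 - delta) * len(r)) + 1])
    sp = set(s[:math.floor((1 - delta) * len(s)) + 1])
    rest_r, rest_s = len(r) - len(rp), len(s) - len(sp)
    return Fraction(len(sp & rp) + min(rest_s, rest_r),
                    len(sp | rp) + max(rest_s, rest_r))

delta = Fraction(4, 5)
theta = js_upper_bound("BCDEF", "ABCDF", delta)
print(theta, theta < delta)  # 2/3 True, so the pair is pruned
```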