K-repeating Substrings: a String-Algorithmic Approach to Privacy-Preserving Publishing of Textual...
1
K-repeating substrings: a string-algorithmic approach to privacy-preserving publication of textual data
2014-12-14, PACLIC 28 (Phuket)
Yusuke Matsubara*, Kôiti Hasida*
*The University of Tokyo
2
Background: easier sharing of sensitive data with less labor
• Electronic health records contain sensitive information about individual patients.
• Removing sensitive information allows
– easier data sharing between institutions, or between researchers and institutions
– while protecting patients from potential privacy violations
• but it requires costly manual annotation
3
Task: Finding and hiding sensitive expressions in text
• Patient names
• Hospital names
• Some of the disease names
• Relevant dates
5
Previous work
Supervised learning & heuristic patterns
• i2b2 challenge (Uzuner et al., 2007)
PPDP for unstructured data (Liu, 2012)
• Words or bags-of-words as data values
Requires large training data, available in limited languages
Not suited to free-form text
6
Frequency may inform sensitivity (to some extent)
[Figure: chances of being "sensitive" (PHI vs. non-PHI) for a token with a certain frequency, from the i2b2 deid training set. Frequency axis: ~10, ~100, ~1k, ~10k; proportion axis: 0 to 1.]
7
Simple word frequency is not enough
• Common words may form a rare sequence, denoting a highly specific entity
– “PINE STREET 10, SAN FRANCISCO, CA”
• Not trivial for morphologically complex languages/terminologies (using simple space delimiters may not be best)
8
Goal of this work
• Efficiently distinguishing frequent substrings from rare ones, as an alternative to word frequency
9
Method
10
Overview: two-step approach
1. Finding maximum repeats with at least k occurrences
2. Suppressing regions not covered by a non-overlapping, non-neighboring tiling of those repeats
11
Finding repeats: is there a span covered by no repeat?
a b r a c a d a b r a d a b r
12
Finding repeats: naively, examine all substrings (O(n^2))
a b r a c a d a b r a d a b r
[Figure: repeated substrings drawn under the text: abra, adabr, adab, ada, dab, abr, bra, ad, da, ab, br, ra, a, b, r, d]
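A minimal sketch of the naive enumeration, for illustration only. The function name `repeated_substrings` is hypothetical, and occurrences are assumed to be allowed to overlap. It materializes all O(n^2) substrings, so it is only practical for short strings.

```python
from collections import Counter


def repeated_substrings(text: str, k: int = 2) -> dict[str, int]:
    """Return every substring with at least k (possibly overlapping) occurrences."""
    # Enumerate all O(n^2) (start, end) pairs and count the substrings they spell.
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= k}


if __name__ == "__main__":
    repeats = repeated_substrings("abracadabradabr", k=3)
    # Longest first, then alphabetical.
    print(sorted(repeats, key=lambda s: (-len(s), s)))
    # prints ['abr', 'ab', 'br', 'a', 'b', 'r']
```

Raising k shrinks the candidate set quickly: with k=2 the same text still has repeats such as "adabr", but none of length four or more survives k=3.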
13
Finding repeats: only maximal ones are meaningful
a b r a c a d a b r a d a b r
[Figure: the repeated substrings under the text (abra, adabr, adab, ada, dab, abr, bra, ad, da, ab, br, ra, a, b, r, d), shown for k=2]
15
Finding repeats: higher threshold for “repeats”
a b r a c a d a b r a d a b r
[Figure: substrings shown under the text for k=3; labels recovered from the figure: abr, dab, ab, br, a, r]
16
Finding repeats
a b r a c a d a b r a d a b r
[Figure: substrings shown under the text for k=3; labels recovered from the figure: abr, dab, ab, br, a, r]
How can we avoid the O(n^2) enumeration and still find maximal repeats?
17
Maximum repeats (k=2) [Ilie+Smyth 2011]
T=abracadabradabr$
SA  LCP  T[SA[i]...]
15  0    $
12  0    abr$
0   3    abracadabradabr$
7   4    abradabr$
3   1    acadabradabr$
10  1    adabr$
5   5    adabradabr$
13  0    br$
1   2    bracadabradabr$
8   3    bradabr$
4   0    cadabradabr$
11  0    dabr$
6   4    dabradabr$
14  0    r$
2   1    racadabradabr$
9   2    radabr$
(1) construct the suffix array and longest-common-prefix array of a given text (linear time)
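A small sketch of step (1), under stated assumptions. The suffix array is built here with a plain comparison sort, which is not linear time (a linear-time construction would be swapped in for real data). The LCP array uses the linear-time scan by Kasai et al. The function names are illustrative, and `lcp[i]` is assumed to compare the suffixes at ranks i-1 and i.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes in lexicographic order (naive sort, not linear)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):  # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return lcp


if __name__ == "__main__":
    T = "abracadabradabr$"
    sa = suffix_array(T)
    for s, l in zip(sa, lcp_array(T, sa)):
        print(f"{s:2d} {l:2d} {T[s:]}")
```

The sentinel `$` sorts before the letters in ASCII, so the output rows follow the order SA, LCP, T[SA[i]...] used on the slide.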
18
Maximum repeats (k=2) [Ilie+Smyth 2011]
T=abracadabradabr$
[Figure: the SA / LCP / T[SA[i]...] table for T, next to an array over text positions 0-15 that is still empty]
(2) construct an array indicating repeating spans (linear time)
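The exact array built on the slides is not recoverable from this transcript, so the following is only a sketch of one way to mark repeating spans for k=2, not the Ilie and Smyth procedure itself. It relies on a standard property: the longest repeat starting at text position SA[i] has length max(LCP[i], LCP[i+1]). A single left-to-right sweep then finds the positions that no repeat covers. The function names are hypothetical, and the suffix sort is naive rather than linear.

```python
def longest_repeat_lengths(text: str) -> list[int]:
    """lengths[p] = length of the longest substring starting at p that occurs at least twice."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        h = 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        lcp[r] = h
    lengths = [0] * n
    for r, p in enumerate(sa):
        after = lcp[r + 1] if r + 1 < n else 0
        # A suffix shares its longest repeated prefix with one of its two neighbours.
        lengths[p] = max(lcp[r], after)
    return lengths


def uncovered_positions(text: str) -> list[int]:
    """Positions that are not inside any substring occurring at least twice."""
    reach = 0  # one past the rightmost position covered so far
    result = []
    for p, length in enumerate(longest_repeat_lengths(text)):
        reach = max(reach, p + length)
        if p >= reach:
            result.append(p)
    return result


if __name__ == "__main__":
    print(uncovered_positions("abracadabradabr"))  # prints [4]
```

For the running example the only uncovered position is 4, the single "c", which answers the question asked on slide 11.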
19-33
Maximum repeats (k=2) [Ilie+Smyth 2011]
T=abracadabradabr$
[Figure sequence: step-by-step updates of the repeating-span array over text positions 0-15, shown next to the SA, LCP and T[SA[i]...] columns; several steps are marked "No update".]
34
Maximum repeats (k=2) [Ilie+Smyth 2011]
T=abracadabradabr$
[Figure: the completed repeating-span array over text positions 0-15, together with the text a b r a c a d a b r a d a b r]
35
Maximum repeats (k=3, 4, ...) [proposed method]
k=3, T=abracadabrara$
[Figure sequence (slides 35-44): a table with columns SA, gLCP and T[SA[i]...] for T, next to an array over the text positions that starts empty and is filled in step by step.]
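The transcript does not define gLCP, so the sketch below is an assumption about how the k=2 idea can be generalized, not the authors' algorithm. It uses a standard suffix-array fact: a substring occurs at least k times exactly when it is a common prefix of k consecutive suffixes in sorted order, and that common prefix has length equal to the minimum of the k-1 LCP values inside the window. The name `longest_k_repeat_lengths` is hypothetical, and this version costs O(nk) after a naive sort, so it is not linear time.

```python
def longest_k_repeat_lengths(text: str, k: int) -> list[int]:
    """best[p] = length of the longest substring starting at p with at least k occurrences."""
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    lcp = [0] * n  # lcp[r] compares the suffixes at ranks r-1 and r
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        h = 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        lcp[r] = h
    best = [0] * n
    for s in range(n - k + 1):
        # Prefix shared by the k suffixes at ranks s .. s+k-1.
        shared = min(lcp[s + 1 : s + k])
        for r in range(s, s + k):
            best[sa[r]] = max(best[sa[r]], shared)
    return best


if __name__ == "__main__":
    print(longest_k_repeat_lengths("abracadabrara", k=3))
    # prints [1, 0, 2, 1, 0, 1, 0, 1, 0, 2, 1, 2, 1]
```

With k=3 only "a", "r" and "ra" occur often enough in this text, so the b, c and d positions get length 0 and would be left uncovered.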
45
Greedily tiling with the frequent-enough substrings
• Check the candidate repeats starting from the longest
– If a candidate neither overlaps nor neighbors any accepted repeat, accept it
48
Greedily tiling with the frequent-enough substrings
• Output the spans that are not covered by any accepted repeat
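The greedy tiling can be sketched as follows. The details are assumptions: spans are half-open (start, end) pairs, ties between equally long candidates are broken by start position, and "neighboring" is read as touching with no gap. The function names are hypothetical, and the acceptance check is a simple quadratic scan.

```python
Span = tuple[int, int]  # half-open interval [start, end)


def greedy_tiling(candidates: list[Span]) -> list[Span]:
    """Accept candidates longest-first if they keep a gap of at least one character."""
    accepted: list[Span] = []
    for start, end in sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0])):
        # Strict inequalities reject both overlapping and directly adjacent spans.
        if all(end < a_start or a_end < start for a_start, a_end in accepted):
            accepted.append((start, end))
    return sorted(accepted)


def uncovered_spans(n: int, tiles: list[Span]) -> list[Span]:
    """Spans of a length-n text left outside the sorted, disjoint tiles."""
    gaps: list[Span] = []
    pos = 0
    for start, end in tiles:
        if pos < start:
            gaps.append((pos, start))
        pos = end
    if pos < n:
        gaps.append((pos, n))
    return gaps


if __name__ == "__main__":
    text = "abracadabradabr"
    # Maximal repeats for k=2: abra, adabr, abra, adabr.
    candidates = [(0, 4), (5, 10), (7, 11), (10, 15)]
    tiles = greedy_tiling(candidates)
    print(tiles)                              # prints [(0, 4), (5, 10)]
    print(uncovered_spans(len(text), tiles))  # prints [(4, 5), (10, 15)]
```

In this run the second "adabr" at (10, 15) is rejected because it touches the accepted tile (5, 10), so it ends up suppressed even though it is a repeat.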
50
Experiments
51
Experimental setups
• The i2b2 de-identification dataset (DEID)
– 1.2 M bytes (227k words) [Uzuner et al., 2007]
– Gold standard annotated with PHI by experts
– To show the effectiveness in identifying sensitive information
• Japanese Wiktionary (WKT)
– 45 M bytes
– Contains long repetitive XML elements
– To show the scalability
52
Experimental setups: DEID
• Baseline (WORD)
– Word frequency (with different thresholds)
• Proposed method 1 (MR)
– Different k
– Each repeat contains at least 6 characters
– Words with more than 20% hidden are considered completely hidden
• Proposed method 2 (MR+WORD)
– Logical AND of WORD and MR on suppression
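A rough sketch of the word-level decisions in this setup. Several details are assumptions not stated on the slide: words are split on whitespace, the WORD baseline hides a word whose corpus count is below the threshold, and the character-level MR output arrives as a list of booleans. All function names are hypothetical.

```python
import re
from collections import Counter


def word_baseline(words: list[str], threshold: int) -> list[bool]:
    """WORD: hide a word if it occurs fewer than `threshold` times (assumed rule)."""
    counts = Counter(words)
    return [counts[w] < threshold for w in words]


def words_hidden_by_mr(text: str, hidden: list[bool], ratio: float = 0.2) -> list[bool]:
    """MR: a word counts as completely hidden if more than `ratio` of its characters are hidden."""
    flags = []
    for m in re.finditer(r"\S+", text):
        n_hidden = sum(hidden[m.start():m.end()])
        flags.append(n_hidden / (m.end() - m.start()) > ratio)
    return flags


def combine_and(mr: list[bool], word: list[bool]) -> list[bool]:
    """MR+WORD: hide a word only if both methods would hide it."""
    return [a and b for a, b in zip(mr, word)]


if __name__ == "__main__":
    text = "Mr Smith visited"
    hidden = [False] * len(text)
    hidden[3] = hidden[4] = True  # 2 of 5 characters of "Smith"
    hidden[9] = True              # 1 of 7 characters of "visited"
    mr = words_hidden_by_mr(text, hidden)
    word = word_baseline(text.split(), threshold=2)
    print(mr)                     # prints [False, True, False]
    print(combine_and(mr, word))  # prints [False, True, False]
```

Taking the logical AND can only reduce the set of hidden words, so MR+WORD trades some recall for precision compared with either method alone.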
53
DEID: recall-precision
54
Time: linear-time computation in practice (WKT)
55
Future work
• Combining with supervised methods
• More space efficiency
• Document (record) frequency
• Other applications
– Term extraction
– Unsupervised word segmentation
– Near-duplicate detection
56
Conclusions
• A suffix-array-based method to detect rare sequences in text
• As cheap as and more effective than simple word frequency for text de-identification
• Linear-time computation in theory and in practice
57
Future work (ongoing)
• Comparison against N-grams
– Preview: higher-order ones (N=8, 9, 10, ...) seem to be comparable, but with a higher computational cost
• Smarter optimization than greedy
– Preview: dynamic-programming optimization is possible, but gives only marginal improvement