K-repeating Substrings: a String-Algorithmic Approach to Privacy-Preserving Publishing of Textual...

Post on 25-Jul-2015


1

K-repeating substrings: a string-algorithmic approach to privacy-preserving publication of textual data

2014-12-14, PACLIC 28 (Phuket)

Yusuke Matsubara*, Kôiti Hasida*

*The University of Tokyo

2

Background: easier sharing of sensitive data with less labor

• Electronic health records contain sensitive information about individual patients.
• Removing sensitive information allows
– easier data sharing between institutions or between researchers and institutions
– while protecting patients from potential privacy violation
• but requires high-cost human labor to annotate manually

3

Task: Finding and hiding sensitive expressions from text

• Patient names
• Hospital names
• Some of the disease names
• Relevant dates


5

Previous work

• Supervised learning & heuristic patterns
– i2b2 challenge (Uzuner et al., 2007)
– Requires large training data, available in limited languages
• PPDP for unstructured data (Liu, 2012)
– Words or bags-of-words as data values
– Not fit to free-form text

6

Frequency may inform sensitivity (to some extent)

[Figure: chances of being "sensitive" for a token with a certain frequency, from the i2b2 deid training set. Series: PHI, Non PHI. Axes: token frequency (bins from ~10 to ~10k) and proportion (0 to 1).]

7

Simple word frequency is not enough

• Common words may form a rare sequence, denoting a highly specific entity
– “PINE STREET 10, SAN FRANCISCO, CA”
• Not trivial for morphologically complex languages/terminologies (using simple space delimiters may not be best)

8

Goal of this work

• Efficiently distinguishing frequent substrings from rare ones, as an alternative to word frequency

9

Method

10

Overview: two-step approach


1. Finding maximum repeats with at least k occurrences

2. Suppressing regions not covered by non-overlapping, non-neighboring tiling

11

Finding repeats: is there a span covered by no repeat?

a b r a c a d a b r a d a b r

12

Finding repeats: naively, examine all substrings (O(n^2))

a b r a c a d a b r a d a b r

Substrings shown: abra, adabr, abr, bra, ada, dab, ad, da, ab, br, ra, a, b, r, d, r, adab
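The naive enumeration can be sketched as follows. This is only an illustration of the quadratic baseline, not code from the slides; the function name `naive_repeats` is made up for this example.

```python
from collections import Counter


def naive_repeats(text, k=2):
    """Return {substring: count} for every substring occurring at least k times.

    All O(n^2) substrings are enumerated explicitly, which is exactly the
    cost the suffix-array method is meant to avoid.
    """
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= k}


repeats = naive_repeats("abracadabradabr", k=2)
print(len(repeats), "distinct substrings occur at least twice")
```

Raising `k` shrinks the result: with `k=3` only short substrings such as `abr` remain for this text.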

13

Finding repeats: only maximal ones are meaningful

a b r a c a d a b r a d a b r

Substrings shown: abra, adabr, abr, bra, ada, dab, ad, da, ab, br, ra, a, b, r, d, r, adab

k=2

15

Finding repeats: higher threshold for “repeats”

a b r a c a d a b r a d a b r

Substrings shown: abr, dab, ab, br, a, r, r

k=3

16

Finding repeats

a b r a c a d a b r a d a b r

Substrings shown: abr, dab, ab, br, a, r, r

How can we avoid the O(n^2) enumeration and still find maximal repeats?

17

Maximum repeats (k=2) [Ilie+Smyth 2011]

T=abracadabradabr$

 i  SA  LCP  T[SA[i]...]
 0  15   0   $
 1  12   0   abr$
 2   0   3   abracadabradabr$
 3   7   4   abradabr$
 4   3   1   acadabradabr$
 5  10   1   adabr$
 6   5   5   adabradabr$
 7  13   0   br$
 8   1   2   bracadabradabr$
 9   8   3   bradabr$
10   4   0   cadabradabr$
11  11   0   dabr$
12   6   4   dabradabr$
13  14   0   r$
14   2   1   racadabradabr$
15   9   2   radabr$

(1) construct the suffix array and longest-common-prefix array of a given text (linear time)
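Step (1) can be sketched as below. For brevity this sketch builds the suffix array by plain sorting, which is not linear time (linear-time constructions exist but are longer), and then derives the LCP array with Kasai's algorithm. The function names are illustrative, not taken from the authors' code.

```python
def suffix_array(text):
    """Suffix array by direct sorting of suffixes (simple, not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[r] is the length of the common prefix of the
    suffixes at ranks r-1 and r (lcp[0] is 0 by convention)."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return lcp


T = "abracadabradabr$"
SA = suffix_array(T)
LCP = lcp_array(T, SA)
for r, start in enumerate(SA):
    print(r, start, LCP[r], T[start:])
```

The terminator `$` sorts before the letters in ASCII, so plain string comparison gives the conventional suffix order.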

18

Maximum repeats (k=2) [Ilie+Smyth 2011]

T=abracadabradabr$

(2) construct an array indicating repeating spans (linear time)

[Animation over slides 18 to 34: the SA/LCP table is scanned one row at a time and an array indexed by text positions 0 to 15 is updated; several rows are marked "No update". The last slide shows the resulting array next to the text a b r a c a d a b r a d a b r.]
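One way to see what the SA and LCP arrays provide for k=2: the longest substring starting at a text position that occurs at least twice has length max(LCP[r], LCP[r+1]), where r is the rank of that position's suffix. The sketch below uses this property to mark covered positions. It is an illustration under that assumption, not a transcription of the Ilie and Smyth procedure, and the function names are invented here.

```python
def longest_repeat_lengths(text):
    """lengths[i] = length of the longest substring starting at i that
    occurs at least twice in text (0 if the character itself is unique)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # simple, not linear time

    def common_prefix(a, b):
        h = 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        return h

    lcp = [0] + [common_prefix(sa[r - 1], sa[r]) for r in range(1, n)]
    lengths = [0] * n
    for r, start in enumerate(sa):
        after = lcp[r + 1] if r + 1 < n else 0
        lengths[start] = max(lcp[r], after)  # best of the two sorted neighbours
    return lengths


def uncovered_positions(text):
    """Positions that no repeated substring covers."""
    covered = [False] * len(text)
    for start, length in enumerate(longest_repeat_lengths(text)):
        for p in range(start, start + length):
            covered[p] = True
    return [p for p, c in enumerate(covered) if not c]


print(uncovered_positions("abracadabradabr$"))
```

For the example text, the uncovered positions are the single `c` and the terminator, which answers the earlier question of whether some span is covered by no repeat.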

35

Maximum repeats (k=3, 4, ...) [proposed method]

k=3, T=abracadabra-ra$

 i  SA  gLCP  T[SA[i]...]
 0  14   0    $
 1  11   0    -ra$
 2  13   0    a$
 3  10   0    a-ra$
 4   7   1    abra-ra$
 5   0   1    abracadabra-ra$
 6   3   1    acadabra-ra$
 7   5   1    adabra-ra$
 8   8   0    bra-ra$
 9   1   0    bracadabra-ra$
10   4   0    cadabra-ra$
11   6   0    dabra-ra$
12  12   0    ra$
13   9   0    ra-ra$
14   2   2    racadabra-ra$
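The gLCP column above is consistent with taking, at each rank, the minimum of k-1 consecutive LCP values, i.e. the common prefix length of k suffixes that are adjacent in sorted order. The sketch below rests on that reading, which is inferred from the numbers on the slide rather than stated there. It also assumes the character at position 11 of the example text is a hyphen, and `k_repeat_table` is a hypothetical name.

```python
def k_repeat_table(text, k):
    """Return (sa, glcp, lengths) for k >= 2, where glcp[r] is the common
    prefix length of the k suffixes at ranks r-k+1 .. r, and lengths[i] is
    the longest substring starting at i that occurs at least k times."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # simple, not linear time

    def common_prefix(a, b):
        h = 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        return h

    lcp = [0] + [common_prefix(sa[r - 1], sa[r]) for r in range(1, n)]
    glcp = [0] * n
    for r in range(k - 1, n):
        # O(k) per rank here; a sliding-window minimum would make it linear
        glcp[r] = min(lcp[r - k + 2 : r + 1])
    lengths = [0] * n
    for r, start in enumerate(sa):
        # any window of k ranks containing r ends at a rank in r .. r+k-1
        lengths[start] = max(glcp[r : r + k])
    return sa, glcp, lengths


sa, glcp, lengths = k_repeat_table("abracadabra-ra$", 3)
print(glcp)
print(lengths)
```

With `k=2` the generalized array reduces to the ordinary LCP array, so the same function covers both cases.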

36

Maximum repeats (k=3, 4, ...) [proposed method]

k=3, T=abracadabra-ra$

[Animation over slides 36 to 44: the SA/gLCP table is scanned one row at a time and an array indexed by text positions 0 to 14 is updated.]

45

Greedily tiling with the frequent-enough substrings

• Check from the longest ones
– If it overlaps or neighbors with none of the accepted repeats, accept it


48

Greedily tiling with the frequent-enough substrings

• Output the spans that are not covered by any repeats
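The greedy tiling and the final suppression step can be sketched as below. Spans are half-open `(start, end)` pairs; the tie-breaking order among equally long spans and the mask character are assumptions of this example, and the function names are made up.

```python
def greedy_tiling(spans):
    """Accept spans from longest to shortest, skipping any span that
    overlaps or directly neighbors an already accepted one."""
    accepted = []
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # separated by at least one position from every accepted span
        if all(end < a_start or a_end < start for a_start, a_end in accepted):
            accepted.append((start, end))
    return sorted(accepted)


def suppress(text, accepted, mask="*"):
    """Replace every character not covered by an accepted span."""
    keep = [False] * len(text)
    for start, end in accepted:
        for p in range(start, end):
            keep[p] = True
    return "".join(ch if k else mask for ch, k in zip(text, keep))


candidates = [(0, 4), (5, 10), (10, 15), (11, 15), (7, 11)]
tiles = greedy_tiling(candidates)
print(tiles)
print(suppress("abracadabradabr", tiles))
```

Rejecting neighboring spans matters because two frequent substrings placed side by side can still form a rare sequence, so the character between accepted tiles is suppressed.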


50

Experiments

51

Experimental setups

• The i2b2 de-identification dataset (DEID)
– 1.2 M bytes (227k words) [Uzuner et al 2007]
– Gold standard annotated with PHI by experts
– To show the effectiveness in identifying sensitive information
• Japanese Wiktionary (WKT)
– 45 M bytes
– Contains long repetitive XML elements
– To show the scalability

52

Experimental setups: DEID

• Baselines (WORD)
– Word frequency (with different thresholds)
• Proposed method 1 (MR)
– Different k
– Each repeat contains at least 6 characters
– Words with more than 20% hidden are considered completely hidden
• Proposed method 2 (MR+WORD)
– Logical AND of WORD and MR on suppression
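A rough sketch of how the three settings could produce word-level decisions. Several details are assumptions not given on the slide: words are split on whitespace, the WORD baseline hides words whose corpus frequency is below a threshold, and the MR input is a per-character list of booleans marking suppressed characters. All function names are hypothetical.

```python
import re
from collections import Counter


def mr_hidden(text, suppressed, ratio=0.2):
    """Per word: True if more than `ratio` of its characters are suppressed."""
    flags = []
    for m in re.finditer(r"\S+", text):
        hidden = sum(suppressed[m.start():m.end()])
        flags.append(hidden / (m.end() - m.start()) > ratio)
    return flags


def word_hidden(text, threshold):
    """Per word: True if the word occurs fewer than `threshold` times."""
    words = re.findall(r"\S+", text)
    freq = Counter(words)
    return [freq[w] < threshold for w in words]


def mr_and_word(mr_flags, word_flags):
    """MR+WORD: hide a word only when both methods would hide it."""
    return [a and b for a, b in zip(mr_flags, word_flags)]


text = "the cat saw the dog"
suppressed = [False] * len(text)
for p in (0, 4, 5, 6):  # one character of "the", all of "cat"
    suppressed[p] = True
print(mr_and_word(mr_hidden(text, suppressed), word_hidden(text, threshold=2)))
```

In this toy input the first "the" is partly suppressed by MR but is frequent as a word, so the logical AND keeps it visible.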

53

DEID: recall-precision

54

Time: linear-time computation in practice (WKT)

55

Future work

• Combining with supervised methods
• More space efficiency
• Document (record) frequency
• Other applications
– Term extraction
– Unsupervised word segmentation
– Near-duplicate detection

56

Conclusions

• A suffix-array based method to detect rare sequences in text
• As cheap as and more effective than simple word frequency for text de-identification
• Linear-time computation in theory and in practice

57

Future work (ongoing)

• Comparison against N-grams
– Preview: higher-order ones (N=8, 9, 10, ...) seem to be comparable, but with a higher computational cost
• Smarter optimization than greedy
– Preview: dynamic-programming optimization is possible, but gives only marginal improvement

58

Source code

https://github.com/whym/growthring