K-repeating Substrings: a String-Algorithmic Approach to Privacy-Preserving Publishing of Textual Data

2014-12-14, PACLIC 28 (Phuket)
Yusuke Matsubara*, Kôiti Hasida*
*The University of Tokyo


Page 2

Background: easier sharing of sensitive data with less labor

• Electronic health records contain sensitive information about individual patients.
• Removing sensitive information
  – allows easier data sharing between institutions, or between researchers and institutions,
  – while protecting patients from potential privacy violations,
  – but requires costly manual annotation.

Pages 3-4

Task: Finding and hiding sensitive expressions in text

• Patient names
• Hospital names
• Some of the disease names
• Relevant dates

Page 5

Previous work

• Supervised learning & heuristic patterns
  – i2b2 challenge (Uzuner et al., 2007)
  – Requires large training data, available in limited languages
• PPDP for unstructured data (Liu, 2012)
  – Words or bags-of-words as data values
  – Not fit for free-form text

Page 6

Frequency may inform sensitivity (to some extent)

[Figure: chances of being "sensitive" for a token with a certain frequency, from the i2b2 deid training set. Frequency bins: ~10, ~100, ~1k, ~10k; proportion axis from 0 to 1; series: PHI, Non PHI.]

Page 7

Simple word frequency is not enough

• Common words may form a rare sequence, denoting a highly specific entity
  – “PINE STREET 10, SAN FRANCISCO, CA”
• Word frequency is not trivial to compute for morphologically complex languages/terminologies (splitting on spaces may not be the best choice)

Page 8

Goal of this work

• Efficiently distinguishing frequent substrings from rare ones, as an alternative to word frequency

Page 9

Method

Page 10

Overview: two-step approach

[Figure: schematic lines of text]

1. Finding maximum repeats with at least k occurrences

2. Suppressing regions not covered by a non-overlapping, non-neighboring tiling

Page 11

Finding repeats: is there a span covered by no repeat?

a b r a c a d a b r a d a b r

Page 12

Finding repeats: naively, examine all substrings (O(n^2))

a b r a c a d a b r a d a b r

[Figure: repeated substrings drawn against the text: adabr, abra, adab, abr, bra, ada, dab, ad, da, ab, br, ra, a, b, r, d]
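The naive enumeration can be sketched in a few lines of Python. This is only an illustration of the brute-force baseline, not code from the talk; the function name `naive_repeats` is made up for this example. It counts every substring by its start position (so overlapping occurrences count) and keeps those seen at least k times.

```python
from collections import Counter


def naive_repeats(text: str, k: int) -> dict[str, int]:
    """Count every substring of `text`; keep those occurring at least k times."""
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= k}


repeats_k2 = naive_repeats("abracadabradabr", 2)
repeats_k3 = naive_repeats("abracadabradabr", 3)
print(sorted(repeats_k3, key=len, reverse=True))
```

There are n(n+1)/2 substrings to visit, so the work grows at least quadratically with the text length, which is why a suffix-array method is attractive for large corpora.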

Pages 13-14

Finding repeats: only maximal ones are meaningful

a b r a c a d a b r a d a b r

[Figure: the same repeated substrings as on page 12; page 14 adds the label k=2]

Page 15

Finding repeats: higher threshold for “repeats”

a b r a c a d a b r a d a b r

[Figure: substrings shown with the label k=3: abr, dab, ab, br, a, r]

Page 16

Finding repeats

a b r a c a d a b r a d a b r

[Figure: the same substrings as on page 15]

How can we avoid the O(n^2) enumeration and still find maximal repeats?

Page 17

Maximum repeats (k=2) [Ilie+Smyth 2011]

T=abracadabradabr$

SA  LCP  T[SA[i]...]
15    0  $
12    0  abr$
 0    3  abracadabradabr$
 7    4  abradabr$
 3    1  acadabradabr$
10    1  adabr$
 5    5  adabradabr$
13    0  br$
 1    2  bracadabradabr$
 8    3  bradabr$
 4    0  cadabradabr$
11    0  dabr$
 6    4  dabradabr$
14    0  r$
 2    1  racadabradabr$
 9    2  radabr$

(1) construct the suffix array and longest-common-prefix array of a given text (linear time)
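As a minimal sketch of step (1), the suffix array and LCP array for the example text can be computed as follows. The suffix array here is built by plain sorting, which is simple but not linear time (linear-time constructions exist and are what the slide refers to); the LCP array uses Kasai's algorithm, which is linear given the suffix array. The function names are chosen for this example only.

```python
def suffix_array(text: str) -> list[int]:
    """Suffix array by plain sorting of suffixes: simple, but not linear time."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm: lcp[i] is the common prefix length of the
    suffixes starting at sa[i-1] and sa[i]; lcp[0] is 0."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


text = "abracadabradabr$"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
for start, length in zip(sa, lcp):
    print(f"{start:2d} {length:2d}  {text[start:]}")
```

The sentinel `$` sorts before the letters in ASCII, so no special handling is needed for this example.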

Pages 18-33

Maximum repeats (k=2) [Ilie+Smyth 2011]

(2) construct an array indicating repeating spans (linear time)

[Animation over the SA/LCP table for T=abracadabradabr$: an array with one slot per text position (0-15) starts empty and is filled in frame by frame; several frames are labeled "No update". The intermediate array contents are not recoverable from this transcript.]

Page 34

Maximum repeats (k=2) [Ilie+Smyth 2011]

T=abracadabradabr$

a b r a c a d a b r a d a b r

[Figure: the resulting array shown against the text]
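One simple way to see which positions are covered by some repeat (k=2) is sketched below. It is not necessarily the array construction of Ilie and Smyth shown in the animation; it relies only on the standard fact that the longest repeated substring starting at a position is the larger of the two LCP values its suffix shares with its neighbors in the suffix array. All function names are hypothetical, and the LCP values are computed naively for brevity.

```python
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_repeat_at(text: str) -> list[int]:
    """For each start position, the length of the longest substring
    starting there that occurs at least twice in `text`."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    lcp = [0] + [common_prefix_len(text[sa[i - 1]:], text[sa[i]:])
                 for i in range(1, n)]
    lengths = [0] * n
    for r, start in enumerate(sa):
        right = lcp[r + 1] if r + 1 < n else 0
        lengths[start] = max(lcp[r], right)
    return lengths


def uncovered_positions(text: str) -> list[int]:
    """Positions that lie inside no substring occurring at least twice."""
    covered_until = 0  # exclusive end of the furthest repeat seen so far
    result = []
    for i, length in enumerate(longest_repeat_at(text)):
        covered_until = max(covered_until, i + length)
        if i >= covered_until:
            result.append(i)
    return result


text = "abracadabradabr$"
lengths = longest_repeat_at(text)
gaps = uncovered_positions(text)
print(gaps)
```

For the example text, the only uncovered positions are the single `c` at position 4 and the sentinel at position 15.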

Page 35

Maximum repeats (k=3, 4, ...) [proposed method]

k=3 T=abracadabra-ra$

SA  gLCP  T[SA[i]...]
14     0  $
11     0  -ra$
13     0  a$
10     0  a-ra$
 7     1  abra-ra$
 0     1  abracadabra-ra$
 3     1  acadabra-ra$
 5     1  adabra-ra$
 8     0  bra-ra$
 1     0  bracadabra-ra$
 4     0  cadabra-ra$
 6     0  dabra-ra$
12     0  ra$
 9     0  ra-ra$
 2     2  racadabra-ra$
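The gLCP values on this slide are consistent with reading gLCP[i] as the length of the prefix shared by k consecutive suffixes ending at row i of the suffix array, that is, the minimum of the last k-1 ordinary LCP values. The sketch below works under that assumption (the slides do not spell out the definition), writes the separator character in the example text as `-`, and uses hypothetical function names.

```python
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def generalized_lcp(text: str, k: int) -> tuple[list[int], list[int]]:
    """Return (sa, glcp), where glcp[i] is the length of the prefix shared
    by the k suffixes at rows i-k+1 .. i of the suffix array (0 if i < k-1)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    lcp = [0] + [common_prefix_len(text[sa[i - 1]:], text[sa[i]:])
                 for i in range(1, n)]
    glcp = [0] * n
    for i in range(k - 1, n):
        # k consecutive suffixes are linked by k-1 adjacent LCP values
        glcp[i] = min(lcp[i - k + 2:i + 1])
    return sa, glcp


sa, glcp = generalized_lcp("abracadabra-ra$", 3)
for start, length in zip(sa, glcp):
    print(f"{start:2d} {length:2d}")
```

With k=2 the window shrinks to a single value and gLCP equals the ordinary LCP array. The slice-based minimum costs O(nk); a sliding-window minimum would keep the pass linear in n.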

Pages 36-44

Maximum repeats (k=3, 4, ...) [proposed method]

[Animation over the SA/gLCP table for k=3, T=abracadabra-ra$: an array with one slot per text position (0-14) starts empty and is filled in frame by frame. The intermediate array contents are not recoverable from this transcript.]

Pages 45-47

Greedily tiling with the frequent-enough substrings

• Check from the longest ones
  – If it neither overlaps nor neighbors any accepted repeat, accept it

[Figure: schematic lines of text]

Pages 48-49

Greedily tiling with the frequent-enough substrings

• Output the spans that are not covered by any repeat

[Figure: schematic line of text]
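The greedy tiling and the final output step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: spans are half-open `(start, end)` character ranges, "neighboring" is taken to mean that two spans touch with no gap between them, ties in length are broken by start position, and the candidate spans in the example are picked by hand for the text `abracadabradabr`.

```python
Span = tuple[int, int]  # half-open character range [start, end)


def greedy_tiling(candidates: list[Span]) -> list[Span]:
    """Accept candidates from longest to shortest, skipping any that
    overlaps or touches an already accepted span."""
    accepted: list[Span] = []
    by_length = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    for start, end in by_length:
        # a gap of at least one character is required on both sides
        if all(end < a_start or a_end < start for a_start, a_end in accepted):
            accepted.append((start, end))
    return sorted(accepted)


def uncovered_spans(length: int, tiles: list[Span]) -> list[Span]:
    """Spans of [0, length) not covered by any tile; tiles must be sorted."""
    gaps: list[Span] = []
    pos = 0
    for start, end in tiles:
        if pos < start:
            gaps.append((pos, start))
        pos = end
    if pos < length:
        gaps.append((pos, length))
    return gaps


# hand-picked occurrences of "adabr", "abra" and "abr" in "abracadabradabr"
candidates = [(5, 10), (10, 15), (0, 4), (7, 11), (12, 15)]
tiles = greedy_tiling(candidates)
to_suppress = uncovered_spans(15, tiles)
print(tiles, to_suppress)
```

In this example the second `adabr` is rejected because it touches the first one, so the characters at positions 10-11 end up suppressed along with the `c` at position 4.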

Page 50

Experiments

Page 51

Experimental setups

• The i2b2 de-identification dataset (DEID)
  – 1.2 M bytes (227k words) [Uzuner et al. 2007]
  – Gold standard annotated with PHI by experts
  – To show the effectiveness in identifying sensitive information
• Japanese Wiktionary (WKT)
  – 45 M bytes
  – Contains long repetitive XML elements
  – To show the scalability

Page 52

Experimental setups: DEID

• Baselines (WORD)
  – Word frequency (with different thresholds)
• Proposed method 1 (MR)
  – Different k
  – Each repeat contains at least 6 characters
  – Words with more than 20% hidden are considered completely hidden
• Proposed method 2 (MR+WORD)
  – Logical AND of WORD and MR on suppression
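The word-level decisions in this setup can be sketched as below. Several details are assumptions not stated on the slide: words are split on whitespace, the WORD baseline hides words whose count falls below a threshold, and the character-level mask `hidden_chars` stands in for the output of the MR step. All names are hypothetical.

```python
import re
from collections import Counter


def word_spans(text: str) -> list[tuple[int, int]]:
    """Character ranges of whitespace-delimited words."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def mr_decisions(text: str, hidden_chars: list[bool], ratio: float = 0.2) -> list[bool]:
    """A word counts as hidden if more than `ratio` of its characters are hidden."""
    return [sum(hidden_chars[s:e]) / (e - s) > ratio for s, e in word_spans(text)]


def word_decisions(text: str, min_count: int) -> list[bool]:
    """Assumed WORD baseline: hide words seen fewer than `min_count` times."""
    words = [text[s:e] for s, e in word_spans(text)]
    counts = Counter(words)
    return [counts[w] < min_count for w in words]


def mr_and_word(mr: list[bool], word: list[bool]) -> list[bool]:
    """MR+WORD: suppress a word only if both methods suppress it."""
    return [a and b for a, b in zip(mr, word)]


text = "the pain the john the pain"
hidden_chars = [False] * len(text)
hidden_chars[13] = True  # first character of "john"
hidden_chars[22] = True  # first character of the second "pain"

mr = mr_decisions(text, hidden_chars)
word = word_decisions(text, min_count=2)
both = mr_and_word(mr, word)
print(both)
```

Here MR alone would hide both `john` and the second `pain`, while the AND with word frequency keeps only `john` hidden.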

Page 53

DEID: recall-precision

Page 54

Time: linear-time computation in practice (WKT)

Page 55

Future work

• Combining with supervised methods
• More space efficiency
• Document (record) frequency
• Other applications
  – Term extraction
  – Unsupervised word segmentation
  – Near-duplicate detection

Page 56

Conclusions

• A suffix-array-based method to detect rare sequences in text
• As cheap as and more effective than simple word frequency for text de-identification
• Linear-time computation in theory and in practice

Page 57

Future work (ongoing)

• Comparison against N-grams
  – Preview: higher-order ones (N=8, 9, 10, ...) seem to be comparable, but with a higher computational cost
• Smarter optimization than greedy
  – Preview: dynamic-programming optimization is possible, but gives only marginal improvement

Page 58

Source code

https://github.com/whym/growthring