ASAP: Eliminating algorithm-based disclosure in privacy-preserving data publishing

Xin Jin a, Nan Zhang a,1,*, Gautam Das b,2

a Department of Computer Science, George Washington University, 20052, United States
b Department of Computer Science and Engineering, University of Texas at Arlington, 76019, United States

Information Systems 36 (2011) 859–880. doi:10.1016/j.is.2011.03.001. © 2011 Elsevier B.V. All rights reserved.

Article history: Received 25 August 2010; received in revised form 10 January 2011; accepted 8 March 2011; available online 15 March 2011. Recommended by L. Wong.

Keywords: Privacy preservation; Data publishing; Algorithm-based disclosure; Algorithm-SAfe Publishing

Abstract

Numerous privacy-preserving data publishing algorithms have been proposed to achieve privacy guarantees such as ℓ-diversity. Many of them, however, were recently found to be vulnerable to algorithm-based disclosure, i.e., privacy leakage incurred by an adversary who is aware of the privacy-preserving algorithm being used. This paper describes generic techniques for correcting the design of existing privacy-preserving data publishing algorithms to eliminate algorithm-based disclosure. We first show that algorithm-based disclosure is more prevalent and serious than previously understood. Then, we formally define Algorithm-SAfe Publishing (ASAP) to capture and eliminate threats from algorithm-based disclosure. To correct the problems of existing data publishing algorithms, we propose two generic tools to be integrated into their design: global look-ahead and local look-ahead. To enhance data utility, we propose another generic tool called stratified pick-up. We demonstrate the effectiveness of our tools by applying them to several popular ℓ-diversity algorithms: Mondrian, Hilb, and MASK. We conduct extensive experiments to demonstrate the effectiveness of our tools in terms of data utility and efficiency.

* Corresponding author. Tel.: +1 202 994 5919; fax: +1 202 994 4875.
E-mail addresses: [email protected] (X. Jin), [email protected] (N. Zhang), [email protected] (G. Das).
1 Partially supported by NSF grants 0852673, 0852674, 0845644 and 0915834 and a GWU Research Enhancement Fund.
2 Partially supported by NSF grants 0845644, 0812601 and 0915834 and grants from Microsoft Research and Nokia Research.

1. Introduction

1.1. Privacy-preserving data publishing

Many organizations, such as hospitals, need to publish microdata with personal information, such as medical records, to facilitate research and serve public interests. Nonetheless, such publication may raise privacy concerns for the individual owners of the tuples being published (e.g., patients). To address this challenge, privacy-preserving data publishing (PPDP) was proposed to generate the published table in a way that enables analytical tasks (e.g., aggregate query answering, data mining) over the published data while protecting the privacy of individual data owners.

In general, a microdata table (denoted by T) can contain three types of attributes: (1) personal identifiable attributes (e.g., SSN), each of which is an explicit unique identifier of an individual; (2) quasi-identifier (QI) attributes (e.g., Age, Sex, Country), which are not explicit identifiers but, when combined together, can be empirically unique for each individual; and (3) sensitive attributes (SA) (e.g., Disease), each of which contains a sensitive value (set) that must be protected. In privacy-preserving data publishing, personal identifiable attributes are usually removed prior to publishing. QI and/or SA attributes are perturbed to achieve a pre-defined privacy model while maximizing the utility of the published data.

Samarati and Sweeney [1] first defined a privacy model, k-anonymity, for PPDP. It requires each tuple in the published table (denoted by T*) to have at least k−1 other QI-indistinguishable tuples, i.e., tuples with the same QI attribute values. To protect individual SA information, Machanavajjhala et al. [2] introduced another



privacy model, ℓ-diversity, which further requires each group of QI-indistinguishable tuples to have diverse SA values. Variations of ℓ-diversity include (α,k)-anonymity [3], t-closeness [4], (k,e)-anonymity [5], m-invariance [6], etc. To satisfy these privacy models, numerous PPDP algorithms have been proposed [5,7–11].
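As an illustration of the k-anonymity requirement just described, the check can be performed by grouping published tuples on their QI values and verifying every group's size. The sketch below is our own illustration (not code from the paper); the table encoding and function name are assumptions.

```python
from collections import Counter

def is_k_anonymous(published, qi_attrs, k):
    """k-anonymity: every QI value combination occurring in the published
    table occurs in at least k tuples, i.e., each tuple has at least
    k-1 QI-indistinguishable peers."""
    sizes = Counter(tuple(t[a] for a in qi_attrs) for t in published)
    return all(n >= k for n in sizes.values())

# A generalized table with two QI-groups of size 4 (cf. Table 2b later).
t_star = [{"Age": "[32-49]", "Sex": "F", "Country": "Mexico"}] * 4 + \
         [{"Age": "[24-38]", "Sex": "*", "Country": "Japan"}] * 4
print(is_k_anonymous(t_star, ["Age", "Sex", "Country"], 4))  # True
```

The same table fails the check for k = 5, since both groups have exactly four tuples.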

1.2. Algorithm-based disclosure

It was traditionally believed that, to determine whether a privacy model is properly satisfied, one only needs to look at the published table, i.e., the output of a data publishing algorithm, without investigating the algorithm itself. The recently discovered algorithm-based disclosure [12] contradicts this traditional belief, as it demonstrates that privacy disclosure can be incurred by the design of a data publishing algorithm. In particular, if a privacy-preserving algorithm is vulnerable to algorithm-based disclosure, then once an adversary learns the design of the algorithm, s/he may utilize this knowledge to reverse-engineer the published table and compromise additional private information. We shall discuss the details in Section 2.

Algorithm-based disclosure poses a significant threat to the privacy of published data, because the data publishing algorithm is usually considered public and may be learned by an adversary. One might argue that, given the large number of public algorithms available for PPDP, it is difficult for an adversary to precisely identify which algorithm has been used and thereby to launch the algorithm-based attack. This is a typical "security through obscurity" argument, which counts on the secrecy of an algorithm to ensure the security of its output. However, such arguments have been repeatedly rebutted and abandoned in the security and cryptography literature. As Kerckhoffs' principle [13] in cryptography states, "The cipher method must not be required to be secret, and it must be able to fall into the hands of the enemy without inconvenience." Similarly, we argue that, to design an effective algorithm for privacy-preserving data publishing, one must eliminate algorithm-based disclosure.

1.3. Existing work and limitations

Wong et al. [12] demonstrated the first known case of algorithm-based disclosure by showing that the minimality principle used by many existing algorithms, i.e., to perturb QI to the minimum degree possible for satisfying the privacy model, may lead to the disclosure of private SA information when the adversaries have the original QI as external knowledge. An example of this disclosure will be described in Section 2. To counteract this attack, Wong et al. proposed a new privacy model called m-confidentiality [12], which guarantees that even an adversary with knowledge of QI cannot have confidence of more than 1/m in the SA value of an individual tuple. This attack was also studied in [14], with a new privacy model, p-safety, proposed as a countermeasure.

The new privacy models studied in the existing work, i.e., m-confidentiality [12] and p-safety [14], are by definition safe against (at least certain types of) algorithm-based disclosure. In addition, some recently proposed privacy models such as differential privacy [15] are also by definition immune from algorithm-based disclosure. While defining these new privacy models and developing their corresponding new algorithms provides a clean-slate solution for eliminating algorithm-based disclosure, limiting the investigation of algorithm-based disclosure to this realm has a number of problems.

First, the state of the art in PPDP calls for a proper understanding of the scope of algorithm-based disclosure for the existing data publishing algorithms. Currently, unless a data publishing algorithm is designed for an inherently algorithm-disclosure-safe privacy model such as differential privacy, it is unclear how to determine whether the algorithm is vulnerable to algorithm-based disclosure. Meanwhile, there are considerable ongoing efforts [16,17] on developing data publishing algorithms for popular privacy models such as ℓ-diversity which do not provide such a definition-inherent guarantee against algorithm-based disclosure. To enable the safe deployment of these algorithms in practice, it is important to understand whether and how algorithm-based disclosure may occur for a given data publishing algorithm.

Furthermore, the wide prevalence of data publishing algorithms calls for a generic method to revise the design of an existing algorithm to eliminate algorithm-based disclosure. In the literature, for popular privacy models such as ℓ-diversity, there have been not only a myriad of algorithms for publishing tabular data, but also numerous others that publish application-specific data such as location [18], social network [19], and transaction information [20]. Instead of re-inventing algorithms for all these applications, we argue that a more cost-effective way is to develop a generic method that eliminates algorithm-based disclosure from the existing algorithms.

1.4. Outline of technical results

In this paper, we attack the problem of algorithm-based disclosure from a novel algorithmic angle. In particular, we first illustrate the challenge of identifying algorithm-based disclosure by demonstrating that the space of such disclosure is substantially larger than previously recognized. Then, we provide a testing tool to determine whether a given data publishing algorithm is subject to algorithm-based disclosure. Finally, we develop two tools, global look-ahead and local look-ahead, to revise the design of existing data publishing algorithms to eliminate algorithm-based disclosure. To recover the utility loss incurred by applying these tools, we develop stratified pick-up, another tool to retain a high level of utility for the published table.

Our detailed results can be stated as follows.

First, we find that the space of algorithm-based disclosure is much broader than previously discovered. While the previous work identifies algorithm-based disclosure when an adversary holds external knowledge about the QI attributes, we find that other forms of external knowledge, such as the distribution of SA values and/or certain negative association rules [21], can also give rise to algorithm-based disclosure. Our further investigation even eliminates the dependency of algorithm-based disclosure on external


knowledge. That is, we find algorithm-based disclosure can happen even when the adversary holds no external knowledge about the published data. To this end, we find that MASK [12], originally proposed to eliminate the previously discovered algorithm-based disclosure, actually suffers from another type of algorithm-based disclosure we discover in this paper.

Second, we propose a testing tool for checking whether a given data publishing algorithm is vulnerable to algorithm-based disclosure. To do so, we first introduce Algorithm-SAfe data Publishing (ASAP), a model that formally defines algorithm-based disclosure as the difference between two random worlds: a naive one, where every possible mapping between an original table and the published table is equally likely unless the mapping violates an adversary's external knowledge, and a smart one, where the mapping must also follow the data publishing algorithm. An algorithm satisfies ASAP iff it always maintains equivalence between these two worlds.

To identify the vulnerability of existing data publishing algorithms, we derive two necessary conditions of ASAP. To judge immunity against algorithm-based disclosure, we derive two sufficient conditions for ASAP. The main idea is to prevent any unpublished QI–SA correlation from being used in generating the published table. The combination of these necessary and sufficient conditions forms our tool for checking whether a given data publishing algorithm is vulnerable to algorithm-based disclosure.

Third, we develop two tools, global look-ahead and local look-ahead, for revising the design of existing algorithms to follow ASAP. They are designed to amend the most common violations of ASAP found in existing algorithms in terms of their QI and SA perturbation strategies, respectively. To demonstrate the effectiveness of our tools, we first apply global look-ahead to revise Mondrian [8] and Hilb [11], two well-known data publishing algorithms designed to achieve ℓ-diversity. Then, we apply local look-ahead to MASK [12]. We prove that all revised algorithms satisfy ASAP.

Fourth, we devise another tool, stratified pick-up, to improve the utility of published data without violating ASAP. The idea of stratified pick-up is to use an Anatomy-like [10] technique to minimize the number of tuples in each published QI-group (i.e., set of QI-indistinguishable tuples). To demonstrate its effectiveness, we apply stratified pick-up on top of the output from algorithms altered by our first two tools, and show that they provide almost equal or even better utility than the corresponding original algorithms.

Our contribution also includes a comprehensive set of experiments on real-world datasets. First, we measure the magnitude of algorithm-based disclosure for MASK on Adult, a popular benchmark dataset for privacy-preserving data publishing. Also, we test the extent of algorithm-based disclosure for the state-of-the-art ℓ-diversity algorithm Hilb. Then, we evaluate the effectiveness of our tools by comparing the utility of our ASAP-compliant algorithms (i.e., Mondrian++, Hilb++, MASK++) against their original counterparts on Census, another large benchmark dataset. Experimental results show that, while eliminating algorithm-based disclosure, our ASAP algorithms remain efficient and achieve almost equal or even (sometimes significantly) better utility than the existing algorithms.

The rest of the paper is organized as follows. Section 2 describes two motivating examples of algorithm-based disclosure. Section 3 introduces preliminaries and notations used in the paper. Section 4 formally defines ASAP. Section 5 derives two necessary as well as two sufficient conditions for ASAP, and verifies the vulnerability of existing algorithms. We develop two generic tools in Section 6 to correct the design of existing algorithms to eliminate algorithm-based disclosure, and develop another tool in Section 7 to enhance utility. We conduct experiments in Section 8, review the related work in Section 9, and conclude in Section 10.

2. Motivating examples

This section describes two motivating examples of algorithm-based disclosure. We consider two adversaries, "naive" Nash and "smart" Sam, throughout the paper. Both of them hold the same external knowledge and observe the same published table. The only difference is that "naive" Nash does not know the data publishing algorithm, whereas "smart" Sam does. Both Nash and Sam want to find out whether their friend Tom, a 37-year-old male from Japan, has AIDS.

For ease of discussion, we follow the same SA settings as previous work [12,22], i.e., the infectious disease {AIDS} is sensitive, while the non-infectious diseases {cancer, diabetes, gastritis, heart disease} are non-sensitive.

2.1. Example 1: disclosure of ℓ-diversity algorithms based on QI generalization

Consider a generalization-based algorithm (e.g., [2]) which achieves ℓ-diversity. Table 1a depicts a microdata table with one QI attribute, Sex, and one SA, Disease. Table 1b is a 2-diversity version of Table 1a, such that the proportion of any sensitive SA value in one QI-group is at most 1/ℓ = 1/2.

First, let us review the case of algorithm-based disclosure discussed in [12], where both "naive" Nash and "smart" Sam know the original QI (Table 1c) through external knowledge. What Nash can do is to join Table 1c with the published Table 1b to infer that Tom belongs to the "*"-group. Thus, from Nash's view, the probability of Tom having AIDS is 1/2, which does not violate 2-diversity.

We now consider "smart" Sam, who knows that the generalization algorithm will not generalize any group unless it violates 2-diversity. Based on this, Sam can infer that no generalization would have been conducted if the two males had 0 or 1 AIDS tuples. Therefore, both males, including Tom, must have AIDS. Hence, by leveraging the algorithm-based knowledge, "smart" Sam acquires a different view from "naive" Nash, and Sam's view violates 2-diversity. This is an example of algorithm-based disclosure.

Now, we show the limitation of [12] by demonstrating that algorithm-based disclosure may occur without involving any external knowledge. Note that when "naive" Nash holds no external knowledge, his view of Tom's SA is the same as what the published table discloses, which by definition satisfies 2-diversity.


Table 1
An example of algorithm-based disclosure in an ℓ-diversity algorithm.

(a) Microdata

ROW#  SEX  DISEASE
1     F    Gastritis
2     F    Heart disease
3     F    Cancer
4     F    Diabetes
5     M    AIDS
6     M    AIDS

(b) 2-Diversity table

SEX  DISEASE
F    Gastritis
F    Heart disease
*    Cancer
*    Diabetes
*    AIDS
*    AIDS

(c) External knowledge

NAME   SEX
Amy    F
Eva    F
Grace  F
Helen  F
Jack   M
Tom    M

(d) (1M, 1 AIDS)

SEX  DISEASE
F    Gastritis
F    Heart disease
F    Cancer
F    AIDS
*    Diabetes
*    AIDS

(e) (3M, 2 AIDS)

SEX  DISEASE
F    Gastritis
F    Heart disease
*    Cancer
*    AIDS
M    Diabetes
M    AIDS

(f) (2M, 0 AIDS)

SEX  DISEASE
F    Gastritis
F    Heart disease
F    AIDS
F    AIDS
M    Cancer
M    Diabetes

(g) (2M, 1 AIDS)

SEX  DISEASE
F    Gastritis
F    Heart disease
F    Cancer
F    AIDS
M    Diabetes
M    AIDS

3 For the ease of illustration, we assume here that "smart" Sam has a uniform prior. Note, however, that such an assumption by no means restricts the generality of our discussion, as other distributions would work as well.


Consider the view of "smart" Sam. He can reason as follows: (1) the number of males in the table should be less than 4 but greater than 0, because otherwise no generalization would be needed; (2) if there were only one male, Table 1b would not be published, because the algorithm would prefer an alternative FFFF** (i.e., Table 1d) to attain better data utility; (3) if there were three males, Table 1b would again not be published, because of another alternative FF**MM (i.e., Table 1e) with better utility. Apparently, there is only one option left: two males in the table. If none or only one of them had AIDS, no generalization would be needed (i.e., Tables 1f and g). Thus, both males, including Tom, must have AIDS. One can see that the above deduction is solely enabled by Sam's knowledge of the algorithm and violates the requirement of 2-diversity. Thus, algorithm-based disclosure may occur without any external knowledge beyond the anonymization algorithm.
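Sam's elimination argument can be replayed mechanically. The sketch below is our own illustration, not code from the paper: it brute-forces every sex assignment for the six rows of Table 1a, models the minimality principle as "suppress as few Sex values as needed for 2-diversity", and keeps only the worlds whose minimal output matches Table 1b. The row ordering, helper names, and this simplified minimality model are assumptions.

```python
from itertools import combinations, product

DISEASES = ["gastritis", "heart disease", "cancer", "diabetes", "AIDS", "AIDS"]
SENSITIVE = {"AIDS"}

def ok(group):
    """A (possibly empty) SA group satisfies 2-diversity iff no sensitive
    value makes up more than half of the group."""
    return all(group.count(s) <= len(group) / 2 for s in SENSITIVE) if group else True

def valid(sexes, stars):
    """Check 2-diversity of the F-group, M-group and suppressed (*) group."""
    f = [DISEASES[i] for i in range(6) if i not in stars and sexes[i] == "F"]
    m = [DISEASES[i] for i in range(6) if i not in stars and sexes[i] == "M"]
    star = [DISEASES[i] for i in stars]
    return ok(f) and ok(m) and ok(star)

def min_star_sets(sexes):
    """All minimum-size sets of rows whose Sex must be suppressed."""
    for size in range(7):
        found = [set(c) for c in combinations(range(6), size) if valid(sexes, set(c))]
        if found:
            return found
    return []

# Published Table 1b: rows 1-2 shown as F, rows 3-6 suppressed (0-indexed: 2-5).
PUBLISHED_STARS = {2, 3, 4, 5}

surviving = [sexes for sexes in product("FM", repeat=6)
             if sexes[0] == sexes[1] == "F"
             and PUBLISHED_STARS in min_star_sets(sexes)]
print(surviving)  # only the world where the two AIDS rows are the males
```

Only one world survives: the two males are exactly the two AIDS tuples, reproducing Sam's conclusion.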

2.2. Example 2: disclosure of the MASK algorithm based on SA perturbation

MASK [12] was the first attempt to eliminate algorithm-based disclosure. It aims to achieve m-confidentiality, which (when ℓ = m) maintains ℓ-diversity even if an adversary has the original QI as external knowledge.

Consider the microdata table in Table 2a. Tables 2b and c depict an example of using MASK to achieve 2-confidentiality. MASK first applies k-anonymization (k ≥ m) to the microdata table (e.g., 4-anonymity in Table 2b). Then, for each group violating ℓ-diversity (e.g., the "Japan" group), MASK randomly perturbs the sensitive SA values (e.g., AIDS) to non-sensitive values (e.g., cancer, heart disease) until the proportion of sensitive SA values is decreased to p, where p is the proportion of sensitive SA values in a randomly selected ℓ-diversity group (e.g., p = 1/4 in the "Mexico" group).

We now show the existence of algorithm-based disclosure in Table 2c when an adversary knows a negative association rule from common sense, say, "Japanese have an extremely low incidence of heart disease [2,12]". Consider the view of "naive" Nash. He can conclude from Table 2c that Tom is in the "Japan" group, and heart disease must be a perturbed value because the heart disease rate in that group (i.e., 25%) conflicts with the negative association rule. But without knowing the MASK algorithm, "naive" Nash can only randomly guess the original value of heart disease to be AIDS or cancer.3 Thus, the probability of Tom having AIDS in his view is 50% × 1/2 + 50% × 1/4 = 3/8. This does not violate 2-confidentiality.

Now consider the view of "smart" Sam, who knows that MASK would not perturb any SA values in the "Japan" group unless the group violates 2-confidentiality after k-anonymization (i.e., Table 2b). Thus, Sam concludes that the "Japan" group must have at least three AIDS values (out of four tuples). As such, in "smart" Sam's view, the probability of Tom having AIDS is at least 3/4, which violates 2-confidentiality. Again, knowing the algorithm empowers "smart" Sam to gain a different


Table 2
An example of algorithm-based disclosure in the MASK algorithm.

(a) Microdata

ROW#     AGE  SEX  COUNTRY  DISEASE
1        46   F    Mexico   Cancer
2        49   F    Mexico   Heart disease
3        32   F    Mexico   Heart disease
4        35   F    Mexico   AIDS
5        24   F    Japan    AIDS
6        38   F    Japan    AIDS
7        25   M    Japan    AIDS
8 (Tom)  37   M    Japan    AIDS

(b) k-Anonymity table (k = 4)

AGE      SEX  COUNTRY  DISEASE
[32–49]  F    Mexico   Cancer
[32–49]  F    Mexico   Heart disease
[32–49]  F    Mexico   Heart disease
[32–49]  F    Mexico   AIDS
[24–38]  *    Japan    AIDS
[24–38]  *    Japan    AIDS
[24–38]  *    Japan    AIDS
[24–38]  *    Japan    AIDS

(c) m-Confidentiality (m = 2)

AGE      SEX  COUNTRY  DISEASE
[32–49]  F    Mexico   Cancer
[32–49]  F    Mexico   Heart disease
[32–49]  F    Mexico   Heart disease
[32–49]  F    Mexico   AIDS
[24–38]  *    Japan    Cancer
[24–38]  *    Japan    Cancer
[24–38]  *    Japan    Heart disease
[24–38]  *    Japan    AIDS


view from "naive" Nash, where Sam's view violates m-confidentiality.
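The SA-perturbation step of MASK used in this example can be sketched roughly as follows. This is our simplified reconstruction for illustration only, not the actual algorithm from [12]: the group representation, function names, and the random choices are assumptions, and we assume at least one group already satisfies ℓ-diversity.

```python
import random

SENSITIVE = {"AIDS"}
NON_SENSITIVE = ["Cancer", "Heart disease"]

def sensitive_fraction(group):
    return sum(v in SENSITIVE for v in group) / len(group)

def mask_perturb(groups, ell, rng):
    """Sketch of MASK's SA-perturbation step: in each k-anonymous group
    violating ell-diversity, replace randomly chosen sensitive values with
    random non-sensitive ones until the sensitive fraction drops to p, the
    fraction observed in a randomly chosen ell-diverse group."""
    safe = [g for g in groups if sensitive_fraction(g) <= 1 / ell]
    p = sensitive_fraction(rng.choice(safe))  # target fraction
    out = []
    for g in groups:
        g = list(g)
        if sensitive_fraction(g) > 1 / ell:   # group violates ell-diversity
            while sensitive_fraction(g) > p:  # perturb down to target p
                i = rng.choice([j for j, v in enumerate(g) if v in SENSITIVE])
                g[i] = rng.choice(NON_SENSITIVE)
        out.append(g)
    return out

# Table 2b's SA values: the "Mexico" group satisfies 2-diversity (p = 1/4);
# the "Japan" group (all AIDS) is perturbed down to one AIDS out of four.
groups = [["Cancer", "Heart disease", "Heart disease", "AIDS"],
          ["AIDS", "AIDS", "AIDS", "AIDS"]]
published = mask_perturb(groups, ell=2, rng=random.Random(0))
print(published)
```

Running this on Table 2b's groups leaves the "Mexico" group untouched and always leaves exactly one AIDS value in the "Japan" group, which is why Sam can reason backwards from Table 2c.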

Consider another algorithm-based disclosure situation in MASK, when an adversary has access to some original SA distribution. This is common in reality because data publishers may report statistics for public use. For example, in order to ease the fear of increasing cancer incidence in the community, a local hospital may announce that "only one out of eight hospitalized patients (in the published table) has cancer".

Now consider what "naive" Nash can compromise from the published Table 2c. He can confirm that MASK should have perturbed two SA values to cancer. However, he cannot further tell which two out of the three cancer values are the perturbed ones, and whether AIDS or heart disease is the original value. As such, the probability of Tom having AIDS in the view of "naive" Nash is 33.3% × 3/8 + 33.3% × 3/8 + 33.3% × 1/2 = 5/12. Likewise, this does not violate 2-confidentiality.

In contrast, "smart" Sam, who knows the MASK algorithm, can infer that the two extra cancer values must be from the "Japan" group, and that AIDS is their original value. The reason is that, otherwise, MASK would not conduct any perturbation, because both groups in the table after k-anonymization would already satisfy 2-confidentiality. Thereby, there exists algorithm-based disclosure, because the probability of Tom having AIDS in Sam's view is at least 3/4, which violates 2-confidentiality.
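Nash's two posteriors above can be double-checked with exact rational arithmetic; the sketch below (ours) simply re-derives the 3/8 and 5/12 figures from the case probabilities stated in the text.

```python
from fractions import Fraction as F

# Negative-association-rule case: the heart disease value in the "Japan"
# group is perturbed; its original is AIDS or cancer with probability 1/2
# each, giving Tom AIDS with probability 1/2 or 1/4 respectively.
nash_rule = F(1, 2) * F(1, 2) + F(1, 2) * F(1, 4)

# SA-distribution case: three equally likely choices of which two of the
# three cancer values are perturbed give Tom AIDS with probability
# 3/8, 3/8 and 1/2, respectively.
nash_stats = F(1, 3) * F(3, 8) + F(1, 3) * F(3, 8) + F(1, 3) * F(1, 2)

print(nash_rule, nash_stats)  # 3/8 5/12
```

Both values stay below the 1/2 bound of 2-confidentiality, whereas Sam's posterior of at least 3/4 exceeds it.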

As we can see, MASK is still subject to algorithm-based disclosure, and algorithm-based disclosure can exist alongside various types of external knowledge, or even without external knowledge. We re-emphasize that this paper aims to limit algorithm-based disclosure (i.e., the view of "smart" Sam), that is, the private information beyond what can be gained from external knowledge alone.

3. Preliminaries

3.1. Privacy-preserving data publishing

Consider T = {t1, …, tn}, a microdata table of n tuples. Each ti consists of d QI attributes ⟨Q1, Q2, …, Qd⟩, denoted by Q, and one SA attribute, denoted by S. For example, Table 2a is a table of n = 8 tuples. Each tuple has d = 3 QI attributes, i.e., Q = ⟨Age, Sex, Country⟩, and 1 SA, i.e., S = Disease. For any tuple t ∈ T, let t[Q] = ⟨t[Q1], t[Q2], …, t[Qd]⟩ be the vector of QI values in tuple t, and let t[S] be the SA value of t. Let DQ = DQ1 × DQ2 × ⋯ × DQd and DS be the finite domains of Q and S, respectively. We say a tuple t = ⟨q, s⟩ with q ∈ DQ, s ∈ DS is in the table T (denoted by t ∈ T) iff ∃i ∈ [1, n] such that t[Q] = q and t[S] = s.

Before releasing the data, a data publisher takes a data

publishing algorithm A to perturb the microdata T. Let T* be the published table of T, and let Q* be the perturbed QI attributes in T*. The published table T* consists of several QI-groups, i.e., partitions of tuples such that each individual tuple is indistinguishable from any other in the same QI-group. Thus, the correlation between QI and SA attributes in T* is regarded as the private information to be protected. We represent such QI–SA correlation by S*(·), which maps q ∈ DQ, the QI attributes of an individual tuple, to the posterior distribution of SA for that tuple. Formally, we have the following definition:

Definition 1 (QI–SA correlation). Let T and T* be the original and published microdata tables, respectively. Given any tuple t = ⟨q, s⟩ ∈ T, the QI–SA correlation S*(q) in T* with respect to t is a |DS|-component vector ⟨S*(q)[s1], S*(q)[s2], …, S*(q)[s|DS|]⟩, where |DS| equals the SA domain size and S*(q)[si] = Pr{t[S] = si | t[Q] = q, T*}.

An example of Q* and S*(·) for Table 1b (and Table 2c) is shown in Table 3 (and Table 4, respectively).

We are now ready to state the ℓ-diversity privacy model [2] in terms of S*(·). In particular, we adopt a simple variation of ℓ-diversity [10,12,11] which requires that no individual SA value can be compromised with probability over 1/ℓ:

Definition 2 (ℓ-diversity, Machanavajjhala et al. [2]). A published table T* fulfills ℓ-diversity iff for all t = ⟨q, s⟩ ∈ T,

max over i ∈ [1, |DS|] of S*(q)[si] ≤ 1/ℓ.
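Definitions 1 and 2 can be illustrated concretely. The sketch below (ours, not the paper's code) estimates S*(q) as the empirical SA distribution within q's QI-group of the published table, then checks Definition 2 against Table 1b; the encoding of the table as (QI, SA) pairs is an assumption.

```python
from collections import Counter

def qi_sa_correlation(published):
    """S*(q): empirical SA distribution within q's QI-group of T*."""
    groups = {}
    for q, s in published:
        groups.setdefault(q, []).append(s)
    return {q: {s: c / len(vals) for s, c in Counter(vals).items()}
            for q, vals in groups.items()}

def satisfies_l_diversity(published, ell):
    """Definition 2: every component of every S*(q) is at most 1/ell."""
    corr = qi_sa_correlation(published)
    return all(p <= 1 / ell for dist in corr.values() for p in dist.values())

# Table 1b: two QI-groups, "F" and the generalized "*".
table_1b = [("F", "gastritis"), ("F", "heart disease"),
            ("*", "cancer"), ("*", "diabetes"), ("*", "AIDS"), ("*", "AIDS")]
print(qi_sa_correlation(table_1b)["*"]["AIDS"])  # 0.5, matching Table 3
print(satisfies_l_diversity(table_1b, 2))        # True
```

The same table fails 3-diversity, since several components of S* equal 1/2 > 1/3.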

3.2. Expression of external knowledge

As discussed in Section 2, certain types of external knowledge may facilitate algorithm-based disclosure,


Table 3
Q* and S* of Table 1b.

(a) Q*

SEX
F
*

(b) S*

AIDS  Cancer  Diabetes  Gastritis  Heart disease
0     0       0         1/2        1/2
1/2   1/4     1/4       0          0

Table 4
Q* and S* of Table 2c.

(a) Q*

AGE      SEX  COUNTRY
[32–49]  F    Mexico
[24–38]  *    Japan

(b) S*

AIDS  Cancer  Heart disease
1/4   1/4     1/2
1/4   1/2     1/4


though algorithm-based disclosure does not mandate the adversary's possession of external knowledge. For ease of understanding of our ASAP model in the next section, we formalize a simple expression of external knowledge by conjunctive COUNT queries. Given the microdata T, consider a COUNT query CQ(T) of the form:

SELECT COUNT(*) FROM T
WHERE (Q1 = q1) ∧ ⋯ ∧ (Qd = qd) ∧ (S = s)

Note that the selection condition (i.e., the WHERE clause) need not include every Qj (j ∈ [1, d]) or S in T. We describe external knowledge Ke as arithmetic equations (or inequalities) between a pair of COUNT query answers, or between one COUNT query answer and one constant.

Consider Table 2a as the microdata. An example of external knowledge about Tom, who is a 37-year-old male from Japan, is ''Tom does not have cancer''. We can express such Ke as CQ(T) = 0, where CQ(T) = SELECT COUNT(*) FROM T WHERE Age = 37 ∧ Sex = M ∧ Country = Japan ∧ Disease = cancer.

Another example of Ke is ''Japanese have an extremely low incidence of heart disease'', which can be described by CQ1(T)/CQ2(T) < 0.05,⁴ where CQ1(T) = SELECT COUNT(*) FROM T WHERE Country = Japan ∧ Disease = heart disease and CQ2(T) = SELECT COUNT(*) FROM T.
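Both examples can be checked mechanically against a candidate microdata table. A small Python sketch (the helper name `count_query` and the toy rows are our own, not from the paper):

```python
def count_query(table, conditions):
    """Conjunctive COUNT(*): number of tuples matching every (attribute, value) pair."""
    return sum(all(t[a] == v for a, v in conditions.items()) for t in table)

# Toy microdata in the spirit of Table 2a (values are illustrative)
T = [
    {"Age": 37, "Sex": "M", "Country": "Japan",  "Disease": "gastritis"},
    {"Age": 35, "Sex": "F", "Country": "Mexico", "Disease": "cancer"},
    {"Age": 24, "Sex": "M", "Country": "Japan",  "Disease": "AIDS"},
]

# Ke: "Tom does not have cancer"  ->  CQ(T) = 0
cq = count_query(T, {"Age": 37, "Sex": "M", "Country": "Japan", "Disease": "cancer"})
print(cq == 0)          # True: the knowledge is consistent with T

# Ke: "Japanese have an extremely low incidence of heart disease" -> CQ1/CQ2 < 0.05
cq1 = count_query(T, {"Country": "Japan", "Disease": "heart disease"})
cq2 = count_query(T, {})          # empty condition: COUNT(*) over the whole table
print(cq1 / cq2 < 0.05)          # True here, since cq1 = 0
```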

⁴ The value of 0.05 can be adjusted according to actual needs for reflecting the effect of ''extremely low incidence''.

4. Algorithm-SAfe Publishing

This section formalizes algorithm-based disclosure by introducing a new model called Algorithm-SAfe Publishing (ASAP). A data publishing algorithm is vulnerable to algorithm-based disclosure when it violates ASAP. We will first define two key concepts relating to ASAP: a naive random world and a smart random world, which model the view without and with knowledge of the algorithm, respectively. Then, we will define ASAP based on the equivalence between these two worlds.

4.1. Naive vs. Smart random world

As in Section 3.1, let DQ and DS be the domains of QI and SA, respectively. Let Ω be the finite set of all possible microdata tables whose values can be drawn from DQ × DS. When an adversary with external knowledge Ke observes a published table T*, his/her view of the microdata table T can be modeled as a (posterior) probability distribution over Ω, that is, a mapping from any T′ ⊆ Ω to a real value Pr(T = T′ | T*, Ke) ∈ [0, 1], such that Σ_{T′ ⊆ Ω} Pr(T = T′ | T*, Ke) = 1.

First, let us consider ''naive'' Nash as illustrated in Section 2. In his view of any T′ ⊆ Ω, T′ is likely to be the microdata (i.e., Pr(T = T′ | T*, Ke) > 0) iff: (1) every tuple t ∈ T′ is bijectively mapped to a tuple t* ∈ T* such that the value of t*[Q*] is no less specific than that of t[Q]; and (2) T′ satisfies the integrity conditions imposed by Ke. To denote such a ''non-zero likelihood'' relationship, we write T′ ⇒ T*, indicating that T′ could possibly be published as T*. Nonetheless, without learning the data publishing algorithm A, ''naive'' Nash cannot distinguish between any T′ in the set {T′ | T′ ⊆ Ω, T′ ⇒ T*}. According to the standard random world assumption [2,23], Nash has to assign an equal probability to each of them. Thus, we define the view of Nash as a naive random world NW(·):

Definition 3 (Naive random world). A naive random world NW(·) is a probability distribution such that ∀T′ ⊆ Ω,

NW(T′) = 1/c if T′ ⇒ T*, and NW(T′) = 0 otherwise,

where c = |{T′ | T′ ⊆ Ω, T′ ⇒ T*}|.

To illustrate the definition, consider the previous example of Table 1 in a simplified way: AIDS is the sensitive SA value (shaded background) while the other SA values (no background) are indistinguishable. Suppose Nash has external knowledge Ke in the form of two rules: ''Amy and Grace are unlikely to have AIDS'' and ''at least one male has AIDS''. Table 5a shows an example of this naive random world NW(·). In the view of ''naive'' Nash, he can find a total of six values of T′ after linking AIDS to the original QI attributes without violating Ke. Since Nash cannot distinguish any of these six values (i.e., tables) from one another, NW(T′) = 1/6 has to be assigned equally to each T′. Thereby, the probability distribution over these six T′ constitutes the naive random world NW.

Second, consider the view of ''smart'' Sam, who learns the mechanism of the data publishing algorithm A. Sam is able to further distinguish each T′ in the set {T′ | T′ ⊆ Ω, T′ ⇒ T*} by taking each T′ as the input to A and then checking whether A truly outputs T* as the published table (if A is a deterministic algorithm), or whether T* could be a possible output (if A is a randomized algorithm). We define the view of Sam as a smart random world SW(·):

Table 5. An example of the naive and smart random world.

(a) Naive random world (each ''A'' marks a possible world in which the tuple is assigned AIDS)

NAME    QI
Amy     F
Eva     F     A  A
Grace   F
Helen   F     A  A
Jack    M     A  A  A
Tom     M     A  A  A

(b) Smart random world

NAME    QI    SA
Amy     F
Eva     F
Grace   F
Helen   F
Jack    M     A
Tom     M     A

Definition 4 (Smart random world). A smart random world SW(·) is a probability distribution such that ∀T′ ⊆ Ω,

SW(T′) = Pr{T = T′ | A} if T′ ⇒ T*, and SW(T′) = 0 otherwise.

Return to the example in Table 5a. ''Smart'' Sam iteratively performs the 2-diversity algorithm, accepting each T′ in Table 5a as input. Table 5b shows the only T′ that may produce T* according to the algorithm: all the other five T′ already satisfy 2-diversity, so any further generalization of them is unnecessary. One can see that ''smart'' Sam can then construct the smart random world SW by assigning SW(T′) = 1 to the T′ in Table 5b and SW(T′) = 0 to all others.

4.2. Definition of ASAP

One can see from the above discussion that whether or not an algorithm is vulnerable to algorithm-based disclosure is determined by whether the naive random world NW equals the smart random world SW given the same external knowledge Ke and published table T*: there is no algorithm-based disclosure iff the two worlds are always equivalent conditioned on the same external knowledge and published table. Formally, we define Algorithm-SAfe Publishing (ASAP) as follows.

Definition 5 (Algorithm-SAfe publishing). A published table T* fulfills Algorithm-SAfe Publishing (ASAP) iff ∀t = ⟨q, s⟩ ∈ T and ∀si ∈ DS,

Pr{t[S] = si | t[Q] = q, NW} = Pr{t[S] = si | t[Q] = q, SW}.
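For toy instances, Definitions 3–5 can be tested by brute force: enumerate every candidate T′ with T′ ⇒ T*, give each equal mass to obtain NW, keep only the candidates that a deterministic algorithm A actually maps to T* to obtain SW, and compare the per-tuple SA posteriors. A sketch under our own naming conventions (not the paper's code; feasible only for tiny tables and deterministic algorithms):

```python
from collections import defaultdict
from fractions import Fraction

def sa_posterior(worlds):
    """Per-position SA distribution under a uniform distribution over worlds.
    Each world is a tuple of SA values, one per (fixed-QI) tuple."""
    post = defaultdict(lambda: defaultdict(Fraction))
    w = Fraction(1, len(worlds))
    for world in worlds:
        for i, s in enumerate(world):
            post[i][s] += w
    return {i: dict(d) for i, d in post.items()}

def is_asap(candidates, algorithm, published):
    """Compare the naive random world (all candidates T' with T' => T*) with
    the smart random world (candidates the deterministic algorithm actually
    maps to the published table)."""
    smart = [w for w in candidates if algorithm(w) == published]
    if not smart:                       # T* could not have been produced at all
        return False
    return sa_posterior(candidates) == sa_posterior(smart)

# Toy deterministic "algorithm": publish the multiset of SA values, sorted.
algo = lambda world: tuple(sorted(world))
published = ("AIDS", "flu")
candidates = [("AIDS", "flu"), ("flu", "AIDS")]   # both satisfy T' => T*
print(is_asap(candidates, algo, published))        # True: SW keeps both worlds
```

A publishing rule that depends on the hidden order of SA values would keep only one candidate in SW, making the posteriors diverge and the check fail.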

5. Checking algorithm-based disclosure

This section presents two necessary and two sufficient conditions for ASAP. These conditions jointly serve as an exploratory tool to screen a data publishing algorithm for algorithm-based disclosure.

5.1. Necessary condition 1: Q*-independence

Q*-independence, our first necessary condition, is motivated by cases where an adversary may learn the original QI through external knowledge. In particular, Q*-independence requires that, during QI perturbation (e.g., generalization), the data publishing algorithm must not use any QI–SA correlation that cannot be inferred from the published S*; otherwise, algorithm-based disclosure may occur. This necessary condition explains why Example 1 in Section 2 is subject to algorithm-based disclosure.

Theorem 5.1 (Q*-independence). Let T be a microdata table and T* be its ASAP published table. The published QI attributes Q* in T* must be conditionally independent of the original SA attribute S in T, given a combination of the original QI attributes Q and the published QI–SA correlation S*, denoted by Q* ⊥ S | (Q, S*).

Proof. Let DQ and DS be the domains of the original QI and SA, respectively, and let Ke be the external knowledge; in our case, Ke is specialized to be the original QI.

First, consider the view of the naive random world (Definition 3). Given a published table T* and the original QI attribute t[Q] = q such that t = ⟨q, s⟩ ∈ T, a naive adversary is unable to distinguish any si ∈ DS such that S*(q)[si] > 0. Moreover, t[S] = si and S*(q′), ∀q′ ∈ Q\q (i.e., the published SA of tuples other than q), are conditionally independent in the naive random world, given S*(q). Thus, we have

Pr{t[S] = si | t[Q] = q, NW} = Pr{t[S] = si | t[Q] = q, S*(q)}.  (1)

Second, consider the view of the smart random world (Definition 4). A smart adversary can further distinguish si by checking whether or not the algorithm would publish a table with Q* (perturbed from Q) and with the QI–SA correlation S* as is. Hence, we have

Pr{t[S] = si | t[Q] = q, SW} = Pr{t[S] = si | t[Q] = q, S*, Q*, Q}.  (2)

From the definition of ASAP (i.e., Definition 5), we have

Pr{t[S] = si | t[Q] = q, S*(q)} = Pr{t[S] = si | t[Q] = q, S*, Q*, Q}.  (3)

Consider t[S] as a random variable. To measure theuncertainty of t[S], Eq. (3) can be transformed from theperspective of information theory [24] as

Hðt½S�jt½Q � ¼ q,S�ðqÞ,Q Þ ¼Hðt½S�jt½Q � ¼ q,S�,Q�,Q Þ, ð4Þ

where HðxjyÞ is the conditional entropy [25] which mea-sures the uncertainty of a random variable x given y

is known.



For a microdata table T with n tuples t1, t2, ..., tn, where ti[Q] = qi, i ∈ [1, n], we have

Σ_{1≤i≤n} H(ti[S] | ti[Q] = qi, S*(qi), Q) = Σ_{1≤i≤n} H(ti[S] | ti[Q] = qi, S*, Q*, Q)  (5)

⇒ H(S | S*, Q) = H(S | S*, Q*, Q)  (6)

⇒ I(S; Q* | S*, Q) = 0,  (7)

where I(x; y | z) is the conditional mutual information [25], indicating the amount of information about either x or y provided by knowing the other, given that z is known. The equivalence between Eqs. (5) and (6) holds for two reasons: (1) ti[S] = si and S*(q′), ∀q′ ∈ Q\qi, are conditionally independent in the naive random world, given S*(qi); and (2) each tuple t ∈ T can be considered to be independently generated with a certain distribution on DQ × DS.

From Eq. (7), we know that Q* and S must be conditionally independent given S* and Q; otherwise, their conditional mutual information would not be zero. Thereby, our necessary condition, Theorem 5.1, is proved. □

The basic idea of the proof can be stated as follows: in the perturbation of Q to Q*, suppose a data publishing algorithm consults certain QI–SA correlation information but does not ultimately publish it in T*. According to the basics of mutual information [25], an adversary can recover the consulted QI–SA correlation (even though unpublished) from the perturbation of Q to Q*, which can be readily observed by an adversary with knowledge of QI.

Examples of existing algorithms which violate Q*-independence include algorithms designed for ℓ-diversity [2,3,11,8,9], t-closeness [4], (k,e)-anonymity [5], (c,k)-safety [23], etc. The reason can be intuitively stated as follows. All these data publishing algorithms follow a similar pattern: they gradually optimize the published table according to a utility metric (e.g., discernibility [26], classification metric [27], KL-divergence [28]) until reaching a table which violates the privacy guarantee to be achieved. At this time, they fall back to the previous table, which provides the best utility without violating the privacy guarantee. Unfortunately, Q*-independence is violated at the moment when the algorithm determines that a table violates the privacy guarantee because, to make such a decision, the algorithm must have observed certain QI–SA correlation which can never appear in the published table (as such correlation violates the privacy guarantee). As a result, according to Theorem 5.1, the unpublished, guarantee-violating QI–SA correlation may be derived by smart Sam from observing how QI has been perturbed in the published table. Thus, these algorithms are vulnerable to algorithm-based disclosure.

5.2. Necessary condition 2: S*-independence

In analogy to Q*-independence, our second necessary condition, S*-independence, aims to check algorithm-based disclosure by analyzing the QI–SA correlation used for generating S* in the published table. In addition to the external knowledge of QI considered in Q*-independence, S*-independence also takes into account possible external knowledge in the form of negative association rules, e.g., ''Tom is unlikely to have cancer''. Essentially, S*-independence states that no QI–SA correlation beyond what is ultimately published should be used in the perturbation of SA.

For ease of understanding, let us first introduce a few notations. Given an individual tuple t = ⟨q, s⟩ ∈ T, the external knowledge Ke may include a set of ''unlikely'' SA values S′ ⊆ DS such that each value in S′ cannot be linked to that individual tuple t, i.e., ∀s ∈ S′, S*(q)[s] = 0; otherwise, we say the published table T* contradicts Ke. S*(q)[DS\S′] represents the distribution over all ''likely'' SA values in T* that can be linked to q in T. Return to Example 2 in Section 2: the external knowledge ''Tom is unlikely to have cancer'' indicates that S′ = {cancer} for Tom. Referring to Table 4, we have S*(q)[DS\S′] = ⟨S*(q)[AIDS], S*(q)[heart disease]⟩ = ⟨1/4, 1/4⟩.

Theorem 5.2 (S*-independence). Let T be a microdata table and T* be its ASAP published table. For any individual tuple t = ⟨q, s⟩ ∈ T, given a set of ''unlikely'' SA values S′ and the published S*(q)[DS\S′] in T*, the published QI–SA correlation S*(q) in T* must be conditionally independent of its original SA attribute s in T, denoted by S*(q) ⊥ s | (S′, S*(q)[DS\S′]).

Proof. Let Ke be the external knowledge, which is essentially S′ regarding an individual tuple t = ⟨q, s⟩ ∈ T. We first consider the view of the naive random world. A naive adversary has no ability to distinguish any si ∈ DS\S′, given T*. Hence, we have

Pr{t[S] = si | t[Q] = q, NW} = Pr{t[S] = si | t[Q] = q, S*(q)[DS\S′], S′}.  (8)

Next, in the view of the smart random world, a smart adversary would further verify whether the data publishing algorithm would publish Q* and S* as is when s/he learns S′ regarding the individual t. Hence, we have

Pr{t[S] = si | t[Q] = q, SW} = Pr{t[S] = si | t[Q] = q, S*, Q*, S′}.  (9)

Similar to the previous proof of Theorem 5.1, we follow Definition 5 and account for the equivalence between Eqs. (8) and (9) in an information-theoretic way:

H(t[S] | t[Q] = q, S*(q)[DS\S′], S′) = H(t[S] | t[Q] = q, S*, Q*, S′).  (10)

Since the uncertainty measured by the conditional entropy H(x | y) can be reduced to H(x | z) when y can be derived from z [25], we have the following inequality:

H(t[S] | t[Q] = q, S*(q), S′) ≥ H(t[S] | t[Q] = q, S*, Q*, S′).  (11)

Inequality (11) holds because S*(q) can be derived from {t[Q] = q, S*}. Therefore, combining Eq. (10) and Inequality (11), we have

H(t[S] | t[Q] = q, S*(q), S′) ≥ H(t[S] | t[Q] = q, S*(q)[DS\S′], S′).  (12)

Consider another inequality:

H(t[S] | t[Q] = q, S*(q)[DS\S′], S′) ≥ H(t[S] | t[Q] = q, S*(q), S′).  (13)

Inequality (13) holds because S*(q)[DS\S′] can be derived from {S*(q), S′}.

By Inequalities (12) and (13), we then derive

H(t[S] | t[Q] = q, S*(q)[DS\S′], S′) = H(t[S] | t[Q] = q, S*(q), S′)  (14)

⇒ I(t[S]; S*(q)[S′] | t[Q] = q, S*(q)[DS\S′], S′) = 0.  (15)

Recall that S*(q)[S′] represents the ''unlikely'' SA values (specified by Ke) in T* to be linked to tuple t. By the definition of mutual information [25], Eq. (15) implies that at least one of the following two implications is true. The first implication is that S*(q)[S′] is determined by S′; in other words, T* never contradicts Ke. This is impossible in reality, however, because no one can foresee all kinds of S′ controlled by Ke. Thus, we focus on the second implication, which requires that S*(q)[S′] be conditionally independent of the original t[S]. Note that Σ_{s ∈ S′} S*(q)[s] = 1 − Σ_{s ∈ DS\S′} S*(q)[s]. Thus, the entire S*(q)[·] should be conditionally independent of the original t[S], given {t[Q] = q, S*(q)[DS\S′], S′}. As such, Theorem 5.2 is proved. □

⁵ Note that such a simulation is not a duplication of the data publishing process: when the data publishing algorithm is non-deterministic, one can simulate the randomized part by using the same random number generator, but does not have to generate the same random numbers.

The basic idea of the proof is similar to that of Theorem 5.1: according to the basics of mutual information [25], if the perturbation of SA relies on certain QI–SA correlation information that is not eventually published, then such unpublished information may be inferred by an adversary by observing whether the published SA contradicts his/her external knowledge Ke, i.e., whether ∀s ∈ S′, S*(q)[s] = 0.

For example, MASK [12] is vulnerable to algorithm-based disclosure due to a violation of Theorem 5.2. Recall that MASK first checks whether a group violates ℓ-diversity, and perturbs SA only in those offending groups. Such SA perturbation demands the usage of unpublished QI–SA correlation, because it is not (always) possible to determine whether a group violates ℓ-diversity based solely upon the table published by MASK. According to Theorem 5.2, S*-independence is violated.

5.3. ASAP sufficient conditions

This subsection provides two sufficient conditions for publishing an ASAP table. Recall from the two necessary conditions that a fundamental cause of the vulnerability of many existing data publishing algorithms is the usage of unpublished QI–SA correlation in the perturbation of QI and/or SA. Our first sufficient condition focuses on correcting the problem of QI perturbation while prohibiting SA perturbation.

Theorem 5.3 (ASAP sufficient condition 1). The published table of a data publishing algorithm fulfills ASAP if the algorithm satisfies both of the following two conditions: (1) QI is perturbed whereas SA is not, and (2) the QI perturbation depends on no information beyond the original QI and the published QI–SA correlation.

Proof. We prove it by contradiction. Assume that a published table T* generated by an algorithm satisfying both conditions in Theorem 5.3 is not ASAP. From the perspective of information theory, there must then exist a tuple t = ⟨q, s⟩ ∈ T and external knowledge Ke such that

H(t[S] | t[Q] = q, Q*, S*, Ke) < H(t[S] | t[Q] = q, S*(q), Ke).  (16)

Note that H(t[S] | t[Q] = q′, Q*, S*, Ke) ≤ H(t[S] | t[Q] = q′, S*(q′), Ke) holds for all other tuples ⟨q′, s⟩ ∈ T. Since SA is not perturbed, we know that t[S] and S*(q′), ∀q′ ∈ Q\q (i.e., the published SA of tuples other than q), are conditionally independent given S*(q). Moreover, since each tuple t ∈ T can be considered to be independently generated with a certain distribution on DQ × DS, we have

H(S | Q, Q*, S*, Ke) < H(S | Q, S*, Ke)  (17)

⇒ I(S; Q* | Q, S*, Ke) > 0.  (18)

From the second condition in Theorem 5.3, we know that H(Q* | Q, S*, Ke) = 0 for all possible Ke. As such, given Q and S*, S can provide no additional information about Q*. That is,

I(S; Q* | Q, S*, Ke) ≤ H(Q* | Q, S*, Ke) = 0.  (19)

This contradicts (18). Thus, algorithms satisfying both conditions in Theorem 5.3 always publish ASAP tables. □

Intuitively, this sufficient condition states that if anyone who has access to the data publishing algorithm, the published table, and the original Q can simulate⁵ the perturbation of QI without consulting any additional (unpublished) QI–SA correlation, then the QI-perturbation process is immune from algorithm-based disclosure. For example, Anatomy [10] satisfies Theorem 5.3 because (1) Anatomy uses only SA values in the partitioning (i.e., QI perturbation) and (2) it does not perturb SA at all. Thus, Anatomy is immune from algorithm-based disclosure.

Analogously, we propose the second sufficient condition, which also addresses SA perturbation.

Theorem 5.4 (ASAP sufficient condition 2). The published table of a data publishing algorithm fulfills ASAP if the algorithm satisfies both of the following two conditions: (1) the QI perturbation depends on no information beyond the original QI, and (2) the SA perturbation depends on no information beyond the published QI–SA correlation.

Proof. The proof is analogous to that of Theorem 5.3. □

One can see that the intuitive explanation of this sufficient condition is also similar to that of sufficient condition 1: it again assures ASAP if anyone who has access to the data publishing algorithm, the published table, and the original Q can simulate the perturbation process, this time on both QI and SA. Therefore, in the rest of this paper, we say a data publishing algorithm is simulatable iff the algorithm satisfies either Theorem 5.3 or Theorem 5.4.

Nevertheless, it is important to point out that neither Theorem 5.3 nor Theorem 5.4 is necessary (although each is sufficient) for publishing ASAP tables. To see this, consider a simple algorithm that guarantees ℓ-diversity by generalizing QI using existing algorithms such as Mondrian [8] and then suppressing all SA values from the published table. This algorithm certainly satisfies ASAP because no QI–SA correlation is ever disclosed. Nonetheless, it violates the first condition of Theorem 5.3 because SA is suppressed (i.e., perturbed), and it violates the first condition of Theorem 5.4 because the perturbation of QI consults unpublished QI–SA correlation (see Section 5.1).

6. Eliminating algorithm-based disclosure

This section introduces two amendment tools: global look-ahead and local look-ahead, each of which on its own suffices to eliminate algorithm-based disclosure from vulnerable algorithms. To demonstrate the power of global look-ahead, we apply it to alter the design of two popular ℓ-diversity algorithms, Mondrian [8] and Hilb [11], and prove that the revised algorithms satisfy Theorem 5.3. To demonstrate the power of local look-ahead, we transform the m-confidentiality algorithm MASK [12] into a simulatable algorithm by proving that the revised algorithm satisfies Theorem 5.4.

6.1. A running example

For ease of understanding, we use the following simple microdata table as a running example throughout this section: there are eight tuples with two QI attributes, x and y, and one SA. Fig. 1 depicts a two-dimensional visualization of the tuples on an x–y plane. In the figure, each tuple is represented by a circle, with its x and y coordinates indicating its values of QI attributes x and y, respectively, and its painted pattern indicating the SA value. Each pattern corresponds to a distinct SA value; thus, there are six different SA values in our example.

Fig. 1. x–y plane of eight tuples.

6.2. Global look-ahead

Global look-ahead enables a data publishing algorithm to generate a QI perturbation (e.g., a partition of tuples into QI-indistinguishable groups) only if a worst-case (i.e., most skewed) scenario of QI–SA correlation is still able to achieve the predefined privacy model, such as ℓ-diversity. Global look-ahead guarantees that no unpublished QI–SA correlation information will be used. To illustrate it, we take two existing ℓ-diversity algorithms, Mondrian and Hilb, as examples.

6.2.1. From Mondrian to simulatable Mondrian+

We first review the original Mondrian [8], and then discuss how to transform it via global look-ahead into Mondrian+, a simulatable data publishing algorithm.

The original Mondrian works in a recursive fashion. A simple implementation begins by selecting the split attribute with the largest range of values; alternatively, the strategies discussed in [9] can be used for the selection as well. After that, Mondrian repeatedly partitions G (initially T) into two groups G1 and G2, which contain the tuples of G divided by the median coordinate on the split attribute. Let |·| denote the number of tuples in a set, and Smax(·) the number of tuples with the most frequent SA value in the set. If either |G1| < ℓ·Smax(G1) or |G2| < ℓ·Smax(G2) holds, the partitioning trial has to be revoked.

Fig. 2a shows an example of the original Mondrian algorithm. Suppose the algorithm first chooses x as the split attribute with the largest range, and partitions the eight tuples by the median coordinate (x = 5) into two QI-groups: G1 = {t4, t5, t7, t8} and G2 = {t1, t2, t3, t6}. Both G1 and G2 satisfy 2-diversity because |G1| = 4 ≥ ℓ·Smax(G1) = 2×2 = 4 and |G2| = 4 ≥ ℓ·Smax(G2) = 2×1 = 2. For the same reason, G2 is further partitioned and published as two QI-groups, Group2 and Group3 (see Fig. 2a).

Unlike G2, G1 has to be published as is. The reason is that regardless of which median coordinate is chosen (i.e., y = 3 or x = 2), any further partitioning trial has to be revoked due to violating 2-diversity. Take the case of y = 3 as the median coordinate in G1, for example. Both QI-groups g1 = {t4, t5} and g2 = {t7, t8} violate 2-diversity because |g1| = 2 < ℓ·Smax(g1) = 2×2 = 4 and |g2| = 2 < ℓ·Smax(g2) = 2×2 = 4. Note that the information Smax(g1) = 2 and Smax(g2) = 2, consulted by Mondrian at this point, cannot be recovered from the ultimately published but un-partitioned G1 = {t4, t5, t7, t8} (i.e., Group1 in Fig. 2a). Hence, the problem of the original Mondrian is that it uses unpublished QI–SA correlation information in the QI perturbation, specifically when a partitioning has to be revoked.

To fix the problem, we follow the idea of global look-ahead and alter Mondrian into Mondrian+ with a minor change. In particular, Mondrian+ revokes a partitioning of G into {G1, G2} only if either |G1| < ℓ·Smax(G) or |G2| < ℓ·Smax(G) holds. In other words, Mondrian+ allows a partitioning when each QI-group generated from the partitioning is able to satisfy ℓ-diversity in the worst-case scenario looked ahead by Smax(G), i.e., even when all Smax(G) tuples with the most frequent SA value are partitioned into the same QI-group. The reason is that, for each G1, we have

Smax(G1)/|G1| ≤ Smax(G)/|G1| ≤ Smax(G)/(ℓ·Smax(G)) = 1/ℓ.  (20)

Fig. 2. Fix Mondrian ℓ-diversity (ℓ = 2). (a) Original Mondrian. (b) Mondrian+.

For the same reason, Smax(G2)/|G2| ≤ 1/ℓ holds as well. Fig. 2b illustrates our Mondrian+ algorithm. First, Mondrian+ generates two QI-groups, G1 = {t4, t5, t7, t8} and G2 = {t1, t2, t3, t6}, after performing the very same first partitioning as the original Mondrian does in Fig. 2a. The reason is that both G1 and G2 can achieve 2-diversity even if the worst case happens, i.e., |G1| = 4 ≥ ℓ·Smax(G) = 2×2 = 4 and |G2| = 4 ≥ ℓ·Smax(G) = 2×2 = 4. However, G1 is published as is, without any further partitioning, because the size of any QI-group g partitioned from G1 must be less than |G1| = 4, so |g| ≥ ℓ·Smax(G1) = 2×2 = 4 cannot be satisfied; in other words, it is impossible to achieve 2-diversity in the worst-case scenario. Unlike with the original Mondrian, the QI–SA correlation consulted by Mondrian+ at this point, i.e., Smax(G1) = 2, can always be recovered from the published data: for example, we can deduce it by counting the SA values in the published Group1. In the same fashion, it is easily verified that G2 would be partitioned by Mondrian+ and published as Group2 and Group3.

It can easily be proved that Mondrian+ satisfies Theorem 5.3. First, the first condition of Theorem 5.3 is automatically satisfied because Mondrian+ only perturbs QI and never perturbs SA. Second, the second condition holds because the QI perturbation in Mondrian+ uses only the original QI attributes and the published SA (i.e., Smax(·)). Therefore, we have the following theorem:

Theorem 6.1. The Mondrian+ ℓ-diversity algorithm is simulatable.

Algorithm 1 details the steps of Mondrian+, which differs from the original Mondrian only by a minor change. Line 4 implements the idea of global look-ahead. MONDRIAN(G, k) in Line 5 denotes a function that invokes the k-anonymity Mondrian algorithm [8] on the dataset G; it can be replaced with other k-anonymity algorithms (e.g., K-OPTIMIZE [26], Datafly [29]). The time complexity is O(n(log n)²), where n is the number of tuples in the microdata table T. In particular, the number of iterations from Lines 2 to 11 is at most O(log n), and, following the analysis in [8], Line 5 takes O(n log n) time. Hence, the overall time complexity is O(n(log n)²).

Algorithm 1. Simulatable Mondrian+ algorithm.

1: QIGroup ← ∅. InputSet ← {T}.
2: repeat
3:   G ← the largest group in InputSet.
4:   if |G1| ≥ ℓ·Smax(G) && |G2| ≥ ℓ·Smax(G) then
5:     {G1, G2} ← MONDRIAN(G, ℓ·Smax(G)).
6:     QIGroup ← {QIGroup\G} ∪ {G1, G2}.
7:   else
8:     InputSet ← InputSet\G.
9:     QIGroup ← QIGroup ∪ G.
10:  end if
11: until InputSet = ∅.
12: return QIGroup.
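The look-ahead test can be sketched compactly in Python (our own simplified recursion, not the paper's implementation; the real Mondrian selects split attributes and medians as described above, and the eight tuples here are illustrative):

```python
from collections import Counter

def s_max(group):
    """Number of tuples carrying the most frequent SA value in the group."""
    return Counter(t["SA"] for t in group).most_common(1)[0][1]

def mondrian_plus(group, l, split):
    """Recursive Mondrian+ sketch: partition only if BOTH halves pass the
    worst-case test |Gi| >= l * Smax(G) (the global look-ahead), so the
    decision uses only Smax(G), which is recoverable from the published group."""
    g1, g2 = split(group)
    if g1 and g2 and len(g1) >= l * s_max(group) and len(g2) >= l * s_max(group):
        return mondrian_plus(g1, l, split) + mondrian_plus(g2, l, split)
    return [group]                          # publish the group as is

# Eight tuples in the spirit of Fig. 2 (illustrative data): split at the median of x.
T = [{"x": i, "SA": sa} for i, sa in enumerate(["a", "b", "c", "d", "d", "e", "f", "f"])]
def median_split(g):
    s = sorted(g, key=lambda t: t["x"])
    return s[: len(s) // 2], s[len(s) // 2 :]

groups = mondrian_plus(T, 2, median_split)
print([len(g) for g in groups])             # → [2, 2, 4]
```

The last group of four tuples stays unpartitioned because its most frequent SA value appears twice, so neither half of a further split could pass the worst-case test, exactly the behavior of Mondrian+ on Group1 in Fig. 2b.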

6.2.2. From Hilb to simulatable Hilb+

We now discuss how to transform the state-of-the-art ℓ-diversity algorithm Hilb [11] into a simulatable algorithm via global look-ahead. The original Hilb works as follows. First, it transforms, via a Hilbert curve, the multi-dimensional QI space of a microdata table T into a single-dimensional space QT. Based on the QT values, Hilb sorts all tuples of T in ascending order and bucketizes the ordered tuples according to their SA values. Fig. 3a shows a simple example of the original Hilb. Suppose t1–t8 is the ascending order. All eight tuples are bucketized into six buckets (because there are six distinct SA values) and ordered ascendingly from t1 to t8 based on their QT values.

Second, Hilb greedily splits out one QI-group G1 at a time by picking |G1| (initially ℓ) tuples from |G1| distinct buckets with the lowest QT in G (initially T), and progressively increments |G1| by one whenever |G| − |G1| < ℓ·Smax(G\G1). This greedy step has to stop when |G1| > m, where m is the number of buckets. At this point, Hilb enters the roll-back step by restoring |G1| to ℓ; it then runs in a fashion similar to Anatomy [30], i.e., picking |G1| tuples from the |G1| largest buckets (in terms of the number of remaining tuples in the bucket), and produces the QI-group G1 if |G| − |G1| ≥ ℓ·Smax(G\G1). Such a QI-group G1 can be proved to always exist at the roll-back step by incrementing |G1| progressively by one each time.


Fig. 3. Fix Hilb ℓ-diversity (ℓ = 2). (a) Original Hilb: Group1 = {t1, t2}, Group2 = {t3, t4}, Group3 = {t5, t7}, Group4 = {t6, t8}. (b) Hilb+: Group1 = {t1, t2}, Group2 = {t3, t4}, Group3 = {t5, t6, t7, t8}.


Once such a G1 is found, Hilb generates it and returns to the greedy step if any tuples remain in the buckets.

Fig. 3a illustrates an example of Hilb. Hilb starts by splitting out G1 = {t1, t2}, with the lowest QT in G = T, from two distinct buckets. G1 is published as Group1 because the remaining tuples can achieve 2-diversity, i.e., |G| − |G1| = 6 ≥ ℓ·Smax(G\G1) = 2×2 = 4 holds. However, after Group2 = {t3, t4} is generated for the same reason, Hilb cannot split out any QI-group from the remaining G = {t5, t6, t7, t8} in the greedy step, even by progressively incrementing |G1| until it reaches m = 6. The reason is that when |G1| < 6, it is impossible to find a QI-group G1 with the lowest QT such that |G| − |G1| ≥ ℓ·Smax(G\G1). When |G1| is incremented to 6, Hilb enters the roll-back step by restoring |G1| = 2. It locates |G1| distinct buckets with the largest number of tuples, and generates {t5, t7}, picked from these buckets, as Group3. The remaining {t6, t8} is then published as Group4.

As we can see, the problem of Hilb is that it consults unpublished QI–SA correlation (i.e., Smax(G\G1)) at the time of incrementing |G1|, as well as at the time of transitioning from the greedy step to the roll-back step when |G1| = m. For example, focus on G = {t5, t6, t7, t8} and consider the greedy step when |G1| = 3. Smax(G\G1) = Smax({t8}) = 1 is the only information consulted by Hilb to fail the trial of splitting out G1 = {t5, t6, t7} and to drive |G1| to 4. However, Smax({t8}) = 1 can never be recovered from the published QI-groups in Fig. 3a. Therefore, like Mondrian, Hilb violates the second condition of Theorem 5.3, while the first condition is automatically satisfied.

Our basic idea for altering Hilb to Hilb+ via global look-ahead is to split out a QI-group G1 from G only when the remaining tuples G\G1 can achieve ℓ-diversity in the worst-case scenario, i.e., |G| − |G1| ≥ ℓ·Smax(G). Unlike Smax(G\G1), Smax(G) can be recovered from the published data no matter whether G1 is ultimately split out or not. Fig. 3b illustrates the procedure of Hilb+. Group1 and Group2 are generated sequentially in the same fashion because |G| − |G1| = 8 − 2 = 6 ≥ ℓ·Smax(G) = 2×2 = 4 and |G| − |G1| = 6 − 2 = 4 ≥ ℓ·Smax(G) = 4 are satisfied, respectively. In other words, the remaining tuples after Group1 (or Group2) is split out can satisfy 2-diversity in the worst case. Unlike Hilb in Fig. 3a, Hilb+ chooses to publish the entire G = {t5, t6, t7, t8} as Group3, rather than split out any G1 from it. The reason is that the remaining tuples could not satisfy 2-diversity in the worst case, i.e., |G| − |G1| = 2 < ℓ·Smax(G) = 2×2 = 4. Note that at this point, Hilb+ does not need to greedily repeat the process by incrementing |G1|, because there is no |G1′| such that |G1′| > |G1| and, meanwhile, |G| − |G1′| ≥ ℓ·Smax(G) can be satisfied. By clinging to Smax(G) to look ahead to a worst-case scenario, Hilb+ satisfies the second condition as well as the first condition of Theorem 5.3. Therefore, we have the following theorem:

Theorem 6.2. The Hilb+ ℓ-diversity algorithm is simulatable.

Algorithm 2 describes the details of the Hilb+ ℓ-diversity algorithm. Lines 1–4 pre-process the microdata T by Hilbert curve transformation, then sort and bucketize the tuples as in the original Hilb. Lines 6–14 describe the greedy procedure of splitting out a QI-group G1 from G. Line 7 implements the global look-ahead. Since the unpublished QI–SA correlation S_max(G\G1) is never used, there are no trials to increment |G1| or to invoke the roll-back step, as explained before. Following the analysis of Hilb in [11], the overall time complexity of Algorithm 2 is at most O(n log n), where n is the number of tuples in T.

Algorithm 2. Simulatable Hilb+ algorithm.

1: QIGroup ← ∅. G ← {T}.
2: Apply the Hilbert curve to transform the multi-dimensional QI space of G into a one-dimensional space QT. Sort all tuples in G in ascending order of QT.
3: Split the sorted tuples into m buckets based on SA values.
4: frontier F ← set of the first record in each bucket.
5: repeat
6:   |G1| ← ℓ.
7:   if (|G| − |G1|) < ℓ · S_max(G) then
8:     |G1| ← |G|.
9:   end if
10:  G1 ← set of |G1| tuples of F with the lowest QT.
11:  G ← G\G1.
12:  Update F.
13:  QIGroup ← QIGroup ∪ G1.
14: until G = ∅.
15: return QIGroup.
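To make the grouping loop concrete, the following is a minimal sketch of Algorithm 2, assuming tuples have already been mapped to one-dimensional Hilbert values. The function name `hilb_plus_groups` and the `(qt, sa)` input format are our own illustration, not part of the authors' implementation:

```python
from collections import defaultdict

def hilb_plus_groups(tuples, l):
    """Greedy Hilb+ grouping (sketch of Algorithm 2).

    `tuples` is a list of (qt, sa) pairs, where qt is a tuple's
    one-dimensional Hilbert value and sa its sensitive value.
    Returns a list of QI-groups, each a list of (qt, sa) pairs.
    """
    # Lines 2-3: sort by QT and bucketize by SA value.
    buckets = defaultdict(list)
    for qt, sa in sorted(tuples):
        buckets[sa].append((qt, sa))

    groups = []
    remaining = len(tuples)
    while remaining > 0:
        # Global look-ahead (Line 7): S_max over ALL remaining tuples,
        # which is recoverable from the published groups.
        s_max = max(len(b) for b in buckets.values())
        size = l if remaining - l >= l * s_max else remaining
        # Line 10: take the frontier tuples with lowest QT, one per
        # bucket, so the SA values of the split-out group are distinct.
        frontier = sorted((b[0], sa) for sa, b in buckets.items() if b)
        group = []
        for rec, sa in frontier[:size]:
            group.append(buckets[sa].pop(0))
        # Line 8: if the whole remainder must be published, flush it.
        if size == remaining:
            for sa in list(buckets):
                group.extend(buckets[sa])
                buckets[sa].clear()
        groups.append(group)
        remaining -= len(group)
    return groups
```

On eight tuples with four SA values of frequency 2 each and ℓ = 2, this sketch splits out two groups of size 2 and publishes the remaining four tuples as one group, consistent with the split pattern described for Fig. 3b.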

6.3. Local look-ahead

Global look-ahead addresses a large class of existing algorithms that perturb the QI attributes of the microdata while leaving SA intact. Local look-ahead, on the other hand, deals with existing algorithms that perturb the SA values before publishing a microdata table.

Fig. 4. Fix MASK (m = 3). (a) k-anonymity (k = 4). (b) Original MASK. (c) MASK+.

Before perturbing SA, the existing algorithms (e.g., MASK [12]) need to first check the original SA distribution of a QI-group, in order to determine whether the privacy guarantee (e.g., ℓ-diversity) is violated. Unfortunately, as discussed in Section 5.2, when the privacy guarantee is indeed violated by the original distribution, the checking process itself uses QI–SA correlation that is never published, thereby incurring algorithm-based disclosure. Local look-ahead enables querying the original SA distribution without using any QI–SA information that will not eventually be published. To achieve this, local look-ahead retrieves the frequency of one SA value at a time, after ensuring that the retrieval can never lead to a violation of the privacy guarantee, even in the worst-case scenario. To make this concrete, we demonstrate below how to use local look-ahead to eliminate algorithm-based disclosure from the m-confidentiality algorithm MASK [12].

We start with a brief review and an example of the original MASK algorithm. As discussed in Section 2.2, to achieve m-confidentiality, MASK first applies an existing k-anonymization algorithm (e.g., [9,26,29]) to partition the microdata table into a number of k-anonymous QI-groups, and then perturbs the SA distribution of the QI-groups which violate ℓ-diversity with ℓ = m. The parameter k must be at least m and can be specified as an input parameter.6 Figs. 4a and b illustrate an example of the two steps when m = 3. After the first step, Group 1: {t4, t5, t7, t8} and Group 2: {t1, t2, t3, t6} are two QI-groups which satisfy k-anonymity with k = 4. In the second step, since only one QI-group, Group 1, violates 3-confidentiality (as |Group 1| = 4 < m · S_max(Group 1) = 3 × 2 = 6, where S_max(·) is the maximum frequency of an SA value in a given QI-group), MASK perturbs the SA distribution of Group 1 to satisfy 3-confidentiality while publishing Group 2 as is, as shown in Fig. 4b. One can see that S_max(Group 1) = 2, which was used to decide the perturbation of SA for Group 1, is not eventually published in Fig. 4b. Thus, MASK consults unpublished QI–SA correlation and therefore violates the second necessary condition of ASAP (i.e., Theorem 5.4).

6 We assume ⌊|G|/m⌋ · |D_S| ≥ |G| for any k-anonymous QI-group G, because otherwise MASK cannot achieve m-confidentiality.

To fix the problem, we transform MASK into MASK+ by local look-ahead. A key observation here is that, when deciding whether to perturb the SA distribution of a QI-group G, one cannot directly query S_max(G) because, whenever the returned result violates m-confidentiality, ASAP is already violated. To address this problem, local look-ahead progressively queries the frequency of each SA value in ascending order of its frequency (i.e., from the least to the most frequent SA value). To avoid reading any SA frequency that violates m-confidentiality in G, after each query we "look ahead" to a worst-case scenario of the SA frequencies yet to be read, and stop if that worst-case scenario violates m-confidentiality.

Formally, let D_S be the domain of SA and f_1, …, f_{|D_S|} be the frequencies of SA values in ascending order, i.e., f_i is the number of tuples in a given QI-group G which feature the i-th least frequent SA value s_i ∈ D_S, i ∈ [1, |D_S|] (ties broken arbitrarily). Let f_i = 0 for any s_i not appearing in G. We assume in the following that |D_S| > m holds (as a side note, the case |D_S| = m is trivial, because each SA value from D_S must then occur in each QI-group with relative frequency exactly 1/m). Note that MASK+ can always read f_1 when |D_S| > m, because f_1 never violates m-confidentiality; otherwise, all the other SA frequencies would violate m-confidentiality, leading to Σ_{i=1}^{|D_S|} f_i > |G|, which contradicts the fact that Σ_{i=1}^{|D_S|} f_i = |G|. Suppose f_i is the next SA frequency to be read. MASK+ reads f_i if and only if (|G| − Σ_{j=1}^{i−1} f_j − f_{i−1}) / (|D_S| − i) ≤ ⌊|G|/m⌋. Otherwise, MASK+ switches to perturbing the SA distribution of G, which shall be discussed next. Since f_i ≥ f_{i−1}, the quantity (|G| − Σ_{j=1}^{i−1} f_j − f_{i−1}) / (|D_S| − i) essentially defines the worst-case scenario, because

f_i ≤ (|G| − Σ_{j=1}^{i−1} f_j − f_i) / (|D_S| − i) ≤ (|G| − Σ_{j=1}^{i−1} f_j − f_{i−1}) / (|D_S| − i) ≤ ⌊|G|/m⌋.   (21)

As such, no information used by MASK+ implies a violation of m-confidentiality.
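The reading rule above can be sketched as follows. `lookahead_read` is a hypothetical helper (not from the paper's implementation) that returns how many ascending SA frequencies can be read before the worst-case completion of the remaining frequencies could violate m-confidentiality:

```python
def lookahead_read(freqs_asc, m, size):
    """Local look-ahead (sketch): read SA frequencies in ascending
    order, stopping before any read could reveal a violation of
    m-confidentiality. `freqs_asc` lists the |D_S| SA frequencies of a
    QI-group of `size` tuples, sorted ascendingly; |D_S| > m assumed.
    Returns the number of frequencies read safely (always >= 1)."""
    d = len(freqs_asc)
    bound = size // m          # floor(|G|/m)
    read = 1                   # f_1 is always safe when |D_S| > m
    for i in range(2, d):      # f_i is the next frequency to read, i <= |D_S|-1
        prefix = sum(freqs_asc[:i - 1])          # f_1 + ... + f_{i-1}
        worst = (size - prefix - freqs_asc[i - 2]) / (d - i)  # Eq. (21)
        if worst > bound:
            break              # worst case could violate m-confidentiality
        read = i
    return read
```

On the running example (frequencies ⟨0, 0, 0, 0, 2, 2⟩, m = 3, |G| = 4) the sketch reads exactly two frequencies, matching the behavior described in the text.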

We now illustrate this process by an example. Consider the 4-anonymous Group 1 in Fig. 4a. It has the following ascending order of SA frequencies: ⟨f_AIDS, f_diabetes, f_heart disease, f_mumps, f_cancer, f_gastritis⟩ = ⟨0, 0, 0, 0, 2, 2⟩. MASK+ can always safely read the 1st least frequent SA value, i.e., f_AIDS = 0, due to |D_S| > m. MASK+ can also safely read the 2nd least frequent SA value, f_diabetes = 0, because the worst-case scenario cannot exceed ⌊|G|/m⌋ = ⌊4/3⌋ = 1 (i.e., (|G| − f_AIDS − f_AIDS)/(|D_S| − 2) = 4/4 = 1). In other words, there is no chance for AIDS or diabetes in Group 1 to violate 3-confidentiality. Nonetheless, MASK+ will not read the 3rd least frequent SA value, f_heart disease, because (|G| − f_AIDS − f_diabetes − f_diabetes)/(|D_S| − 3) = 4/3 > ⌊4/3⌋, which means that heart disease in Group 1 may violate 3-confidentiality in the worst-case scenario. At this point, MASK+ decides to perturb the SA distribution of Group 1.

We now consider how MASK+ perturbs the SA distribution. An important requirement here is that all information "read" in local look-ahead must be derivable from the perturbed (and therefore published) SA distribution. Again, let ⟨f_1, …, f_{|D_S|}⟩ be the ascending order of SA frequencies in a QI-group G. Suppose f_i is the last SA frequency successfully read by MASK+ before it decides to perturb SA in G. The QI-group G′ perturbed from G must be such that its top i least frequent SA values and their frequencies, which constitute the only information used in SA perturbation, remain exactly the same as in the original G.

To achieve this, the SA perturbation process can be described as follows. Let f′_j (j ∈ [1, |D_S|]) be the published frequency of the SA value corresponding to f_j. To start, we make no change to f_1, …, f_i and assign the value of f_i to f′_{i+1}, …, f′_{|D_S|}, i.e., ⟨f′_1, …, f′_i, f′_{i+1}, …, f′_{|D_S|}⟩ = ⟨f_1, …, f_i, f_i, …, f_i⟩. If |G| − Σ_{k=1}^{|D_S|} f′_k > 0 holds, then, following a descending order from |D_S| to i + 1, f′_j is iteratively updated to min(⌊|G|/m⌋, |G| − Σ_{k=1}^{|D_S|} f′_k + f_i) until |G| − Σ_{k=1}^{|D_S|} f′_k = 0. At this point, the published SA distribution satisfies m-confidentiality and retains all the information used by local look-ahead.

Returning to the running example of Group 1 in Fig. 4a, recall that MASK+ decides to perturb SA after reading f_AIDS = 0 and f_diabetes = 0. In the perturbation process, MASK+ starts with ⟨f′_AIDS, f′_diabetes, f′_heart disease, f′_mumps, f′_cancer, f′_gastritis⟩ = ⟨0, 0, 0, 0, 0, 0⟩. Since |G| − Σ_{k=1}^{|D_S|} f′_k = 4 > 0, it updates f′_gastritis to min(⌊|G|/m⌋, |G| − Σ_{k=1}^{|D_S|} f′_k + f_i) = min(⌊4/3⌋, 4 − 0 + 0) = 1. This procedure is repeated until f′_mumps is updated to 1, after which |G| − Σ_{k=1}^{|D_S|} f′_k = 4 − 4 = 0. Fig. 4c shows the result of applying MASK+ to Fig. 4a. Unlike with MASK, the only information used by MASK+ in perturbing SA, i.e., f_AIDS = 0 and f_diabetes = 0, can still be readily learned from Fig. 4c.
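The perturbation step can be sketched as follows, assuming the footnote-6 condition ⌊|G|/m⌋ · |D_S| ≥ |G| (which guarantees termination); `perturb_sa` and its interface are illustrative only:

```python
def perturb_sa(freqs_asc, i, m, size):
    """SA perturbation of MASK+ (sketch). `freqs_asc` holds the
    ascending SA frequencies f_1..f_|D_S| of a QI-group of `size`
    tuples; `i` is the 1-based index of the last frequency read by
    local look-ahead. Returns the published frequencies f'_1..f'_|D_S|."""
    d = len(freqs_asc)
    bound = size // m                      # floor(|G|/m)
    # Keep f_1..f_i unchanged; set every later frequency to f_i.
    pub = freqs_asc[:i] + [freqs_asc[i - 1]] * (d - i)
    # Redistribute the remaining mass from the most frequent slot down;
    # terminates under the footnote-6 assumption floor(|G|/m)*|D_S| >= |G|.
    j = d - 1
    while size - sum(pub) > 0:
        pub[j] = min(bound, size - sum(pub) + freqs_asc[i - 1])
        j -= 1
    return pub
```

On the running example it reproduces the published distribution of Fig. 4c: the two read frequencies stay at 0, and the four remaining SA values each receive frequency 1.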

Algorithm 3. Simulatable MASK+ m-confidentiality algorithm.

1: QIGroup ← ∅.
2: D_S ← the SA domain of T.
3: G ← k-anonymous groups obtained by applying any k-anonymization algorithm on T, where k is defined by the user.
4: repeat
5:   G1 ← the next k-anonymous group from G.
6:   ⟨f_1, …, f_{|D_S|}⟩ ← the ascending SA frequency order in G1.
7:   i ← 1.
8:   repeat
9:     i ← i + 1.
10:  until i = |D_S| − 1 or (|G1| − Σ_{j=1}^{i−1} f_j − f_{i−1}) / (|D_S| − i) > ⌊|G1|/m⌋.
11:  if i < |D_S| − 1 then
12:    Perturb SA in G1 to G′1 such that the top i − 1 least frequent SA values and their frequencies in G1 remain exactly the same in G′1.
13:    G1 ← G′1.
14:  end if
15:  QIGroup ← QIGroup ∪ G1.
16: until G = ∅.
17: return QIGroup.

Algorithm 3 describes our simulatable MASK+ algorithm, altered from MASK to achieve m-confidentiality.

We now discuss why MASK+ satisfies both conditions of Theorem 5.4. The satisfaction of the first condition is straightforward, because MASK+ uses only the original QI information in its QI perturbation. We discuss the satisfaction of the second condition as follows.

Note that the SA information used by MASK+, i.e., the real frequencies of a number of SA values as well as their frequency order, is actually published by the algorithm, for the following reason. Consider m-confidentiality as the privacy guarantee. What MASK+ uses in terms of SA information includes two parts: (1) safe-value frequencies, i.e., the real frequencies of a number of SA values which MASK+ reads with a "safety" guarantee of not violating m-confidentiality (we refer to these SA values as Type-1 values), and (2) unsafe-value identities, i.e., the fact that the other SA values (which we refer to as Type-2 values) have frequencies greater than or equal to the frequencies of the SA values read by MASK+. According to the SA-perturbation process used by MASK+, both types of information can be readily learned from the published table: the frequencies of Type-1 SA values remain the same in the published table, while the frequencies of Type-2 SA values remain greater than or equal to the frequencies of all Type-1 SA values. Thus, even the order information used by MASK+ is published in the perturbed table. As such, MASK+ does not violate the conditions set forth by Theorem 5.4. Hence, we have the following theorem:

Theorem 6.3. The MASK+ m-confidentiality algorithm is simulatable.

The time complexity of MASK+ is O(n log n), where n is the number of tuples in T. We analyze it as follows. Since MASK+ uses k-anonymization as a black box, the complexity of the QI perturbation part is O(n log n) when a k-anonymization algorithm such as Mondrian [8] or Hilb [11] is used. The number of iterations (Lines 8–10) is at most |V|, where |V| is the number of k-anonymous QI-groups. Line 12 has complexity O(|D_S|). Since n > |V| and n > |D_S|, the overall time complexity is O(n log n).

7. Enhancing data utility

Although the two tools discussed in the previous section, global look-ahead and local look-ahead, suffice to dismiss algorithm-based disclosure, their drawback is that data utility may be reduced as the cost of enforcing a "more stringent" privacy guarantee. In this section, we introduce stratified pick-up, another tool to enhance utility.

Stratified pick-up takes as input the anonymous QI-groups produced by any simulatable algorithm and tries to further partition each of these groups greedily, based solely on the distinctness of SA values. The design of this phase follows two objectives: (1) the algorithm should remain simulatable, and (2) the size of each output QI-group should be minimized.

In particular, a simple solution that achieves both objectives is to apply Anatomy [10] to each QI-group generated by the simulatable algorithm. Note that, as discussed in Section 5.3, Anatomy does not perturb SA values and does not use any QI–SA correlation beyond what will eventually be published. Thus, this solution satisfies Theorem 5.3. For example, we apply stratified pick-up to Group 1 generated by our Mondrian+ algorithm (in Fig. 5a). A possible output after performing stratified pick-up on Group 1 is shown in Fig. 5b, i.e., {{t4, t8}, {t5, t7}}, which has obviously higher utility than publishing Group 1 as in Fig. 5a. In contrast, Group 2 in Fig. 5a is already of minimal size for 2-diversity, and thus stratified pick-up publishes it as is in Fig. 5b. Formally, we have the following theorem on the output group size after stratified pick-up.

Theorem 7.1. Each ℓ-diversity QI-group output by stratified pick-up contains ℓ′ tuples, where ℓ′ ∈ [ℓ, 2ℓ), and each SA value in it is unique.

Algorithm 4. Stratified pick-up.

1: QIGroup ← ∅.
2: InputSet ← anonymous groups from any simulatable algorithm.
3: repeat
4:   G ← the next anonymous group from InputSet.
5:   InputSet ← InputSet\G.
6:   if ℓ ≤ |G|/2 then
7:     {g_1, …, g_p} ← Anatomy(G, ℓ).
8:     QIGroup ← QIGroup ∪ {g_1, …, g_p}.
9:   else
10:    QIGroup ← QIGroup ∪ G.
11:  end if
12: until InputSet = ∅.
13: return QIGroup.

Fig. 5. Apply stratified pick-up on Mondrian+ ℓ-diversity (ℓ = 2). (a) Mondrian+. (b) Mondrian++.

Details of stratified pick-up are shown in Algorithm 4. We test the condition ℓ ≤ |G|/2 in Line 6 because, if an ℓ-diversity group G has fewer than 2ℓ tuples, G must have |G| distinct SA values and cannot be further partitioned. Such a minimal group G can be directly added to the output (Line 10).
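A minimal sketch of stratified pick-up on a single group, using Anatomy-style bucketing in place of the actual Anatomy implementation of [10]; `stratified_pickup` and `sa_of` are hypothetical names:

```python
from collections import defaultdict

def stratified_pickup(group, l, sa_of):
    """Stratified pick-up (sketch of Algorithm 4). Splits one
    l-diverse QI-group into subgroups whose SA values are all
    distinct; `sa_of` maps a tuple to its SA value."""
    if l > len(group) // 2:
        return [list(group)]  # already minimal: publish as is (Line 10)
    buckets = defaultdict(list)
    for t in group:
        buckets[sa_of(t)].append(t)
    subgroups = []
    # Group-creation step: pick one tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted(buckets, key=lambda s: len(buckets[s]), reverse=True)[:l]
        subgroups.append([buckets[s].pop() for s in largest])
    # Residue step: each leftover tuple joins a subgroup lacking its SA value
    # (such a subgroup exists for l-eligible inputs, as argued in [10]).
    for s, b in buckets.items():
        for t in b:
            g = next(g for g in subgroups if s not in {sa_of(x) for x in g})
            g.append(t)
    return subgroups
```

On a 4-tuple group with two SA values of frequency 2 each and ℓ = 2, this yields two groups of size 2 with distinct SA values, mirroring the {{t4, t8}, {t5, t7}} example above.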

Theorem 7.2. If the algorithm which generates the input to stratified pick-up is simulatable, then the algorithm combined with stratified pick-up remains simulatable.

The efficiency of stratified pick-up depends on Anatomy. Following the results of [10], the time complexity of stratified pick-up is O(n) and the I/O cost is O(l), where n is the total number of tuples and l is the number of distinct SA values.

Take Mondrian as an example. Algorithms 5 and 6 detail hybrid algorithms integrating our two tools: (global) look-ahead (Line 2, i.e., Algorithm 1) and stratified pick-up (Line 3, i.e., Algorithm 4). The only difference between them is their use of the generalization and bucketization publishing schemes, respectively. In the same fashion, it is easy to develop the hybrid versions Hilb++ and MASK++. We will test them in the next section.

Algorithm 5. Mondrian++.

1: DoneSet ← ∅. InputSet ← {T}.
2: DoneSet ← Mondrian+(InputSet, ℓ).
3: DoneSet ← Stratified_Pickup(DoneSet, ℓ).
4: return DoneSet.

Algorithm 6. Mondrian++ (in bucketization scheme).

1: DoneSet ← ∅. InputSet ← {T}.
2: DoneSet ← Mondrian+(InputSet, ℓ).
3: DoneSet ← Stratified_Pickup(DoneSet, ℓ).
4: Publish the original QI (without any generalization) for each group in DoneSet.
5: return DoneSet.


8. Experiment

In this section, we describe our experimental setup, compare the data utility of our simulatable algorithms with that of the existing ℓ-diversity algorithms, evaluate the impact of our tools (global/local look-ahead and stratified pick-up), and evaluate the extent of algorithm-based disclosure in MASK and Hilb, respectively.

8.1. Experimental setup

8.1.1. Hardware

All experiments were conducted on a machine with an Intel Core 2 Duo 2.6 GHz CPU, 2 GB of RAM, and Windows XP. All our algorithms were implemented in C++.

8.1.2. Datasets

We conducted the experiments on two datasets: Adult, from the UCI Machine Learning Repository, and Census, from http://ipums.org, both of which have been extensively used as benchmarks in the literature. For the Adult dataset, we removed all tuples with missing values to obtain a set of 45,222 tuples. For the Census dataset, we followed the procedure in [10] to sample 300,000 tuples without replacement as our testing bed. Their schemas are summarized in Table 6.

8.1.3. Utility measure

We adopted the relative error measure proposed in [10]. Consider a query workload of the form:

SELECT COUNT(*) FROM Dataset
WHERE pred(Q_1), …, pred(Q_qd), pred(S)

where qd is the query dimension and pred(Q_i) (resp. pred(S)) denotes the predicate of Q_i (resp. S) belonging to a range of randomly generated values in its domain. The cardinality of the range is determined by a parameter called selectivity. Let Act and Est be the query results from the microdata table T and the published table T*, respectively. The relative error is defined as |Act − Est|/Act. For each set of experiments, we ran a workload of 10,000 queries and calculated the average relative error as the utility measure.

Table 6
The attributes and their domains in our experiments.

Attribute        Domain size   Type               Height

(a) Adult dataset
Age              74            Ranges 5, 10, 20   4
Work class       7             Taxonomy tree      3
Marital status   7             Taxonomy tree      3
Occupation       14            Taxonomy tree      2
Race             5             Taxonomy tree      2
Sex              2             Suppression        1
Country          41            Taxonomy tree      3
Salary           2             Suppression        1
Education        16            Sensitive attr.    –

(b) Census dataset
Age              79            Numerical          –
Gender           2             Suppression        1
Education        17            Numerical          –
Marital status   6             Taxonomy tree      3
Race             9             Taxonomy tree      2
Work class       10            Taxonomy tree      4
Country          83            Taxonomy tree      3
Occupation       50            Sensitive attr.    –
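The utility measure can be sketched as follows; `act_count` and `est_count` stand in for evaluating a COUNT(*) query on T and T*, respectively, and skipping queries whose true answer is zero is our assumption, not specified in the text:

```python
def avg_relative_error(queries, act_count, est_count):
    """Average relative error |Act - Est| / Act over a query workload,
    following the measure of [10]. `act_count(q)` / `est_count(q)`
    evaluate query q on the microdata and the published table."""
    errs = []
    for q in queries:
        act, est = act_count(q), est_count(q)
        if act > 0:  # assumption: queries with empty true answers are skipped
            errs.append(abs(act - est) / act)
    return sum(errs) / len(errs)
```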

8.2. Evaluation of MASK++

We used the same settings as [12]: the Adult dataset with attribute Education as the SA, where all values below high school, i.e., "Preschool", "1st–4th", "5th–6th", and "7th–8th", are sensitive. The total number of sensitive tuples is 1566. MASK features two main parameters, k and m, for its two steps: k-anonymization and SA perturbation, respectively. We tested varying values of k ∈ [5, 8] and m ∈ [2, 5].

Before evaluating our MASK++, we first tested the extent of algorithm-based disclosure in MASK [12]. Then, we compared our MASK++ (integrated with both local look-ahead and stratified pick-up) against the original MASK [12]. For fairness of comparison, we did not compare MASK++ with other data publishing algorithms, because the authors' implementation (http://www.cse.ust.hk/~raywong/code/cred.zip) does not address the case when all the SA values are sensitive.

8.2.1. Algorithm-based disclosure of MASK

Consider a simple attack based on a negative association rule: "few (e.g., less than 10%) people under or at the age of 25 have Ph.D. degrees". This rule can be drawn either intuitively or from many educational surveys. Recall from Section 2 that MASK incurs algorithm-based disclosure if a published group violates such a rule, because in that case an adversary can infer that any individual in that group has probability over 1/m of having a sensitive SA value. As such, our attack can be summarized as follows: if any "Age ≤ 25" group in the published table has over 10% of its SA values equal to "Ph.D.", we label it as a problematic group and infer that it originally violated m-confidentiality. That is, the education level of each individual in that group can be inferred, with probability greater than 1/m, to be below high school. We refer to this attack as the PhD-25 attack.

In our experiments, we found that the PhD-25 attack never generates any false positives. That is, it never mislabels as problematic a group whose SA was not perturbed. Thus, the confidence of the PhD-25 attack in predicting a violation of m-confidentiality is 100%. For the recall of this attack, we counted the number of tuples compromised by the PhD-25 attack.

First, we tested the extent of algorithm-based disclosure by varying the two parameters of MASK: k and m. Fig. 6a shows the number of compromised tuples when k ∈ [5, 8] and m ∈ [2, 5]. One can see that, for a given k, the number of compromised tuples increases with m. This is because a larger m requires more groups to have their SA perturbed after k-anonymization, thus increasing the probability of "Ph.D." being added to an "Age ≤ 25" group. As a result, more tuples are compromised under the PhD-25 attack. On the other hand, for a given m, a larger k does not necessarily increase the number of compromised tuples.

Fig. 6. Adult dataset, algorithm-based disclosure of MASK.

This is because the value of k is generally independent of the probability of a QI-group violating m-confidentiality.

Second, we tested the extent of algorithm-based disclosure by varying the percentage of sensitive tuples in the microdata table (through stratified sampling with replacement). Fig. 6b depicts the number of compromised tuples when the percentage ranges from 10% to 20%. We set k = 5 and m ∈ [2, 5]. One can see from the figure that more tuples are compromised as the percentage of sensitive tuples increases. This is because, with more sensitive tuples, more groups are likely to violate m-confidentiality, leading to more "Ph.D." values being added to "Age ≤ 25" groups.

8.2.2. Utility comparison with MASK

We illustrate the results in the generalization scheme in Fig. 7a–e and the results in the bucketization scheme in Fig. 7f–j, respectively. In particular, Fig. 7a investigates the tradeoff between query accuracy and the k-anonymity value when fixing the m-confidentiality value m = 3, the number of QI attributes qi = 8, the query dimension qd = 3, and the selectivity s = 5%. Without changing the other parameters, Fig. 7b sets k = 50 and explores query accuracy while varying m. Fixing k = 50 and m = 3, Fig. 7c–e study query accuracy while varying the number of QI attributes qi, the query dimension qd, and the selectivity s, respectively. We repeated the same evaluation in the bucketization scheme using the same configuration (Fig. 7f–j).

All of the above figures show that our MASK++ outperforms the original MASK in terms of utility. Nonetheless, when k is close to the m-confidentiality value, e.g., k = 3 or 5 in Fig. 7a and f, the superiority of MASK++ over the original algorithm becomes less significant (or even disappears, in the case of the bucketization scheme). The explanation is that a smaller k generates more small k-anonymous groups, and thus the effect of our stratified pick-up is weakened. Another interesting point is that the superiority of MASK++ diminishes when m is larger (see Fig. 7b and g). The reason is that, as discussed in Section 6.3, local look-ahead defines a more "stringent" condition for larger m, and thus our MASK++ tends to perturb SA in more k-anonymous groups, leading to lower query accuracy.

8.2.3. Effects of local look-ahead and stratified pick-up

Analogously to the evaluation in Section 8.3.4, Fig. 7k demonstrates the individual effects of local look-ahead and stratified pick-up. We set qi = 8, qd = 3, sel = 5% and performed the evaluation for k = 10 and k = 50, respectively. As expected, when k = 10, the query accuracy of MASK+ (i.e., integrated only with local look-ahead) is comparable to (no better than) that of the original algorithm. However, MASK++ (i.e., integrated with both local look-ahead and stratified pick-up) achieves better utility than the original MASK. We observed the same pattern for k = 50.

8.2.4. Time performance

Table 7 shows the time performance of our MASK++ against the original MASK, fixing k = 50, qi = 8, qd = 3, sel = 5%. At the cost of eliminating algorithm-based disclosure, MASK++ incorporates local look-ahead, which defines a more "stringent" condition for SA perturbation, leading to SA perturbation in more QI-groups than in the original MASK. Therefore, MASK++ runs slower than MASK, but their performances are still comparable.

8.3. Evaluation of Mondrian++ and Hilb++

In this set of experiments, we used another popular dataset, Census, following the same settings as in [11,10]. We first tested the extent of algorithm-based disclosure in Hilb, the state-of-the-art algorithm for ℓ-diversity. Then, we evaluated Mondrian++ and Hilb++, our simulatable algorithms (with both global look-ahead and stratified pick-up), against the original Mondrian [9] and Hilb [11] in the generalization scheme. For fairness of comparison, we also considered Mondrian++ and Hilb++ in the bucketization scheme (see Section 7) and compared them with the existing simulatable publishing algorithm Anatomy [10]. We then demonstrated the effects of our two tools, global look-ahead and stratified pick-up, individually. Finally, we tested the efficiency.

8.3.1. Algorithm-based disclosure of Hilb

We illustrate a simple attack on Hilb using Fig. 8 as an example. Fig. 8 shows four QI-groups produced by a 2-diversity Hilb algorithm, where t1–t8 represent the complete list of tuples in ascending order along a Hilbert curve defined on the QI space (see Section 6.2.2 for the algorithmic details of Hilb).

Observe from all published QI-groups that the maximum frequency of an SA value in {t1, t2, t3, t4, t5, t6, t7, t8} is 2, i.e., S_max({t1, t2, t3, t4, t5, t6, t7, t8}) = 2. Thus, we are able to


Fig. 7. Adult dataset, MASK++. (a)–(e) Generalization scheme. (f)–(j) Bucketization scheme. (k) Effect of local look-ahead and stratified pick-up.


infer that t1 and t2 must have the same SA value, because otherwise {t1, t2}, with lower information loss than {t1, t3}, would have formed Group1, since the remaining tuples {t3, t4, t5, t6, t7, t8} would satisfy 2-diversity anyway (as S_max({t3, t4, t5, t6, t7, t8}) = S_max({t1, t2, t3, t4, t5, t6, t7, t8}) = 2). By intersecting the SA values of the published Group1 and Group2, we can safely conclude that t1 and t2 must have the SA value "gastritis". Similarly, we can derive that t3 and t4 must have "heart disease" and "diabetes", respectively. Hence, the number of compromised tuples in this example is 4.

A more detailed description of our attack is given in Algorithm 7. The main idea of the attack is based on the correlation between QI-groups generated by the deterministic grouping of Hilb (i.e., without performing the fall-back step). Lines 1–2 sort the published QI-groups in ascending order of the lowest QT value among their members. For example, the four published QI-groups in Fig. 8 are sorted as follows: {t1, t3} →_QT {t2, t4} →_QT {t5, t7} →_QT {t6, t8}, where →_QT denotes the partial order.

Algorithm 7. Algorithm-based disclosure attack against Hilb.

1: G ← all QI-groups published by Hilb.
2: Calculate QT for each tuple in G, where QT is the one-dimensional QI space transformed by the Hilbert curve. Order the QI-groups in G in ascending order of the lowest QT in each group.
3: repeat
4:   G1 ← the first QI-group in G.
5:   if (|G| − |G1|) ≥ ℓ · S_max(G) then  // A sufficient condition to find deterministic QI-groups.
6:     G̃ ← all QI-groups G2 such that G2 ∈ G\G1 and there exists a tuple t ∈ G2, t ∉ G1, whose QT falls into the range of QT values in G1.
7:     if G̃ ≠ ∅ and |G1| + 1 < ℓ · 2 then
8:       Label G1 as a vulnerable group.
9:     end if
10:  else
11:    Go to Line 14.  // Early termination.
12:  end if
13:  G ← G\G1.
14: until G = ∅.
15: return.

Table 7
Time performance of MASK++.

                   MASK   MASK++   MASK++ (in bucketization)
Time (in seconds)  65.7   66.3     66.3

Fig. 8. A simple attack on Hilb (2-diversity).

Fig. 9. Census data, algorithm-based disclosure of Hilb.

Line 5 suffices to guarantee that all QI-groups attacked by our algorithm were generated by deterministic grouping. As explained in Section 6.2.2, if the remaining tuples (i.e., G\G1) after generating G1 can achieve ℓ-diversity in the worst-case scenario, then randomized grouping would never be initiated by Hilb, due to its utility concern. For the same reason, we can easily verify that Group1 and Group2 in Fig. 8 must have been generated by deterministic grouping.

Observe from Hilb that, given any deterministic QI-group G1, the SA value of a tuple t ∉ G1 must appear in G1 if t was not assigned to any group prior to G1 and the QT of t falls exactly into the QT range of G1. Otherwise, t would have been included when generating G1, leading to lower information loss. Therefore, the individual SA values in G1 may be compromised by intersecting the SA values of G1 and G2. Lines 6–7 describe this procedure. Note that Line 7 captures the case when G1 ∪ {t} violates ℓ-diversity: since the SA values in each QI-group generated by Hilb are distinct from each other, the maximum frequency of an SA value in G1 ∪ {t} is 2. Then we have

2 / (|G1| + 1) > 1/ℓ  ⇒  |G1| + 1 < ℓ · 2

when G1 ∪ {t} violates ℓ-diversity. Let m be the number of QI-groups generated by Hilb. It is easy to see that the time complexity of Algorithm 7 is O(m²).
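The labeling logic of Algorithm 7 can be sketched as follows on toy data; the group representation (lists of `(qt, sa)` pairs) and the helper name are hypothetical, and the SA assignment in the test merely mimics the structure of Fig. 8:

```python
def label_vulnerable(groups, l):
    """Sketch of Algorithm 7: label published Hilb QI-groups whose SA
    values may be inferred. Each group is a list of (qt, sa) pairs."""
    # Lines 1-2: order groups ascendingly by their lowest QT.
    groups = sorted(groups, key=lambda g: min(qt for qt, _ in g))
    vulnerable = []
    while groups:
        g1, rest = groups[0], groups[1:]
        remaining = [t for g in groups for t in g]
        n = len(remaining)
        s_max = max(sum(1 for _, sa in remaining if sa == v)
                    for v in {sa for _, sa in remaining})
        # Line 5: sufficient condition for deterministic grouping.
        if n - len(g1) < l * s_max:
            break  # early termination (Line 11)
        lo, hi = min(qt for qt, _ in g1), max(qt for qt, _ in g1)
        # Lines 6-7: a later tuple falls inside g1's QT range, and
        # adding it to g1 would violate l-diversity.
        overlaps = any(lo <= qt <= hi for g in rest for qt, _ in g)
        if overlaps and len(g1) + 1 < l * 2:
            vulnerable.append(g1)
        groups = rest
    return vulnerable
```

On a Fig. 8-like input, the group {t1, t3} is labeled vulnerable because t2 falls inside its QT range, while the last two groups trigger the early-termination condition.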

Fig. 9 shows the extent of algorithm-based disclosure of Hilb on the Census dataset, varying ℓ from 2 to 6. One can see that the number of compromised tuples increases monotonically with ℓ. The reason is that a larger ℓ significantly increases the probability of finding tuples with the same SA value in different yet correlated groups. It is important to note that the percentage of compromised tuples under our attack can be as high as 259,937/300,000 = 86.6% (when ℓ = 6).

8.3.2. Utility comparison with Mondrian and Hilb

We fixed the number of QI attributes qi = 7, the query dimension qd = 3, and the selectivity s = 5%. Fig. 10a illustrates the utility of Mondrian++ and Hilb++ when varying the value of ℓ. It shows that Hilb++ and Mondrian++ not only eliminate algorithm-based disclosure via global look-ahead, but also attain, via stratified pick-up, comparable or even better utility relative to Hilb and Mondrian, respectively.

Next, we set ℓ = 4 and varied qi from 3 to 7. Fig. 10b shows the impact of qi on utility. As we can see, Hilb++ provides utility comparable to Hilb in all cases, whereas, when qi ≤ 5, Mondrian++ achieves lower accuracy than Mondrian. Nonetheless, the accuracy difference decreases as qi increases. This is because Mondrian tends to generate larger groups when there are more QI attributes, which makes the utility improvement by stratified pick-up in Mondrian++ more significant.

Fig. 10c examines the utility of Mondrian++ and Hilb++ when the query dimension qd ranges from 2 to 5, with ℓ = 4, qi = 7, sel = 5%. Fig. 10d investigates the effect of selectivity sel on utility when ℓ = 4, qi = 7, qd = 3. One can see in both figures that Mondrian++ and Hilb++ maintain comparable or, in the case of Mondrian++, significantly better utility.
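The utility metric behind Fig. 10 (relative error of aggregate COUNT queries) can be sketched as follows. This is a minimal illustration under the standard uniformity assumption; the function names and the zero-count handling are our assumptions, not necessarily the paper's exact definition.

```python
def estimate_range_count(groups):
    """Estimate a COUNT range query over generalized groups under the
    uniformity assumption: each group contributes its size times the
    fraction of its generalized QI box that overlaps the query range.

    groups: iterable of (group_size, overlap_fraction) pairs.
    """
    return sum(size * frac for size, frac in groups)

def relative_error(estimate, actual):
    """Relative error of the estimated answer; actual is clamped to 1
    to avoid division by zero (an assumption on our part)."""
    return abs(estimate - actual) / max(actual, 1)

# Example: two groups of sizes 10 and 4, overlapping the query range
# by 50% and 100%, against a true answer of 8.
est = estimate_range_count([(10, 0.5), (4, 1.0)])   # 9.0
err = relative_error(est, 8)                        # 0.125
```

Coarser generalization means larger QI boxes, cruder overlap fractions, and thus higher relative error, which is why tighter groups generally improve utility.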

8.3.3. Utility comparison with Anatomy

We now evaluate the bucketization publishing scheme, where Mondrian++ and Hilb++ were compared against Anatomy [10]. Recall that since Anatomy is simulatable, the objective of employing Mondrian++ and Hilb++ in the bucketization scheme is to provide better utility by taking QI-locality information into account. Using the


[Fig. 10 here: panels plotting relative error against ℓ-value, number of QI attributes, query dimension, and selectivity (%), for Anatomy, Mondrian++ and Hilb++.]

Fig. 10. Census dataset, Mondrian++ and Hilb++. (a)–(d) Generalization scheme. (e)–(h) Bucketization scheme. (i) Effect of global look-ahead and stratified pick-up.

X. Jin et al. / Information Systems 36 (2011) 859–880

same parameter settings as in the previous generalization case, we conducted the experiments shown in Fig. 10e–h. As expected, both Mondrian++ and Hilb++ significantly outperform Anatomy in terms of utility.

8.3.4. Effects of global look-ahead and stratified pick-up

We previously integrated both tools: global look-ahead and stratified pick-up. Now, we demonstrate the effect of each tool on utility separately. We set qi = 7, qd = 3, sel = 5%, and tested the cases where ℓ = 8 and 10. We chose higher ℓ values only for ease of illustration, because Hilb++ achieves almost the same query accuracy as Hilb when ℓ ≤ 7.

To show the effect of our first tool (i.e., global look-ahead), we tested Mondrian+ and Hilb+ in Fig. 10i, which were adapted from Mondrian and Hilb by integrating only the first tool (i.e., without stratified pick-up). We compared them against the original Mondrian and Hilb. As expected, both adapted algorithms achieve lower utility than Mondrian and Hilb, respectively, as the cost of eliminating algorithm-based disclosure.

To show the effect of our second tool (i.e., stratified pick-up), we compared the algorithms integrated with both tools (i.e., global look-ahead and stratified pick-up) against the previously developed algorithms with only the first tool. As we can see from Fig. 10i, stratified pick-up improves utility, leading to utility comparable to or even better than that of Hilb and Mondrian, respectively.

8.3.5. Time performance

We set ℓ to 4. Table 8 depicts the running time of Anatomy, Mondrian, Mondrian++, Hilb and Hilb++. Again, we considered Mondrian++ and Hilb++ in two different schemes, i.e., generalization and bucketization. As expected, the choice of scheme does not affect the time performance of Mondrian++ and Hilb++. Both Mondrian++ and Hilb++ (in either scheme) run much faster than their original algorithms because, as discussed in Section 6.2.1, global look-ahead defines a more "stringent" condition, and thus Mondrian++ and Hilb++ tend to terminate earlier than their original algorithms. Consistent with the time complexity analysis in Section 6, Table 8 confirms that the running time of Hilb++ is lower than that of Mondrian++, but higher than that of Anatomy.

9. Related work

Since the introduction of k-anonymity [1] and ℓ-diversity [2], various privacy models have been proposed, including (a, k)-anonymity [3], personalized privacy [30], t-closeness [4], (k, e)-anonymity [5], (ε, m)-anonymity [31], etc. To achieve these privacy models, researchers


Table 8
Time performance.

Algorithm                       Time (in seconds)
Mondrian                        9.2
Mondrian++                      2.2
Mondrian++ (in bucketization)   2.2
Hilb                            2.1
Hilb++                          0.6
Hilb++ (in bucketization)       0.6
Anatomy                         0.3


studied numerous PPDP algorithms [7–11,26,28,32–40]. Orthogonal to the above anonymization techniques at the tuple level, a recent work [41] studied PPDP with multiple privacy rules by focusing on the schema level.

There has been a large body of work on addressing the threats from external knowledge held by adversaries. Refs. [23,42–44] considered knowledge about an individual or about relationships between individuals. Ref. [45] studied the presence of corruption. Ref. [21] studied negative association rules. Refs. [22,46] studied the privacy disclosure that arises from learning whether a certain individual is present in the database.

The problem of algorithm-based disclosure was introduced by [12,14]. Ref. [12] provided a new privacy model, m-confidentiality, and designed a new algorithm, MASK, to achieve it. However, we have shown in Section 2.2 that the MASK algorithm in [12] is still vulnerable to algorithm-based disclosure. Ref. [14] defined another new privacy model called p-safety to address the problem, but its efficiency is problematic. Differential privacy, proposed in [15], eliminates algorithm-based disclosure by providing a new privacy model and developing a corresponding algorithm. Orthogonal to all the above work, the focus of our paper, as mentioned in the Introduction, is to develop generic tools to adapt existing data publishing algorithms so that they become immune to algorithm-based disclosure. Ref. [47] proposed a k-jump strategy to transform an unsafe algorithm into a large family of distinct safe algorithms, but its time complexity is exponential.

Cormode et al. [48] proposed an elegant "symmetric" method to defend against the minimality attack, concluding by theoretical analysis that the minimality attack can only yield a constant increase in the adversarial belief about individual SA information. While their work is interesting and solid, we argue that our work differs from theirs significantly on the following two key points:

• In terms of the degree of algorithm-based disclosure, we design and evaluate algorithm-based attacks beyond the scope considered in Cormode et al. In particular, an important assumption Cormode et al. make while theoretically quantifying the impact of the minimality attack is that the adversarial knowledge does not involve any SA distribution. As such, their analysis does not apply to algorithm-based disclosures that would occur when an adversary holds such prior knowledge of the SA distribution, e.g., the algorithm-based disclosure of SA-perturbation-based algorithms such as MASK (recall from Section 2 that the algorithm-based disclosure of MASK can be exploited by adversaries with external knowledge such as "Japanese have an extremely low incidence of heart disease"; nonetheless, the disclosure itself is still caused by knowledge of the algorithm, not by the external knowledge). As such, the conclusions in Cormode et al. do not apply to all algorithm-based disclosures considered in our paper. Likewise, the attacks we consider in this paper also go beyond another assumption made by Cormode et al., namely that each QI-group is generated independently from the others. Our attack actually leverages associations between QI-groups which can be learned from knowledge of the publishing algorithm. An example is the attack against Hilb, which we discuss in Section 8.3.1. Since we consider attacks beyond the scope defined by Cormode et al., our conclusion also differs from theirs. In particular, as shown in Section 8.3.1, our experimental results demonstrate that the impact of algorithm-based disclosure can be serious when an adversary exploits QI-group correlations: e.g., 86.6% of the tuples in the Census dataset can be compromised based on knowledge of the Hilb algorithm (in the case of 6-diversity).

• In terms of defending against algorithm-based disclosure (e.g., defense against the minimality attacks considered in Cormode et al.), the aforementioned "symmetric" method in Cormode et al. weakens the ℓ-diversity guarantee to (2ℓ/3)-diversity (as one can see from Theorem 5 in Cormode et al.). On the other hand, both the local and global look-ahead algorithms proposed in our paper fix the algorithm-based disclosure problem without weakening the privacy guarantee. Meanwhile, we also propose stratified pick-up to further improve the utility of the published table.

It is also worth pointing out that we have substantially extended our preliminary version [49]. First, we have proposed a brand new tool called local look-ahead, which is designed to eliminate algorithm-based disclosure from data publishing algorithms based on SA perturbation (e.g., MASK). This stands in contrast with [49], which can only deal with QI-perturbation algorithms that keep SA values intact. We have also added the corresponding experiments to illustrate the effectiveness of local look-ahead. Another significant addition is an evaluation of the extent of algorithm-based disclosure in the original MASK algorithm and in the state-of-the-art ℓ-diversity algorithm Hilb.

10. Conclusion

This paper addressed the problem of algorithm-based disclosure in privacy-preserving data publishing. We proposed Algorithm-SAfe Publishing (ASAP), a novel privacy model which defines the space of algorithm-based disclosure. Two necessary conditions and two sufficient conditions of ASAP were given as a toolset to determine whether



an existing algorithm is vulnerable to algorithm-based disclosure. To eliminate algorithm-based disclosure, we proposed global and local look-ahead, two generic tools for correcting the design of existing algorithms. To enhance utility, we developed another add-on tool: stratified pick-up. We demonstrated the power of our tools by revising the design of three existing algorithms, Mondrian, Hilb and MASK, into Mondrian++, Hilb++ and MASK++, respectively, to eliminate algorithm-based disclosure. We conducted extensive experiments on real-world datasets to demonstrate the effectiveness, efficiency and utility of our tools.

References

[1] P. Samarati, L. Sweeney, Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, Technical Report, CMU, SRI, 1998.
[2] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, ℓ-diversity: privacy beyond k-anonymity, in: ICDE, 2006.
[3] R.C. Wong, J. Li, A.W. Fu, K. Wang, (a, k)-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing, in: KDD, 2006, pp. 754–759.
[4] N. Li, T. Li, S. Venkatasubramanian, t-Closeness: privacy beyond k-anonymity and ℓ-diversity, in: ICDE, 2007, pp. 106–115.
[5] Q. Zhang, N. Koudas, D. Srivastava, T. Yu, Aggregate query answering on anonymized tables, in: ICDE, 2007, pp. 116–125.
[6] X. Xiao, Y. Tao, m-Invariance: towards privacy preserving re-publication of dynamic datasets, in: SIGMOD, 2007, pp. 689–700.
[7] K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, in: SIGMOD, 2005, pp. 49–60.
[8] K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Mondrian multidimensional k-anonymity, in: ICDE, 2006, pp. 25–35.
[9] K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Workload-aware anonymization, in: KDD, 2006, pp. 277–286.
[10] X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, in: VLDB, 2006, pp. 139–150.
[11] G. Ghinita, P. Karras, P. Kalnis, N. Mamoulis, Fast data anonymization with low information loss, in: VLDB, 2007, pp. 758–769.
[12] R.C. Wong, A.W. Fu, K. Wang, J. Pei, Minimality attack in privacy-preserving data publishing, in: VLDB, 2007, pp. 543–554.
[13] A. Kerckhoffs, La cryptographie militaire (Military Cryptography), Journal des sciences militaires IX (1883) 5–83.
[14] L. Zhang, S. Jajodia, A. Brodsky, Information disclosure under realistic assumptions: privacy versus optimality, in: CCS, 2007.
[15] C. Dwork, Differential privacy, in: ICALP, 2006, pp. 1–12.
[16] A. Machanavajjhala, J. Gehrke, M. Goetz, Data publishing against realistic adversaries, in: VLDB, 2009.
[17] N. Koudas, D. Srivastava, T. Yu, Q. Zhang, Distribution-based microdata anonymization, in: VLDB, 2009.
[18] M.F. Mokbel, C.Y. Chow, W.G. Aref, The New Casper: query processing for location services without compromising privacy, in: VLDB, 2006.
[19] K. Liu, E. Terzi, Towards identity anonymization on graphs, in: SIGMOD, 2009.
[20] H. Yeye, J. Naughton, Anonymization of set-valued data via top-down local generalization, in: VLDB, 2009.
[21] T. Li, N. Li, Injector: mining background knowledge for data anonymization, in: ICDE, 2008, pp. 446–455.
[22] M.E. Nergiz, M. Atzori, C. Clifton, Hiding the presence of individuals from shared databases, in: SIGMOD, 2007, pp. 665–676.
[23] D.J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, J. Halpern, Worst-case background knowledge for privacy-preserving data publishing, in: ICDE, 2007, pp. 126–135.
[24] C.E. Shannon, Communication theory of secrecy systems, Bell System Technical Journal 28 (1949) 656–715.
[25] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley-Interscience, 1991.
[26] R.J. Bayardo, R. Agrawal, Data privacy through optimal k-anonymization, in: ICDE, 2005.
[27] V.S. Iyengar, Transforming data to satisfy privacy constraints, in: KDD, 2002.
[28] D. Kifer, J. Gehrke, Injecting utility into anonymized datasets, in: SIGMOD, 2006.
[29] L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5) (2002) 571–588.
[30] X. Xiao, Y. Tao, Personalized privacy preservation, in: SIGMOD, 2006, pp. 229–240.
[31] J. Li, Y. Tao, X. Xiao, Preservation of proximity privacy in publishing numeric sensitive data, in: SIGMOD, 2008, pp. 473–486.
[32] V.S. Iyengar, Transforming data to satisfy privacy constraints, in: KDD, 2002, pp. 279–288.
[33] A. Meyerson, R. Williams, On the complexity of optimal k-anonymity, in: PODS, 2004, pp. 223–228.
[34] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, A. Zhu, Anonymizing tables, in: ICDT, 2005, pp. 246–258.
[35] B.C.M. Fung, K. Wang, P.S. Yu, Top-down specialization for information and privacy preservation, in: ICDE, 2005, pp. 205–216.
[36] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu, Achieving anonymity via clustering, in: PODS, 2006, pp. 153–162.
[37] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, A.W. Fu, Utility-based anonymization using local recoding, in: KDD, 2006, pp. 785–790.
[38] H. Park, K. Shim, Approximate algorithms for k-anonymity, in: SIGMOD, 2007, pp. 67–78.
[39] T. Iwuchukwu, J. Naughton, k-Anonymization as spatial indexing: toward scalable and incremental anonymization, in: VLDB, 2007, pp. 746–757.
[40] G. Ghinita, Y. Tao, P. Kalnis, On the anonymization of sparse high-dimensional data, in: ICDE, 2008, pp. 715–724.
[41] X. Jin, M. Zhang, N. Zhang, G. Das, Versatile publishing for privacy preservation, in: KDD, 2010.
[42] B. Chen, R. Ramakrishnan, K. LeFevre, Privacy skyline: privacy with multidimensional adversarial knowledge, in: VLDB, 2007, pp. 770–781.
[43] W. Du, Z. Teng, Z. Zhu, Privacy-MaxEnt: integrating background knowledge in privacy quantification, in: SIGMOD, 2008, pp. 459–472.
[44] T. Li, N. Li, J. Zhang, Modeling and integrating background knowledge in data anonymization, in: ICDE, 2009.
[45] Y. Tao, X. Xiao, J. Li, D. Zhang, On anti-corruption privacy preserving publication, in: ICDE, 2008, pp. 725–734.
[46] V. Rastogi, S. Hong, D. Suciu, The boundary between privacy and utility in data publishing, in: VLDB, 2007, pp. 531–542.
[47] W. Liu, L. Wang, L. Zhang, k-Jump strategy for preserving privacy in micro-data disclosure, in: ICDT, 2010.
[48] G. Cormode, N. Li, T. Li, D. Srivastava, Minimizing minimality and maximizing utility: analyzing method-based attacks on anonymized data, in: VLDB, 2010.
[49] X. Jin, N. Zhang, G. Das, Algorithm-safe privacy preserving data publishing, in: EDBT, 2010.