MAGE: A semantics retaining K-anonymization method for mixed data




Knowledge-Based Systems 55 (2014) 75–86



0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2013.10.009

* Corresponding author. Tel.: +86 579 82283528. E-mail address: [email protected] (J. Han).

Jianmin Han a,*, Juan Yu b, Yuchang Mo a, Jianfeng Lu a, Huawen Liu a

a Department of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China
b Department of Computer Science and Technology, Fudan University, Shanghai 200433, China


Article history: Received 15 July 2011; Received in revised form 5 October 2013; Accepted 6 October 2013; Available online 17 October 2013

Keywords: K-anonymity; Generalization; Microaggregation; Privacy preservation

Abstract: K-anonymity is a fine approach to protecting privacy in the release of microdata for data mining. Microaggregation and generalization are two typical methods to implement k-anonymity, but both have defects when anonymizing mixed microdata. To address the problem, we propose a novel anonymization method, named MAGE, which can retain more semantics than generalization and microaggregation in dealing with mixed microdata. The idea of MAGE is to combine the mean vector of numerical data with the generalization values of categorical data as a clustering centroid and to use it as the incarnation of the tuples in the corresponding cluster. We also propose an efficient TSCKA algorithm to anonymize mixed data. Experimental results show that MAGE can anonymize mixed microdata effectively and that the TSCKA algorithm achieves a better trade-off between data quality and algorithm efficiency compared with two well-known anonymization algorithms, Incognito and KACA.


1. Introduction

Microdata, i.e., data containing individual-specific information, play an increasingly important role in data analysis and scientific research. Therefore, many organizations are increasingly collecting and publishing microdata. However, publishing microdata may threaten individuals' privacy, because it is still possible to link the released data with other public or easy-to-access databases to re-identify individuals' identities, even though attributes that obviously identify individuals have been removed before publication. Sweeney [1] pointed out that about 87% of Americans can be uniquely identified through the combination of their gender, birthday and postcode. If an inpatient dataset containing these three attributes is released by medical organizations, adversaries can re-identify the disease of a specific inpatient by linking it with external tables, even though the identity attributes have been removed before publication.

To address the problem, Samarati [2] proposed the k-anonymity model. It requires that each tuple has at least k indistinguishable tuples with respect to the quasi-identifier in the released data. The quasi-identifier is a combination of attributes which also exist in other available public microdata, and can be used to re-identify individuals' identities by linking the public microdata with the released data. The k-anonymity model can protect individuals' identities from being re-identified with high probability by adversaries. Due to the practicability of k-anonymity, it has been studied extensively in recent years [3–16].

Generalization and microaggregation are two typical techniques to achieve k-anonymity. Generalization was proposed by Samarati [2]; its idea is to replace the real values of the quasi-identifier with less specific but semantically consistent values. Microaggregation is a statistical disclosure control (SDC) method; its idea is to construct small clusters from the data (each cluster should have at least k elements) and then to replace each original record with the centroid of its corresponding cluster. Generalization is good at dealing with categorical data, as it generally generalizes the original data into more semantics-retaining data. Microaggregation is suitable for k-anonymizing numerical data, because it uses the mean vector to replace numerical values, which preserves more numerical semantic information than generalization. However, when it comes to mixed data consisting of numerical and categorical data, neither generalization nor microaggregation can k-anonymize it effectively.

1.1. Limitations of generalization on k-anonymizing mixed data

For numerical data, generalization is not an appropriate k-anonymization method [4]. Because it k-anonymizes numerical data in the same way as categorical data, i.e., it replaces a numerical value x with a continuous range [a, b] containing x, labeled L[a,b], generalization loses much of the semantic information of numerical data. For example, analysts cannot infer from L[a,b] whether the original numerical values are mostly in the lower half of [a, b] or in its upper half. In addition, generalization is time-consuming and loses considerable information when dealing with microdata with a large quasi-identifier. Aggarwal [3] claimed that generalization may suffer from the curse of dimensionality.



1.2. Limitations of microaggregation on k-anonymizing mixed data

For categorical data, microaggregation is not well suited. Because it replaces categorical data with the mode of its corresponding cluster in the k-anonymization procedure, microaggregation may eliminate some low-frequency values of the microdata. For example, consider a dataset with two attributes, illness and gender, where the illness value is ''adiposity''. Suppose there are 70% females and 30% males in the dataset. Microaggregation k-partitions the original dataset into equivalence classes whose size is between k and 2k − 1. If the distribution of gender values in all these equivalence classes is proportional, the gender values are all changed into ''female''. Thus, an analyst would obtain the wrong information from the anonymous dataset that all ''adiposity'' patients are ''female''.

In addition, the distance measurement used by existing microaggregation algorithms cannot capture the practical semantics of categorical data. For example, the distance between the two zip codes {4661, 4663} is the same as the distance between the two zip codes {4661, 4331} in microaggregation, and both of them are 1. But in terms of their practical semantics, the two zip codes {4661, 4663} should be closer than {4661, 4331}.

We previously adopted microaggregation technology to anonymize mixed data [5]. From that work we know that in a dataset anonymized by microaggregation, some important values are eliminated and the distributions of values are not consistent with those in the original data.

1.3. Contributions and paper outline

As mentioned above, both generalization and microaggregation have limitations on k-anonymizing mixed data. To address this problem, we propose the MAGE (MicroAggregation and GEneralization) method, which combines the two: we use generalization to anonymize categorical data and microaggregation to anonymize numerical data. Our contributions include: (1) we define a distance measurement for mixed data, which is the generalization distance for categorical data and the Euclidean distance for numerical data; (2) we propose a method to anonymize microdata which integrates the mean vector of numerical data and the closest common generalization of categorical data as the group centroid and uses it as the incarnation of the tuples in the corresponding cluster; and (3) we propose an efficient TSCKA (Two Steps Clustering K-anonymization Algorithm) to k-anonymize mixed data.

The rest of the paper is organized as follows. Section 2 introduces preliminary concepts. Section 3 proposes the MAGE method combining generalization and microaggregation for mixed data, and presents an evaluation method for the k-anonymization quality of mixed data. Section 4 proposes the TSCKA algorithm to achieve k-anonymity for mixed data. Section 5 compares and analyzes experimental results. Section 6 briefly discusses related work. Section 7 concludes the work.

2. Preliminaries

2.1. K-anonymity

Attributes in microdata can be divided into three categories: explicit identifiers, quasi-identifiers and sensitive attributes. Explicit identifiers, such as name or social security number, are attributes that can uniquely identify individuals; they should be removed or encrypted prior to publishing. Quasi-identifiers cannot identify individuals by themselves, but can disclose individuals' privacy when linked with external tables. Sensitive attributes are a set of attributes which contain individuals' sensitive information, e.g., salary, religion, health status and so on.

Definition 1 (Anonymous Equivalence Class). An anonymous equivalence class of a table with respect to an attribute set is a set of tuples in which all tuples contain identical values for the attribute set.

For example, tuples 1, 2, 3 in Table 1(b) form an equivalence class with respect to the attribute set {Gender, Age, Pcode}, whose corresponding values are identical. We generally name this attribute set the quasi-identifier.

Definition 2 (K-anonymity). Let T(A1, A2, ..., An) be a table and QI be the quasi-identifier of table T. Table T satisfies k-anonymity if and only if each sequence of values in T[QI] appears with at least k occurrences, where T[QI] denotes the projection of the attributes of QI in T, maintaining duplicate tuples.

K-anonymity requires that the size of each equivalence class with respect to the quasi-identifier is at least k. For example, Table 1(b and c) are 2-anonymous views of Table 1(a), where the quasi-identifier is {Gender, Age, Pcode}. Generally, a table has more than one k-anonymous view, but some are better than others. Obviously, Table 1(c) loses much more detail than Table 1(b), i.e., the data utility of Table 1(b) is better than that of Table 1(c). It has been proved that the size of each equivalence class should be in [k, 2k − 1] in order to retain as much data utility as possible.
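The condition in Definition 2 is easy to check mechanically. Below is a minimal sketch (not from the paper; function and variable names are our own) that counts quasi-identifier combinations and verifies that every equivalence class has at least k members, applied to the quasi-identifier values of Table 1(b):

```python
from collections import Counter

def is_k_anonymous(table, qi_indices, k):
    """Check Definition 2: every combination of quasi-identifier values
    must occur at least k times in the table."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in table)
    return all(c >= k for c in counts.values())

# Quasi-identifier values of the 2-anonymous view in Table 1(b).
view_b = [
    ("*", "[30,39]", "466*"), ("*", "[30,39]", "466*"), ("*", "[30,39]", "466*"),
    ("Female", "[20,29]", "4354"), ("Female", "[20,29]", "4354"),
    ("Female", "[50,59]", "4331"), ("Female", "[50,59]", "4331"),
    ("*", "[60,69]", "465*"), ("*", "[60,69]", "465*"), ("*", "[60,69]", "465*"),
    ("Male", "[40,59]", "4354"), ("Male", "[40,59]", "4354"),
]
print(is_k_anonymous(view_b, [0, 1, 2], 2))  # True: every class has size >= 2
```

With k = 3 the same view fails, since two of its equivalence classes have only two tuples.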

2.2. Generalization

2.2.1. Basic concepts

Generalization is one of the most popular approaches for k-anonymizing microdata. The main idea of generalization is to replace the original values of the quasi-identifier with less specific values which are semantically consistent with the original values. The quasi-identifier may consist of both numerical and categorical attributes. Generally, we use different generalization methods for different kinds of attributes, i.e., categorical attribute values are generalized to a set of distinct values or a single less specific value, whereas numerical attribute values are generalized to intervals which contain the original values. For example, by stripping the rightmost digit, the three categorical attribute values {4661, 4663, 4663} for zip code can be generalized to 466*, which semantically represents a larger geographical area, whereas the three numerical attribute values {35, 36, 37} are typically generalized to an interval [30, 39].

Actually, the generalization operation can be done in two different ways, namely global recoding and local recoding. Global recoding, also called domain generalization, is performed at the domain level based on a predefined domain generalization hierarchy tree, denoted DGHT. In global recoding, once an attribute value is generalized, every occurrence of that value must be replaced by the new generalized value. Table 1(c) gives a global recoding version of the original table. It is easy to see that global recoding may over-generalize microdata. In contrast, local recoding, also called value generalization, generalizes attribute values at the cell level based on a predefined value generalization hierarchy tree, denoted VGHT. Local recoding does not require all occurrences of a value to be changed into the same generalized value; it allows a generalized attribute value to co-exist with the original value within the same generalized microdata. Table 1(b) gives a local recoding version. Different from global recoding, local recoding does not over-generalize a table, hence it may minimize the distortion of a k-anonymous view. Table 1 gives a good example to illustrate the differences between global recoding and local recoding. Fig. 1 gives the domain and value generalization hierarchies of the attributes Gender and Pcode.

Table 1. Global recoding and local recoding example.

(a) An original table:

Gender  Age  Pcode  Problem
Female  35   4661   Stress
Male    36   4663   Obesity
Female  37   4663   Obesity
Female  21   4354   Stress
Female  25   4354   Obesity
Female  55   4331   Stress
Female  57   4331   Obesity
Female  67   4652   Stress
Female  69   4653   Obesity
Male    68   4653   Stress
Male    48   4354   Obesity
Male    54   4354   Stress

(b) A 2-anonymous view by local recoding:

Gender  Age      Pcode  Problem
*       [30,39]  466*   Stress
*       [30,39]  466*   Obesity
*       [30,39]  466*   Obesity
Female  [20,29]  4354   Stress
Female  [20,29]  4354   Obesity
Female  [50,59]  4331   Stress
Female  [50,59]  4331   Obesity
*       [60,69]  465*   Stress
*       [60,69]  465*   Obesity
*       [60,69]  465*   Stress
Male    [40,59]  4354   Obesity
Male    [40,59]  4354   Stress

(c) A 2-anonymous view by global recoding:

Gender  Age      Pcode  Problem
*       [20,39]  466*   Stress
*       [20,39]  466*   Obesity
*       [20,39]  466*   Obesity
*       [20,39]  435*   Stress
*       [20,39]  435*   Obesity
*       [40,59]  433*   Stress
*       [40,59]  433*   Obesity
*       [60,79]  465*   Stress
*       [60,79]  465*   Obesity
*       [60,79]  465*   Stress
*       [40,59]  435*   Obesity
*       [40,59]  435*   Stress
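A single local-recoding step over a VGHT can be sketched as a child-to-parent lookup. The dictionary below encodes the Pcode value hierarchy of Fig. 1; the helper name generalize_once is ours, not the paper's:

```python
# Child -> parent map encoding the Pcode value generalization hierarchy (Fig. 1).
VGH_PCODE = {
    "4354": "435*", "4331": "433*", "4652": "465*", "4653": "465*",
    "4663": "466*", "4661": "466*",
    "435*": "43**", "433*": "43**", "465*": "46**", "466*": "46**",
    "43**": "4***", "46**": "4***", "4***": "*",
}

def generalize_once(value, vgh):
    """One local-recoding step: replace a cell value by its parent in the VGH.
    The root '*' (and any unknown value) is returned unchanged."""
    return vgh.get(value, value)

print(generalize_once("4661", VGH_PCODE))  # 466*
print(generalize_once("466*", VGH_PCODE))  # 46**
```

Because the lookup is applied per cell rather than per domain, different occurrences of the same value may end up at different levels, which is exactly what distinguishes local from global recoding.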

Many local recoding algorithms have been proposed to achieve k-anonymity, such as the K-Anonymization by Clustering in Attribute hierarchies (KACA) algorithm [6], the Top-Down algorithm [7], etc. Among these, clustering-based generalization algorithms are a fine family of local recoding methods for k-anonymizing microdata. Their main idea is to partition the original dataset into equivalence classes based on a predefined distance, and then generalize all tuples in each equivalence class into their common generalized tuples. To achieve clustering, we need measurements of the distance between tuples. To the best of our knowledge, the weighted hierarchical distance [6] based on the generalization tree is a reasonable measurement of the generalization distance between tuples. We define a generalization distance similar to the weighted hierarchical distance in the next section.

2.2.2. Distance measurement in generalization

In order to define the distance between tuples in a generalization table, we need the concepts of closest common generalization, common generalization tuple, and distortion of generalization of a tuple. In this section, we first give the definitions of these three concepts, and then define the generalization distance between two tuples.

Definition 3 (Closest Common Generalization). Let A be an attribute of table T, VGHT_A be the value generalization hierarchy tree of A, and a1 and a2 be two values of attribute A. The closest common generalization of a1 and a2, denoted CCG(a1, a2), is defined as (1):

CCG(a1, a2) = a1, if a1 = a2; otherwise, the closest common ancestor of a1 and a2 in VGHT_A.   (1)

Definition 4 (Common Generalization Tuple). Let ti and tj be two tuples of table T, QI = {A1, ..., Aq} be the quasi-identifier of T, and t.Ai be the value of tuple t on attribute Ai. The common generalization tuple (CGT) t' of ti and tj is defined as (2):

t' = (CCG(ti.A1, tj.A1), ..., CCG(ti.Aq, tj.Aq))   (2)

Definition 5 (Distortion of Generalization of Tuples). Let t be a tuple of table T, t' be a generalization tuple of t, QI = {A1, ..., Aq} be the quasi-identifier of T, t.Ai be the value of tuple t on attribute Ai, and VGHT_A1, ..., VGHT_Aq be the value generalization hierarchy trees of QI. The distortion of t generalized to t' is defined as (3):

distortion(t, t') = Σ_{i=1..q} ω_Ai × (level(t'.Ai) − 1) / h(VGHT_Ai)   (3)

where h(VGHT_Ai) denotes the height of VGHT_Ai, level(t'.Ai) denotes the level of t'.Ai in VGHT_Ai, and ω_Ai denotes the weight of attribute Ai.

For example, let attribute Gender have the hierarchy {male/female, *}, attribute Pcode have the hierarchy {dddd, ddd*, dd**, d***, *}, and ω_Ai be 1. For t1 = {female, 4661} and t1' = {*, 466*}, distortion(t1, t1') = 1/2 + 1/5 = 0.7.

Definition 6 (Generalization Distance between Two Tuples). Let T(A1, A2, ..., An) be a table and QI = {A1, ..., Aq} be the quasi-identifier of table T. Given two tuples ti and tj of T and their common generalization tuple t', the generalization distance between ti and tj is defined as (4):

dist_gen(ti, tj) = distortion(ti, t') + distortion(tj, t')   (4)

For example, with the same hierarchies and weights as above, for t1 = {female, 4661}, t2 = {male, 4663} and t' = {*, 466*}, dist_gen(t1, t2) = distortion(t1, t') + distortion(t2, t') = 0.7 + 0.7 = 1.4.

[Fig. 1. Domain and value generalization hierarchies of Gender and Pcode. (a) DGH_Gender: g0 = {male, female} → g1 = {*}. (b) VGH_Gender: male, female → *. (c) DGH_Pcode: z0 = {4354, 4331, 4652, 4653, 4663, 4661} → z1 = {435*, 433*, 465*, 466*} → z2 = {43**, 46**} → z3 = {4***} → z4 = {*}. (d) VGH_Pcode: 4354 → 435*; 4331 → 433*; 4652, 4653 → 465*; 4663, 4661 → 466*; 435*, 433* → 43**; 465*, 466* → 46**; 43**, 46** → 4*** → *.]
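Definitions 5 and 6 can be reproduced with a small script. The encoding below (node levels with leaves at level 1, plus tree heights) follows the Gender and Pcode hierarchies of Fig. 1; all identifiers are illustrative, not from the paper:

```python
# Per-attribute VGH metadata from Fig. 1: node levels (leaves at level 1)
# and tree heights. Gender has height 2, Pcode has height 5.
GENDER = {"levels": {"male": 1, "female": 1, "*": 2}, "height": 2}
PCODE = {"levels": {"4661": 1, "4663": 1, "4354": 1, "4331": 1, "4652": 1,
                    "4653": 1, "466*": 2, "465*": 2, "435*": 2, "433*": 2,
                    "46**": 3, "43**": 3, "4***": 4, "*": 5},
         "height": 5}

def distortion(t, t_gen, attrs, weights=None):
    """Formula (3): sum of w_i * (level(t'_i) - 1) / height_i. For original
    (leaf-level) tuples t, only the generalized tuple's levels enter the sum."""
    weights = weights or [1.0] * len(attrs)
    return sum(w * (a["levels"][g] - 1) / a["height"]
               for w, g, a in zip(weights, t_gen, attrs))

def dist_gen(t1, t2, t0, attrs):
    """Formula (4): distortion(t1, t0) + distortion(t2, t0)."""
    return distortion(t1, t0, attrs) + distortion(t2, t0, attrs)

attrs = [GENDER, PCODE]
t1, t2, t0 = ("female", "4661"), ("male", "4663"), ("*", "466*")
print(distortion(t1, t0, attrs))    # 1/2 + 1/5 ≈ 0.7
print(dist_gen(t1, t2, t0, attrs))  # ≈ 1.4
```

The printed values match the paper's worked example: 1/2 + 1/5 = 0.7 for each tuple, so the generalization distance is 1.4.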



2.3. Microaggregation

2.3.1. Basic concepts

Microaggregation is another popular approach for k-anonymizing microdata; it is a family of SDC methods for microdata. The procedure of microaggregation can be divided into two steps: (1) k-partition: the original dataset is partitioned into several clusters so that tuples in the same cluster are as similar to each other as possible, and the number of tuples in each cluster is no less than k, where k is a constant specified by the k-anonymizing system. (2) Aggregation: the clustering centroid (for example, the mean vector for numerical data or the mode vector for categorical data) of each cluster generated in the first step is computed, and then all tuples of each cluster are transformed into their corresponding cluster centroid. Table 2 gives a 2-anonymity table generated by microaggregation from Table 1(a). Domingo-Ferrer and Mateo-Sanz [8] have proved that the cluster size of an optimal k-partition is between k and 2k − 1.

2.3.2. Distance measurement in microaggregation

Distance measurement in microaggregation differs from that in generalization. As attributes in a table can be divided into numerical and categorical attributes, different measurements are used for different types of attributes. For numerical attributes, we adopt the Euclidean distance to measure the distance between two attribute values. For categorical attributes, which can be further divided into ordinal and nominal attributes, we use formula (5) to measure the distance between ordinal attribute values and formula (6) to measure the distance between nominal attribute values.

Definition 7 (Distance between Two Ordinal Attribute Values). Let a1 and a2 be two values of an ordinal attribute A. The distance between a1 and a2 is defined as (5):

d_Ordinal(a1, a2) = |{j | a1 ≤ j < a2}| / |D(A)|   (5)

where |D(A)| is the number of distinct values of attribute A.

For example, suppose attribute A has 5 distinct values {first, second, third, fourth, fifth}. If a1 = 'first' and a2 = 'third', then d_Ordinal(a1, a2) = 2/5 = 0.4.

Definition 8 (Distance between Two Nominal Attribute Values). Let a1 and a2 be two values of a nominal attribute. The distance between a1 and a2 is defined as (6):

d_Nominal(a1, a2) = 0 if a1 = a2; 1 otherwise.   (6)

Table 2. A 2-anonymity table from Table 1(a) by microaggregation.

Gender  Age  Pcode  Problem
Female  36   4663   Obesity
Female  36   4663   Stress
Female  36   4663   Obesity
Female  23   4354   Stress
Female  23   4354   Obesity
Female  56   4331   Stress
Female  56   4331   Obesity
Female  68   4653   Stress
Female  68   4653   Obesity
Female  68   4653   Stress
Male    52   4354   Obesity
Male    52   4354   Stress

Definition 9 (Microaggregation Distance between Two Tuples). Let T be a table, QI = {A1, ..., Aq} be the quasi-identifier of T, and t.Ai be the value of tuple t on attribute Ai. Assume that attributes A1, ..., As (s ≤ q) are numerical and the remaining attributes are categorical. Given two tuples ti and tj, the distance between ti and tj is defined as (7):

dist_Mic(ti, tj) = sqrt( Σ_{l=1..s} (ti.Al − tj.Al)^2 ) + Σ_{m=s+1..q} d(ti.Am, tj.Am)   (7)

where ti.Am and tj.Am denote the mth attribute values of ti and tj respectively, and

d(ti.Am, tj.Am) = d_Ordinal(ti.Am, tj.Am) if Am is ordinal; d_Nominal(ti.Am, tj.Am) if Am is nominal.   (8)
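Formulas (5)–(8) translate directly into code. The sketch below uses our own tuple layout and helper names (an index set for the numerical attributes, and a map from categorical attribute index to its distance function):

```python
import math

def d_nominal(a, b):
    """Formula (6): 0 if the two nominal values are equal, else 1."""
    return 0.0 if a == b else 1.0

def d_ordinal(a, b, ordered_domain):
    """Formula (5): number of domain values between a and b, over |D(A)|."""
    i, j = ordered_domain.index(a), ordered_domain.index(b)
    return abs(i - j) / len(ordered_domain)

def dist_mic(t1, t2, num_idx, cat):
    """Formula (7): Euclidean distance over the numerical attributes plus the
    summed categorical distances; cat maps attribute index -> distance function."""
    euclid = math.sqrt(sum((t1[i] - t2[i]) ** 2 for i in num_idx))
    return euclid + sum(fn(t1[i], t2[i]) for i, fn in cat.items())

domain = ["first", "second", "third", "fourth", "fifth"]
print(d_ordinal("first", "third", domain))  # 0.4, as in the example above

# Tuples from Table 1(a): (Age, Gender, Pcode), Gender and Pcode treated as nominal.
t1, t2 = (35, "Female", "4661"), (36, "Male", "4663")
cat = {1: d_nominal, 2: d_nominal}
print(dist_mic(t1, t2, [0], cat))  # 1.0 + 1 + 1 = 3.0
```

Note how the nominal distance gives 4661 vs. 4663 the same penalty as any other mismatched pair, which is exactly the semantic weakness MAGE addresses in Section 3.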

3. MAGE for mixed data

3.1. Overview

From Section 2 we know that neither generalization nor microaggregation can anonymize mixed data effectively. To address the problem, we propose the MAGE method. MAGE first partitions the original microdata into equivalence classes, each of size no less than k, based on semantic similarity. Then MAGE constructs the centroid for each equivalence class and uses it as the incarnation of each tuple in its corresponding equivalence class to anonymize the dataset. The centroid constructed by MAGE consists of the mean vector of the numerical data and the closest common generalization of the categorical data. For example, Table 3 is a k-anonymity table obtained from Table 1(a) by MAGE.

MAGE has the following advantages over generalization and microaggregation:

(1) As MAGE uses the mean vector of numerical data to replace the original numerical data, it can preserve more numerical semantics than the generalization method. For example, from Table 3 we can see that the age value 36 is more similar to the original values {35, 36, 37} in the first equivalence class, and preserves more numerical semantics than the generalization value [30, 39] in Table 1(b).

(2) As MAGE uses generalization values to replace original categorical data, it can preserve more semantics for categorical data than the microaggregation method. For example, from Table 3 we can see that the generalization value 466* of Pcode is semantically consistent with the original values. However, the value of Pcode in the first equivalence class under microaggregation in Table 2 is changed into 4663, which is inconsistent with the original value 4661. What is more, microaggregation may change the distribution of many values. For example, the ratio of ''male'' to ''female'' is 1:2 in Table 1(a), but becomes 1:5 in Table 2. In addition, microaggregation may also eliminate some low-frequency values, which is unfavorable for data mining on anonymous data. For example, the Pcode values 4661 and 4652 are eliminated in Table 2.

Table 3. A 2-anonymity table from Table 1(a) by MAGE.

Gender  Age  Pcode  Problem
*       36   466*   Obesity
*       36   466*   Stress
*       36   466*   Obesity
Female  23   4354   Stress
Female  23   4354   Obesity
Female  56   4331   Stress
Female  56   4331   Obesity
*       68   465*   Stress
*       68   465*   Obesity
*       68   465*   Stress
Male    52   4354   Obesity
Male    52   4354   Stress

Table 4. Description of the Adult dataset.

No.  Attribute       Type         Distinct values  Height
1    Age             Numeric      74               5
2    fnlwgt          Numeric      100              5
3    Education_num   Numeric      16               4
4    Hours_per_week  Numeric      99               5
5    Work class      Categorical  8                4
6    Education       Categorical  16               4
7    Marital Status  Categorical  7                3
8    Race            Categorical  5                2
9    Gender          Categorical  2                2
10   Occupation      Sensitive    14               /

(3) For nominal attribute values, MAGE uses the generalization distance defined in Definition 6 instead of the microaggregation distance defined in Definition 9, so it can achieve a better k-partition than the microaggregation method. For example, given a set of Pcode values {4661, 4663, 4331, 4333}, according to the generalization distance, dist_gen({4661},{4663}) = 0.4, dist_gen({4331},{4333}) = 0.4, and dist_gen({4331},{4661}) = 1.2, so {4661, 4663, 4331, 4333} should be 2-partitioned into the two clusters {4661, 4663} and {4331, 4333}. According to the microaggregation distance, the distance between arbitrary pairwise values in {4661, 4663, 4331, 4333} is 1, so the set may be 2-partitioned into {4661, 4331} and {4663, 4333}, which is unreasonable in terms of semantics.

3.2. Distance measurement in MAGE

Definition 10 (Mixed Distance). Let T be a table and QI = {A1, ..., Aq} be the quasi-identifier of T, where {A1, ..., As} (s ≤ q) is the numerical attribute set and {As+1, ..., Aq} is the categorical attribute set. Given two tuples ti and tj of T, let do(ti, tj) be the distance of the numerical attribute part of ti and tj, which adopts the Euclidean distance, and let dc(ti, tj) be the distance of the categorical attribute part, which adopts the generalization distance, see (4). The mixed distance is defined as (9):

d(ti, tj) = do(ti, tj) + f(dc(ti, tj))   (9)

where f(·) is a mapping function which controls the proportion of the numerical attribute distance and the categorical attribute distance in the mixed distance measurement. Let the range of do be [do_min, do_max] and the range of dc be [dc_min, dc_max]; then f(·) can be defined as (10):

f(dc) = (Nc / No) × ( do_min + (do_max − do_min) / (dc_max − dc_min) × (dc − dc_min) )   (10)

where Nc is the number of categorical attributes and No is the number of numerical attributes.

For example, Table 3 contains 12 tuples with one numerical attribute Age and two categorical attributes Gender and Pcode. After normalizing the values of Age, the first tuple is t1 = {female, −0.7934, 4661} and the second tuple is t2 = {male, −0.7308, 4663}. Then do(t1, t2) = (((−0.7934) − (−0.7308))²)^(1/2) = 0.0626 and dc(t1, t2) = (1/2 + 1/2) + (1/5 + 1/5) = 1.4. Similarly, we can calculate all pairwise tuple distances do and dc, and easily obtain domin = 0.0626, domax = 3.0065, dcmin = 0, dcmax = 2.2. According to formula (9), d(t1, t2) = do(t1, t2) + f(dc(t1, t2)) = do(t1, t2) + (Nc/No) · (domin + (domax − domin)/(dcmax − dcmin) · (dc − dcmin)) = 0.0626 + (2/1) · (0.0626 + (3.0065 − 0.0626)/(2.2 − 0) · (1.4 − 0)) = 3.935.
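The worked example above can be reproduced with a short sketch (a hypothetical Python illustration, not the authors' implementation; the range constants and attribute counts are taken directly from the example):

```python
# Ranges observed over all pairwise distances in the example (Table 3).
DO_MIN, DO_MAX = 0.0626, 3.0065   # numerical-part distance range
DC_MIN, DC_MAX = 0.0, 2.2         # categorical-part distance range
N_CAT, N_NUM = 2, 1               # Nc categorical, No numerical attributes

def f(dc: float) -> float:
    """Formula (10): map a categorical distance onto the numerical distance scale."""
    scaled = DO_MIN + (DO_MAX - DO_MIN) / (DC_MAX - DC_MIN) * (dc - DC_MIN)
    return (N_CAT / N_NUM) * scaled

def mixed_distance(do: float, dc: float) -> float:
    """Formula (9): numerical-part distance plus the mapped categorical distance."""
    return do + f(dc)

# Worked example from the text: do(t1, t2) = 0.0626, dc(t1, t2) = 1.4
print(round(mixed_distance(0.0626, 1.4), 3))  # 3.935
```

Note that f(·) rescales dc into [domin·Nc/No, domax·Nc/No], so neither attribute type dominates the mixed distance.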

Definition 11 (Centroid of equivalence class of mixed data). Let equivalence class Ci have ni tuples, {t1, t2, . . ., t_ni} = {(t^o_1, t^c_1), (t^o_2, t^c_2), . . ., (t^o_ni, t^c_ni)} ⊆ Ci. Let t^o be the mean vector of {t^o_1, t^o_2, . . ., t^o_ni} and t^c be the closest common generalization of {t^c_1, t^c_2, . . ., t^c_ni}. Then the centroid of Ci is defined as {t^o, t^c}.

For example, Table 3 contains 5 equivalence classes, whose centroids are {*, 36, 465*}, {female, 23, 4354}, {female, 56, 4331}, {*, 68, 465*} and {male, 52, 4354}, respectively.
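A centroid under Definition 11 can be sketched as follows (hypothetical Python, not the authors' code; the small Pcode generalization hierarchy below is an illustrative assumption, not the paper's actual taxonomy):

```python
# Each value maps to its chain of ancestors up to the root '*'
# in an (assumed) value-generalization hierarchy.
PCODE_CHAINS = {
    "4661": ["4661", "466*", "46**", "*"],
    "4663": ["4663", "466*", "46**", "*"],
    "4331": ["4331", "433*", "43**", "*"],
}

def closest_common_generalization(values, chains):
    """Return the most specific ancestor shared by all values."""
    common = set(chains[values[0]])
    for v in values[1:]:
        common &= set(chains[v])
    # the shared ancestor lowest in the chain is the most specific one
    return min(common, key=lambda a: chains[values[0]].index(a))

def centroid(numeric_parts, categorical_parts, chains):
    """Definition 11: mean vector for numeric data + generalized categorical value."""
    n = len(numeric_parts)
    mean = [sum(col) / n for col in zip(*numeric_parts)]
    return mean, closest_common_generalization(categorical_parts, chains)

print(centroid([[34.0], [38.0]], ["4661", "4663"], PCODE_CHAINS))
# ([36.0], '466*')
```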

3.3. Data utility measurement

Privacy preservation is one side of anonymization. The other side is retaining high data utility so that the published data remain practically useful. Generally, the data publisher does not know how the published data will be analyzed by the recipient, so a reasonable data utility measurement is to measure the "similarity" between the original data and the anonymous data. Distance is a fine method to measure this similarity. We call this distance information loss and use it to measure the data utility of anonymous data: the more information loss, the worse the data utility of the anonymous data.

MAGE adopts two different technologies to anonymize mixed data, i.e., generalization for categorical attributes and microaggregation for numerical attributes, so MAGE uses different methods to evaluate data utility for categorical attributes and numerical attributes.

3.3.1. Data utility measurement for categorical attributes

MAGE adopts generalization to anonymize categorical attribute data, so we adopt the sum of distortions of the generalized tuples to measure the information loss of the anonymous data for categorical attributes.

Definition 12 (Information loss of categorical attributes). Let table T′ be generalized from table T, ti be the ith tuple in T, and t′i be generalized from tuple ti. The information loss (ILg) of T′ from T is defined as (11).

ILg = Σ_{i=1}^{|T|} distortion(ti, t′i)    (11)

where distortion(ti, t′i) is defined in formula (3).

3.3.2. Data utility measurement for numerical attributes

MAGE adopts microaggregation to anonymize numerical attribute data, whose information loss can be calculated by the following methods.

Let Gi be an equivalence class after k-partition; the information loss of Gi is defined as (12).

GSE(Gi) = Σ_{j=1}^{ni} d(Xij, X̄i)    (12)

where X̄i = (Σ_{j=1}^{ni} Xij)/ni is the mean vector of the numerical data in Gi, ni is the tuple number of the ith cluster Gi (ni ≥ k), Xij is the jth tuple of Gi, and d(Xij, X̄i) is the Euclidean distance between Xij and X̄i. The information loss of the anonymous data is defined as (13).

ILm = Σ_{i=1}^{g} GSE(Gi) = Σ_{i=1}^{g} Σ_{j=1}^{ni} d(Xij, X̄i)    (13)


80 J. Han et al. / Knowledge-Based Systems 55 (2014) 75–86

where g is the number of clusters in the anonymous data, ni is the tuple number of cluster Gi, and n = Σ_{i=1}^{g} ni.

3.3.3. Data utility measurement

Anonymous data generated by the MAGE method have two parts, i.e., generalization values for the categorical data and mean vectors for the numerical data. We adopt formula (11) to evaluate the information loss ILg for the categorical data and formula (13) to evaluate the information loss ILm for the numerical data, and then use the weighted average of ILm and ILg to evaluate the information loss for mixed data, see (14).

ILh = a · ILm + b · ILg    (14)

where a + b = 1.
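Formulas (12)–(14) can be combined in a short sketch (hypothetical Python, assuming Euclidean distance for the numerical parts; the categorical loss ILg is passed in as a precomputed value):

```python
import math

def gse(cluster, centroid):
    """Formula (12): sum of Euclidean distances from each tuple to the centroid."""
    return sum(math.dist(x, centroid) for x in cluster)

def il_m(clusters):
    """Formula (13): total numerical information loss over all clusters."""
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        centroid = [sum(col) / n for col in zip(*cluster)]
        total += gse(cluster, centroid)
    return total

def il_h(il_m_value, il_g_value, a=0.5, b=0.5):
    """Formula (14): weighted average of numerical and categorical loss, a + b = 1."""
    assert abs(a + b - 1.0) < 1e-9
    return a * il_m_value + b * il_g_value

clusters = [[[1.0], [3.0]], [[10.0], [10.0]]]  # two 2-member clusters
print(il_m(clusters))   # 2.0: the first cluster contributes |1-2| + |3-2|
print(il_h(2.0, 4.0))   # 3.0
```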

4. TSCKA algorithm for mixed data

Local recoding algorithms can generate better-quality anonymous data; however, their computational complexity is generally high, especially for a large dataset. So we propose a Two Steps Clustering K-anonymizing Algorithm (TSCKA) to improve performance. TSCKA is a two-phase algorithm. In the first phase, TSCKA uses the c-prototypes algorithm [9] to partition the whole dataset into several small clusters, whose size is much larger than k. The idea of c-prototypes is similar to the c-means algorithm, except that the algorithm can process mixed data. We call the first phase the c_partition algorithm. In the second phase, TSCKA uses KACA [6] to k-anonymize the clusters generated by the first phase to improve the data quality.

The c_partition algorithm is described in Algorithm 1. The time complexity of c_partition is O(c·n·t), the same as c-means, where c is the number of clusters, n is the tuple number of the dataset, and t is the number of iterations when clustering.

Algorithm 1. c_partition (T, c)

Input: T: a dataset to be clustered; c: the number of clusters.
Output: {T1, T2, . . ., Tc}: the partition of T.
Procedure:

(1) randomly choose c tuples from T as the initial cluster centers;
(2) for each tuple in T do
(3)   calculate the mixed distances of the tuple to all cluster centers;
(4)   allocate it to the cluster Ti (i ∈ {1..c}) whose cluster center is nearest to it;
(5)   update the cluster center of Ti;
(6) end for
(7) repeat
(8)   for each tuple in T do
(9)     re-calculate the mixed distances of the tuple to all cluster centers;
(10)    if its nearest cluster center belongs to another cluster Tj (j ∈ {1..c}, j ≠ i) then
(11)      re-allocate it to Tj and update the cluster centers of Ti and Tj;
(12)    end if
(13)  end for
(14) until no cluster changes;
(15) return {T1, T2, . . ., Tc};
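Algorithm 1 can be sketched in Python as follows (a simplified illustration, not the authors' code: a caller-supplied `dist` function stands in for the mixed distance of Definition 10, tuples are plain numeric vectors, and centers are recomputed batch-wise as means rather than incrementally):

```python
import random

def c_partition(T, c, dist, seed=0):
    """Partition dataset T (a list of numeric vectors) into c clusters, c-means style."""
    rng = random.Random(seed)
    centers = [list(t) for t in rng.sample(T, c)]  # step (1): random initial centers

    def nearest(t):
        return min(range(c), key=lambda i: dist(t, centers[i]))

    assign = [nearest(t) for t in T]          # steps (2)-(6): initial allocation
    for _ in range(100):                      # steps (7)-(14): iterate until stable
        for i in range(c):                    # recompute each cluster center
            members = [t for t, a in zip(T, assign) if a == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        new_assign = [nearest(t) for t in T]  # steps (8)-(11): re-allocate tuples
        if new_assign == assign:              # step (14): no cluster changed
            break
        assign = new_assign
    return [[t for t, a in zip(T, assign) if a == i] for i in range(c)]

euclid = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
parts = c_partition([[0.0], [0.1], [9.0], [9.1]], 2, euclid)
print(sorted(len(p) for p in parts))  # [2, 2]
```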

In the second step, KACA, a local recoding algorithm, is applied. Its idea is to randomly pick an equivalence class of size smaller than k and merge it with its closest equivalence class, producing a larger equivalence class with the smallest distortion. The process is repeated until each equivalence class contains at least k tuples with the same QI-values. The algorithm is described in Algorithm 2.

Algorithm 2. KACA (T, QI, k)

Input: T: an original dataset; QI: a quasi-identifier of T; k: the k-anonymity constraint.
Output: T′: a k-anonymous dataset from T.
Procedure:

(1) T′ = ∅;
(2) form the equivalence class set E = {E1, E2, . . ., Em} according to the QI-values of T;
    // tuples with the same QI-values constitute an equivalence class Ei, Ei ∈ E
(3) for all Ei ∈ E do
(4)   if |Ei| ≥ k then
(5)     T′ = T′ ∪ Ei;
(6)     E = E − Ei;
(7)   end if
(8) end for
(9) repeat
(10)  randomly choose Ei from E;
(11)  calculate the distances of Ei to all other equivalence classes in both E and T′;
(12)  find the equivalence class Ej (i ≠ j) with the nearest distance to Ei;
(13)  if |Ej| + |Ei| < k then
(14)    anonymize Ei and Ej to E*;
(15)    remove Ei and Ej;
(16)    add E* into E;
(17)  else if |Ej| + |Ei| < 2k then
(18)    anonymize Ei and Ej to E*;
(19)    remove Ei and Ej;
(20)    T′ = T′ ∪ E*;
(21)  else
(22)    randomly take (k − |Ei|) tuples from Ej to form an equivalence class E′j;
(23)    anonymize Ei and E′j to E*;
(24)    remove Ei and E′j;
(25)    T′ = T′ ∪ E*;
(26)  end if
(27) until E = ∅;
(28) return T′;
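The merge loop of Algorithm 2 can be sketched as follows (a deliberately simplified numeric stand-in, not the authors' implementation: equivalence classes are lists of 1-D values, the class distance is the distance between class means, "anonymize" just concatenates classes, and the class-splitting branch of steps (22)–(25) is omitted):

```python
import random

def kaca(T, k, seed=0):
    """Greedy bottom-up merging until every class has at least k tuples (simplified)."""
    rng = random.Random(seed)
    mean = lambda E: sum(E) / len(E)
    # step (2): each distinct value forms an equivalence class
    E = [[v for v in T if v == u] for u in sorted(set(T))]
    done = [Ei for Ei in E if len(Ei) >= k]   # steps (3)-(8): already large enough
    E = [Ei for Ei in E if len(Ei) < k]
    while E:                                   # steps (9)-(27)
        Ei = E.pop(rng.randrange(len(E)))      # step (10): random small class
        pool = E + done
        if not pool:                           # degenerate case: nothing to merge with
            done.append(Ei)
            break
        # steps (11)-(12): nearest class by distance between means
        Ej = min(pool, key=lambda c: abs(mean(c) - mean(Ei)))
        (E if Ej in E else done).remove(Ej)
        merged = Ei + Ej                       # 'anonymize' stand-in
        if len(merged) < k:
            E.append(merged)                   # steps (13)-(16): still too small
        else:
            done.append(merged)                # steps (17)-(26), simplified
    return done

classes = kaca([1, 1, 2, 5, 5, 5, 9], k=3)
print(all(len(c) >= 3 for c in classes))  # True
```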

The KACA algorithm is a bottom-up clustering algorithm. Let |E| be the number of equivalence classes in step (2). Each iteration needs to randomly choose one equivalence class (step (10)), which takes O(1); calculate the pairwise distances (step (11)), which takes O(|E|) time; find the closest equivalence class (step (12)), which takes O(|E|) time; and anonymize the equivalence classes (steps (13)–(26)), which takes O(1). Thus, the runtime of one iteration is O(|E|). As KACA requires O(|E|) iterations, the overall runtime is O(|E|²). In the worst case, the quasi-identifier values of every tuple in the initial T are different, so |E| = n. In this case, the time complexity of KACA is O(n²).

TSCKA combines the above two steps, as described in Algorithm 3.


Algorithm 3. TSCKA (T,QI,k)

Input: T: an original dataset; QI: a quasi-identifier of T; k: the k-anonymity constraint.
Output: T′: a k-anonymized dataset of T.
Procedure:

(1) T′ = ∅;
(2) set c = (n/t)^(1/2); // t is the iteration number of c_partition
(3) {T1, T2, . . ., Tc} = c_partition(T, c);
    // c_partition(T, c) partitions the large dataset T into c clusters
(4) for each Ti (i = 1..c) do
(5)   if |Ti| ≥ k then
(6)     T′i = KACA(Ti, QI, k); // call KACA to k-anonymize the cluster Ti
(7)     T′ = T′ ∪ T′i;
(8)   else
(9)     E = suppress all tuples in Ti;
(10)    T′ = T′ ∪ E;
(11)  end if
(12) end for
(13) return T′;

Let n be the size of dataset T, c be the number of clusters produced by c_partition, and t be the iteration number of c_partition. The c_partition algorithm partitions the whole dataset into the clusters {T1, T2, . . ., Tc} in step (3), whose time complexity is O(c·n·t).

Then KACA is used to k-anonymize each Ti in {T1, T2, . . ., Tc} obtained from step (3). The average size of a cluster Ti is n/c. We consider the average case, i.e., the average |Ei| of each Ti is n/c, so the time complexity of KACA on Ti is O((n/c)²). TSCKA executes the KACA algorithm c times iteratively in steps (4)–(12), whose time complexity is O((n/c)²)·c.

Therefore, the time complexity of TSCKA is O(c·n·t) + O((n/c)²)·c = O(c·n·t + n²/c). Since c and t are far smaller than n, the complexity of TSCKA is O(n²).

We now analyze how to set the initial center number c from the time complexity of TSCKA to improve efficiency. Let f(c) be the time complexity function of TSCKA. From the above, f(c) = c·n·t + n²/c, so f′(c) = n·t − n²/c². To minimize f(c), n·t − n²/c² should be 0, so c = (n/t)^(1/2).

For example, let T be an original dataset containing 50,000 tuples, and let the iteration number t be 5. We get c = (50,000/5)^(1/2) = 100. In the first phase, TSCKA uses the c_partition algorithm to partition the whole dataset T into 100 clusters, whose running time is about O(c·n·t) = 25,000,000. In the second phase, TSCKA uses KACA to anonymize the 100 clusters, whose running time is about O(n²/c) = 25,000,000. The whole runtime is 50,000,000. If we use KACA to anonymize the original data directly, the runtime is O(n²) = 2,500,000,000. So TSCKA is much more efficient than KACA.
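The choice c = (n/t)^(1/2) and the runtime comparison above can be checked with a few lines (illustrative only; the numbers are the abstract operation counts used in the text, not measured times):

```python
n, t = 50_000, 5
c = int((n / t) ** 0.5)               # minimizes f(c) = c*n*t + n^2/c
tscka_cost = c * n * t + n ** 2 / c   # two-phase cost: c_partition + KACA per cluster
kaca_cost = n ** 2                    # direct KACA on the full dataset
print(c, tscka_cost, kaca_cost)       # 100 50000000.0 2500000000
```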

5. Experiments

In this section, we evaluate the MAGE method and the TSCKA algorithm in terms of efficiency and data utility under various parameter settings. Data utility is measured by the information loss of the anonymous data, as defined in Section 3.3: the more information loss, the worse the data utility of the anonymous data.

5.1. Experimental setup and data

We use the Adult dataset from the UCI Machine Learning Repository, downloadable at http://archive.ics.uci.edu/ml/datasets/Adult. The Adult dataset has become a benchmark for comparing anonymization algorithms [5,6,9–11,25,28,29,31]. After eliminating the tuples with unknown values, the resulting dataset contains 45,222 tuples. We used nine attributes of the dataset, as shown in Table 4.

All the experiments were conducted on a PC with a 2.30 GHz Intel Core i5 CPU and 4.0 GB RAM. Each algorithm takes three parameters: the k-anonymity parameter k, the dataset size |D|, and the quasi-identifier size |QI|. The default parameters are k = 5 or k = 10, |QI| = 6 or |QI| = 8, and |D| = 45,222. For each experiment, we report the performance of all the algorithms with one of these parameters varying while the others are fixed. We set c = (n/6)^(1/2) in step (2) of the TSCKA algorithm, because we observed that the iteration number t was about 6 in our experiments. From Section 4, we know that when c = (n/6)^(1/2), the TSCKA algorithm is the most efficient.

5.2. Comparison of methods

In this subsection, we compare the MAGE method with a generalization method and a microaggregation method in terms of efficiency and data utility. We implemented the generalization and microaggregation algorithms by modifying the TSCKA algorithm, denoted as the TSCKA-GE algorithm and the TSCKA-MA algorithm, respectively. To make the comparison clear, in this section we denote the proposed TSCKA that implements MAGE as TSCKA-MAGE.

5.2.1. Efficiency

In this set of experiments, we compare the TSCKA-MAGE algorithm with the TSCKA-GE algorithm and the TSCKA-MA algorithm in terms of execution time. Fig. 2 plots the execution time over various k for the three algorithms. Fig. 2 shows that each algorithm's execution time is similar over various k. This is because the time complexity of TSCKA is O(c·n·t + n²/c) (see Section 4), which is independent of k. Fig. 3 plots the execution time over various dataset sizes; it indicates that the execution time of the three algorithms increases as the dataset size increases, because with a larger dataset more tuples need to be clustered, which takes more time. Fig. 4 plots the execution time over various sizes of quasi-identifier. Figs. 2–4 also indicate that under the same conditions, the TSCKA-MA algorithm is the most efficient, the TSCKA-MAGE algorithm is the second most efficient, and the TSCKA-GE algorithm is the slowest. This is because calculating the generalization distance (see Definition 6) takes more time than calculating the microaggregation distance (see Definition 9). Both the TSCKA-GE algorithm and the TSCKA-MAGE algorithm need to calculate the generalization distance, and hence they take more execution time. The TSCKA-MA algorithm only calculates the microaggregation distance, so it is the most efficient. The TSCKA-GE algorithm takes more time to calculate the generalization distance, and hence it is slower than the TSCKA-MAGE algorithm.

[Fig. 2. Execution time for various k: (a) |QI|=6, |D|=45222; (b) |QI|=8, |D|=45222.]

[Fig. 3. Execution time for various size datasets: (a) k=5, |QI|=6; (b) k=10, |QI|=6.]

[Fig. 4. Execution time for various sizes of quasi-identifier: (a) k=5, |D|=45222; (b) k=10, |D|=45222.]

[Fig. 5. Information loss for various k: (a) |QI|=6, |D|=45222; (b) |QI|=8, |D|=45222.]

[Fig. 6. Information loss for various size datasets: (a) k=5, |QI|=6; (b) k=10, |QI|=6.]

[Fig. 7. Information loss for various size of quasi-identifier: (a) k=5, |D|=45222; (b) k=10, |D|=45222.]

[Fig. 8. Execution time for various k (|QI|=6, |D|=45222).]

5.2.2. Data utility

We evaluate the information loss of the anonymous data generated by the TSCKA-MAGE, TSCKA-GE and TSCKA-MA algorithms using formula (14), where a = b = 0.5. Fig. 5 plots the information loss of the anonymous dataset over various k for the three algorithms; it indicates that the information loss of all three algorithms increases as k increases, because with a larger k, more tuples need to be identical in each equivalence class, which generates more information loss. Fig. 6 plots the information loss over various dataset sizes and indicates that the information loss of the three algorithms increases as the dataset grows. Fig. 7 plots the information loss over various sizes of quasi-identifier. Figs. 5–7 also indicate that under the same conditions, the TSCKA-MAGE algorithm is much better than the TSCKA-GE algorithm and similar to the TSCKA-MA algorithm. In practical applications, k generally is no more than 10, and the information loss of the TSCKA-MAGE algorithm is then generally no more than that of the TSCKA-MA algorithm. In addition, TSCKA-MAGE can preserve more statistical characteristics of the categorical data than the TSCKA-MA algorithm, so it is better than the TSCKA-MA algorithm in terms of data utility.

5.3. Comparison of other algorithms

In this subsection, we compare the TSCKA algorithm with the well-known Incognito algorithm [11] and the KACA algorithm [6] in terms of efficiency and data utility.

5.3.1. Efficiency

Fig. 8 plots the execution time over various k for the three algorithms. Fig. 8 shows that the execution times of the TSCKA algorithm and the Incognito algorithm change little as k increases. However, the execution time of the KACA algorithm increases with k, because KACA is a bottom-up clustering algorithm whose clustering time grows with k. Fig. 9 plots the execution time over various dataset sizes, and Fig. 10 plots the execution time over various sizes of quasi-identifier. It is reasonable that the execution times of the three algorithms increase with the size of the dataset or the size of the quasi-identifier. Figs. 8–10 also indicate that under the same conditions, the Incognito algorithm is the most efficient, the TSCKA algorithm is close to the Incognito algorithm, and the KACA algorithm is the least efficient.

[Fig. 11. Information loss for various k (|QI|=6, |D|=45222).]

5.3.2. Data utility

In this set of experiments, we compare the TSCKA algorithm with the Incognito algorithm and the KACA algorithm in terms of data utility. The Incognito algorithm is a global recoding generalization algorithm and the KACA algorithm is a local recoding generalization algorithm, so we use the sum of the generalization distortions of the tuples, formula (11), to evaluate the information loss for the three algorithms. Fig. 11 plots the information loss of the anonymous dataset over various k for the three algorithms. Fig. 12 plots the information loss over various dataset sizes. Fig. 13 plots the information loss over various sizes of quasi-identifier. It is reasonable that the information loss increases with the parameter k, the size of the dataset, or the size of the quasi-identifier.

[Fig. 9. Execution time for various size of dataset (k=10, |QI|=6).]

[Fig. 10. Execution time for various size of quasi-identifier (k=10, |D|=45222).]

Figs. 11–13 also indicate that under the same conditions, the TSCKA algorithm is much better than the Incognito algorithm and similar to the KACA algorithm in terms of data utility.

From the experiments in this subsection, we conclude that TSCKA achieves a better trade-off between data utility and algorithm efficiency than Incognito and KACA.

[Fig. 12. Information loss for various size of dataset (k=10, |QI|=6).]

[Fig. 13. Information loss for various size of quasi-identifier (k=10, |D|=45222).]


6. Related work

Data anonymization has become a major approach to privacy-preserving data publishing; its idea is to divide a given dataset into several disjoint groups and release some general information about these groups that satisfies certain anonymity constraints. Existing work in this area can be classified into two subtopics: anonymization techniques and anonymity models. We briefly review the work on these two subtopics; a more in-depth survey of privacy-preserving models and anonymization techniques can be found in [10].

Anonymization techniques can be broadly classified into three categories: generalization, microaggregation, and anatomy. In 2002, Sweeney [1] presented a theoretical MinGen (Minimal Generalization) algorithm based on generalization/suppression techniques. Many practical generalization/suppression algorithms followed, which can be divided into two categories: global recoding and local recoding. LeFevre et al. [11] proposed a full-domain generalization k-anonymity algorithm, Incognito, which is efficient but yields low data utility. Li et al. [6] introduced a measurement method of weighted hierarchical distance and proposed a local recoding k-anonymity clustering algorithm, which achieves anonymous data with high utility but is less efficient than Incognito. Xiao and Tao [12] proposed a personalized anonymity approach that increases data utility by minimizing the generalization needed to satisfy personal privacy requirements. Jiang and Clifton [13] investigated k-anonymity in a distributed framework. Byun et al. [14] and Truta and Campan [15] proposed k-anonymity algorithms for incremental datasets. Li and Li [16] investigated generalization modes and proposed an optimal generalization algorithm.

However, generalization has some defects in preserving numerical semantics [4]. Recently, microaggregation techniques have been introduced to k-anonymize microdata. Microaggregation was first used to anonymize numerical microdata and was later extended to categorical data [17]. Oganian and Domingo-Ferrer [18] proved that the multivariate microaggregation problem is NP-hard. Therefore, several heuristic methods have been proposed, which can be categorized into two classes: fixed-size and variable-size microaggregation algorithms. The former are efficient but may lead to mistaken clustering; they include the MD algorithm [19], the MDAV algorithm [20], the MDAV-generic algorithm, etc. The latter can generate better-quality k-anonymous data but are less efficient; they include the MST algorithm [21], the TFRP algorithm [22], the l-Approx algorithm [23], a density-based microaggregation algorithm [24], an evolutionary algorithm [25], etc. However, microaggregation cannot process categorical data perfectly, since it uses a statistic such as the mode or median as the centroid for categorical data, which loses more semantics when k-anonymizing categorical data.

Anatomy [26] is also a popular method to anonymize microdata; its idea is to release the quasi-identifier values and sensitive values directly in two separate tables. However, releasing the QI-values directly may suffer from a higher breach probability than generalization. To overcome these drawbacks, Tao et al. [27] proposed ANGEL, a new anonymization method that is as effective as generalization in privacy protection and retains higher data utility. Li et al. [28] proposed slicing, which anonymizes microdata by partitioning them horizontally and vertically. Mogre et al. [29] concluded that slicing preserves data utility better than generalization and, in addition, prevents membership disclosure.

Another subtopic in this area is anonymity models. K-anonymity is the most famous model. However, k-anonymity cannot resist homogeneity attacks and background knowledge attacks, so Machanavajjhala et al. [30] introduced the l-diversity principle, which requires the diversity of sensitive values in each equivalence class to be no less than l, so as to make it harder to link a sensitive value to a specific individual. The works [31,32] extended the l-diversity model to numerical sensitive attributes. Wong et al. [7] proposed the (a,k)-anonymity model, which resists inference attacks by controlling the frequencies of sensitive values in each equivalence class. Li et al. [33] proposed the t-closeness framework to remedy some defects of l-diversity; it requires that the distribution of sensitive values in any equivalence class be close to the distribution of the values in the whole table. Sacharidis et al. [34] introduced the concept of k-join-anonymity (KJA), which utilizes public databases during the anonymization process to reduce information loss. Nergiz et al. [35] extended the definitions of k-anonymity to multiple relations and proposed two new clustering algorithms to achieve multirelational anonymity.

Our work belongs to the subtopic of anonymization techniques. As discussed in Section 1, existing generalization and microaggregation techniques have some defects in anonymizing mixed data. MAGE, integrating the advantages of generalization and microaggregation, retains more semantic information when anonymizing mixed data.

7. Conclusions

In this paper, we propose the MAGE method to anonymize mixed microdata and an efficient anonymization algorithm, TSCKA. The MAGE method can retain more semantics for mixed data than a microaggregation method and a generalization method. The TSCKA algorithm is a generic k-anonymization algorithm, which can be easily implemented with generalization, microaggregation, and MAGE methods. Experiments show that the MAGE method can anonymize mixed data effectively. We also show that the TSCKA algorithm is much better than the well-known Incognito and KACA algorithms in terms of the trade-off between data utility and algorithm efficiency.

This work also suggests several directions for future investigation. In this paper, we focused on implementing k-anonymity using MAGE; extending the work to implement l-diversity and other enhanced anonymity models using MAGE is one direction. In addition, extending the MAGE method to dynamic datasets or republication scenarios is another direction.

Acknowledgments

We thank the anonymous reviewers for constructive comments, which led to a substantial improvement of this paper. This work has been supported by the National Natural Science Foundation of China under Grant Nos. 61170108, 6110019, and 61272130, and the Zhejiang Provincial Natural Science Foundation of China under Grant Nos. Y1100161 and Q13F020007.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.knosys.2013.10.009.

References

[1] L. Sweeney, k-Anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5) (2002) 557–570.

[2] P. Samarati, Protecting respondents' identities in microdata release, IEEE Transactions on Knowledge and Data Engineering 13 (6) (2001) 1010–1027.


[3] C.C. Aggarwal, On k-anonymity and the curse of dimensionality, in:Proceedings of International Conference on Very Large Data Bases, 2005, pp.901–909.

[4] J. Domingo-Ferrer, V. Torra, Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Journal of Data Mining andKnowledge Discovery 11 (2) (2005) 195–212.

[5] C. Ting-ting, H. Jian-min, Y. Hui-qun, Y. Juan, An efficient microaggregationalgorithm for mixed data, in: International Conference on Computer Scienceand Software Engineering, 2008, pp.1053–1056.

[6] J. Li, R.C.W. Wong, A.W.C. Fu, J. Pei, Achieving k-anonymity by clustering inattribute hierarchical structure, in: Proceedings of International Conference onData Warehousing and Knowledge Discovery, 2006, pp. 405–416.

[7] R.C.W. Wong, J. Li, A.W.C. Fu, K. Wang, (a,k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing, in: Proceedings ofthe 12th ACM SIGKDD International Conference on Knowledge Discovery andData Mining, 2006, pp. 754–759.

[8] J. Domingo-Ferrer, J.M. Mateo-Sanz, Practical data-oriented microaggregationfor statistical disclosure control, IEEE Transactions on Knowledge and DataEngineering 14 (1) (2002) 189–201.

[9] Z. Huang, Extensions to the k-means algorithm for clustering large data setswith categorical values, Data Mining and Knowledge Discovery 2 (3) (1998)283–304.

[10] B. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a surveyon recent developments, ACM Computing Surveys 42 (4) (2010) 1–53.

[11] K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Incognito: Efficient full-domain k-anonymity, in: Proceedings of ACM SIGMOD International Conference onManagement of Data, 2005, pp. 49–60.

[12] X. Xiao, Y. Tao, Personalized privacy preservation, in: Proceedings of ACMSIGMOD International Conference on Management of Data, 2006, pp. 229–240.

[13] W. Jiang, C. Clifton, A secure distributed framework for achieving k-anonymity,International Journal on Very Large Data Bases 15 (4) (2006) 316–333.

[14] J.W. Byun, Y. Sohn, E. Bertino, N. Li, Secure anonymization for incrementaldatasets, in: 3rd VLDB Workshop on Secure Data Management, 2006, pp. 48–63.

[15] T.M. Truta, A. Campan, K-Anonymization incremental maintenance andoptimization technique, in: Proceedings of ACM Symposium on AppliedComputing, 2007, pp. 380–387.

[16] T. Li, N. Li, Towards optimal k-anonymization, Data & Knowledge Engineering65 (1) (2008) 22–39.

[17] V. Torra, Microaggregation for categorical variables: a median based approach,in: Workshop on Privacy in Statistical Database, 2004, pp. 162–174.

[18] A. Oganian, J. Domingo-Ferrer, On the complexity of optimal microaggregationfor statistical disclosure control, Statistical Journal of United Nations EconomicCommission for Europe 18 (4) (2001) 345–354.

[19] J. Domingo-Ferrer, J.M. Mateo-Sanz, Practical data-oriented microaggregationfor statistical disclosure control, IEEE Transactions on Knowledge and DataEngineering 14 (1) (2002) 189–201.

[20] A. Hundepool, A.V.D. Wetering, R. Ramaswamy, l-ARGUS Version 4.1 Softwareand User’s Manual, Statistics Netherlands, Voorburg NL[EB/OL], 2007. <http://neon.vb.cbs.nl/casc>.

[21] M. Laszlo, S. Mukherjee, Minimum spanning tree partitioning algorithm formicroaggregation, IEEE Transactions on Knowledge and Data Engineering 17(7) (2005) 902–911.

[22] C.C. Chang, Y.C. Li, W.H. Huang, TFRP: an efficient microaggregation algorithm for statistical disclosure control, Journal of Systems and Software 80 (11) (2007) 1866–1878.

[23] J. Domingo-Ferrer, F. Sebé, A. Solanas, A polynomial-time approximation to optimal multivariate microaggregation, Computers and Mathematics with Applications 55 (4) (2008) 714–732.

[24] J.L. Lin, T.H. Wen, J.C. Hsieh, P.C. Chang, Density-based microaggregation for statistical disclosure control, Expert Systems with Applications 37 (4) (2010) 3256–3263.

[25] J. Jimenez, J. Mares, V. Torra, An evolutionary approach to enhance data privacy, Soft Computing 15 (7) (2011) 1301–1311.

[26] X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, in: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006, pp. 139–150.

[27] Y. Tao, H. Chen, X. Xiao, S. Zhou, D. Zhang, ANGEL: enhancing the utility of generalization for privacy preserving publication, IEEE Transactions on Knowledge and Data Engineering 21 (7) (2009) 1073–1087.

[28] T. Li, N. Li, J. Zhang, I. Molloy, Slicing: a new approach for privacy preserving data publishing, IEEE Transactions on Knowledge and Data Engineering 24 (3) (2012) 561–574.

[29] N.V. Mogre, G. Agarwal, P. Patil, A review on data anonymization technique for data publishing, International Journal of Engineering Research &amp; Technology 1 (10) (2012).

[30] A. Machanavajjhala, J. Gehrke, D. Kifer, l-diversity: privacy beyond k-anonymity, in: Proceedings of the 22nd International Conference on Data Engineering, 2006, pp. 24–36.

[31] Q. Zhang, N. Koudas, D. Srivastava, T. Yu, Aggregate query answering on anonymized tables, in: Proceedings of International Conference on Data Engineering, 2007, pp. 116–125.

[32] J. Li, Y. Tao, X. Xiao, Preservation of proximity privacy in publishing numerical sensitive data, in: Proceedings of ACM SIGMOD International Conference on Management of Data, 2008, pp. 473–486.

[33] N. Li, T. Li, S. Venkatasubramanian, t-closeness: privacy beyond k-anonymity and l-diversity, in: Proceedings of the 23rd International Conference on Data Engineering, 2007, pp. 106–115.

[34] D. Sacharidis, K. Mouratidis, D. Papadias, K-anonymity in the presence of external databases, IEEE Transactions on Knowledge and Data Engineering 22 (3) (2010) 392–403.

[35] M.E. Nergiz, C. Clifton, A.E. Nergiz, Multirelational k-anonymity, IEEE Transactions on Knowledge and Data Engineering 21 (8) (2009) 1104–1117.