8/7/2019 CA2010_Privacy Preserving Data Mining_cam_ready
http://slidepdf.com/reader/full/ca2010privacy-preserving-data-miningcamready 1/9
A Neural-Network Clustering-based Algorithm for
Privacy Preserving Data Mining
S. Tsiafoulis¹, V. C. Zorkadis¹ and D. A. Karras²
¹ Data Protection Authority, 1-3 Kifisias Av., 11523 Athens, Greece
² Chalkis Institute of Technology, Automation Dept., Psachna, Evoia, Hellas (Greece), P.C. 34400, [email protected], [email protected]
Abstract. The increasing use of fast and efficient data mining algorithms in
huge collections of personal data, facilitated through the exponential growth of
technology, in particular in the field of electronic data storage media and
processing power, has raised serious ethical, philosophical and legal issues
related to privacy protection. To cope with these concerns, several privacy
preserving methodologies have been proposed, classified in two categories,
methodologies that aim at protecting the sensitive data and those that aim at
protecting the mining results. In our work, we focus on sensitive data protection
and compare existing techniques according to their anonymity degree achieved,
the information loss suffered and their performance characteristics. The l-diversity principle is combined with k-anonymity concepts, so that background information cannot be exploited to successfully attack the privacy of the data subjects the data refer to. Based on Kohonen Self Organizing Feature Maps
(SOMs), we firstly organize data sets in subspaces according to their
information theoretical distance to each other, then create the most relevant
classes paying special attention to rare sensitive attribute values, and finally
generalize attribute values to the minimum extent required so that both the data
disclosure probability and the information loss are possibly kept negligible.
Furthermore, we propose information theoretical measures for assessing the
anonymity degree achieved and empirical tests to demonstrate it.
Keywords: Privacy Enhancing Technologies, SOM, k-anonymity, l-diversity
1. Introduction
Data contained in databases may be personal data, i.e. information that directly or
indirectly identifies an individual, as for instance an address and date of birth that can be linked with publicly available datasets and background knowledge to reveal the identity of an individual. Such a set of attributes is called a Quasi-Identifier (QI) set.
Data-mining a database can lead to the disclosure of personal data and the
identification of data subjects, i.e. persons the data refer to. But on the other hand
exploiting such databases may offer many benefits to the community and support the
policy and action plan development process, as for instance in the case of a pandemic. To address these seemingly contradictory requirements, privacy preserving data mining
techniques have been proposed [1, 2, 3, 4, 5, 6, 10].
Existing privacy-preserving data mining algorithms can be classified into two
categories: algorithms that protect the sensitive data itself in the mining process, and
those that protect the sensitive data mining results [1]. The most popular algorithms in
the data mining research community address k-anonymity and l-diversity. They belong to the first category and apply generalization and suppression methods to the original datasets in order to preserve the anonymity of the individuals or entities the data refer to.
K-anonymity requires each tuple in the published table to be indistinguishable from at least k-1 other tuples [2]. Tuples with the same or close QI values form an
equivalence class. However, k-anonymity cannot protect against homogeneity and
background knowledge attacks [3]. To address these shortcomings, the l-diversity
principle was proposed [3], which requires that different values of the sensitive
attributes are well represented in each equivalence class, thus preventing an attacker
from guessing the sensitive attribute value for a QI set with probability greater than 1/l. Distinct l-diversity requires that for each equivalence class ei there are at least l distinct values in ei[S], where ei[S] is the multi-set of ei's sensitive attribute values [2, 3].
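For concreteness, both definitions can be checked programmatically. The sketch below (toy data with hypothetical attribute names, not the Adult data set) groups a generalized table into equivalence classes by QI values and verifies k-anonymity and distinct l-diversity:

```python
def equivalence_classes(table, qi):
    """Group tuples by their quasi-identifier (QI) values."""
    classes = {}
    for row in table:
        key = tuple(row[a] for a in qi)
        classes.setdefault(key, []).append(row)
    return classes

def is_k_anonymous(table, qi, k):
    """Every tuple must share its QI values with at least k-1 other tuples."""
    return all(len(rows) >= k for rows in equivalence_classes(table, qi).values())

def is_distinct_l_diverse(table, qi, sa, l):
    """Each equivalence class e must contain at least l distinct values in e[S]."""
    return all(len({row[sa] for row in rows}) >= l
               for rows in equivalence_classes(table, qi).values())

# Toy generalized table: QI = (age range, zip prefix), SA = disease
table = [
    {"age": "2*", "zip": "115**", "disease": "flu"},
    {"age": "2*", "zip": "115**", "disease": "asthma"},
    {"age": "3*", "zip": "116**", "disease": "flu"},
    {"age": "3*", "zip": "116**", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))                    # True
print(is_distinct_l_diverse(table, ["age", "zip"], "disease", 2))  # False: one class is homogeneous
```

The second class is 2-anonymous but fails distinct 2-diversity, which is exactly the homogeneity attack that l-diversity guards against.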
In table 1, clustering-, partition- and hierarchy-based algorithms for the implementation of k-anonymity and l-diversity are categorized with respect to their characteristics, attribute type, searching method and analysis approach used. Due to structural similarities of k-anonymity and l-diversity algorithms, most of the k-anonymity algorithms can be transformed easily into algorithms for l-diversity [3].
In our work, we use the Adult data set provided by the UC Irvine machine learning repository [4], so that our research results can be compared with those presented in the literature
(see section 2), since this database has been used widely in classification experiments.
It consists of 30162 complete records, with 6 numerical and 8 categorical attributes.
This paper is organized as follows. The next section is devoted to existing k-anonymity and l-diversity algorithms. In section 3, we propose a new algorithm for the k-anonymity and l-diversity problem; section 4 describes the coding of generalizations, and in section 5, we introduce measures and tests to evaluate the performance of the proposed algorithm and compare it with the performance of existing algorithms. Finally, we conclude the paper.
2. k-anonymity and l-diversity algorithms
In [5] two greedy algorithms are proposed. The first is clustering-based and conducts
a bottom-up search, while the second one is partition-based and works top-down. The
selection criterion for an attribute to be merged into an equivalence class is the normalized certainty penalty (NCP). By using this criterion, information loss and record
importance are taken into account. In the bottom-up search, at the beginning of the anonymization process each tuple is treated as an individual group. Each group whose population is less than k is merged with another group such that the combined group has the smallest NCP. This iterates until every group has at least k tuples. At the end of the process, each group that has more than 2k tuples is split into smaller groups such that each has at least k tuples. In the top-down approach, the two tuples that would cause the highest NCP if merged into the same group are selected first and form the two initial groups Gu and Gv. The remaining tuples are then considered in random order; the assignment of a tuple w depends on NCP(Gu, w) and NCP(Gv, w), where Gu and Gv are the groups formed so far, and w is assigned to the group that leads to the lower NCP. The partitioning procedure is applied recursively while a group has k or more tuples. If a group G ends up with fewer than k tuples, a group G' with population greater than 2k-|G| is searched, and k-|G| tuples are selected from G' such that NCP(G ∪ G') is minimized.
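A minimal sketch of the bottom-up variant follows, assuming numeric QI attributes only and the standard NCP definition (the width of the group's interval divided by the attribute's full domain range, summed over attributes and weighted by group size); the choice of merge partner is simplified relative to [5]:

```python
def ncp(group, ranges):
    """Normalized certainty penalty of a group over numeric QI attributes,
    weighted by the group size."""
    total = 0.0
    for a, (lo, hi) in ranges.items():
        vals = [t[a] for t in group]
        total += (max(vals) - min(vals)) / (hi - lo)
    return len(group) * total

def bottom_up(tuples, ranges, k):
    """Greedy bottom-up clustering: start with singleton groups and repeatedly
    merge an under-populated group into the neighbour giving the smallest NCP."""
    groups = [[t] for t in tuples]
    while any(len(g) < k for g in groups) and len(groups) > 1:
        g = next(g for g in groups if len(g) < k)
        rest = [h for h in groups if h is not g]
        best = min(rest, key=lambda h: ncp(g + h, ranges))
        best.extend(g)            # merge g into the cheapest partner
        groups.remove(g)
    return groups

data = [{"age": 25}, {"age": 27}, {"age": 40}, {"age": 41}]
ranges = {"age": (0, 100)}
groups = bottom_up(data, ranges, 2)
print([sorted(t["age"] for t in g) for g in groups])   # [[25, 27], [40, 41]]
```

The final 2k-splitting step is omitted for brevity; the sketch only illustrates how NCP drives the merge decisions.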
In [6], the algorithm starts with a fully generalized dataset, one in which every tuple is
identical to every other, and systematically specializes the dataset into one that is
minimally k-anonymous. This algorithm uses a tree search strategy to find the optimal
solution. An optimal solution is a generalization with the least information loss that still preserves privacy. Since this technique can involve scanning and sorting the entire dataset, the solution space may be enormous. The algorithm therefore uses pruning strategies to reduce the solution space, together with a dynamic search-rearrangement tree search algorithm named OPUS [7]. OPUS extends a systematic set-enumeration search strategy [8] with dynamic tree rearrangement and cost-based pruning for solving optimization problems. A node can be pruned only when the algorithm can determine that neither the node itself nor any of its descendants could be an optimal solution. For this determination, a lower bound on the cost of any node within the subtree rooted at it must be computed. If this lower bound exceeds the current best cost, the node is pruned. To compute the lower bound cost, the algorithm uses the discernibility metric and the classification metric [6].
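The pruning rule can be illustrated with a generic branch-and-bound skeleton. This is not the actual OPUS algorithm (dynamic tree rearrangement is omitted), only the cost-based pruning idea it relies on, applied to a toy set-enumeration search:

```python
def branch_and_bound(node, children, cost, lower_bound, best=None):
    """Explore a search tree, pruning any subtree whose lower-bound cost
    already exceeds the best complete solution found so far."""
    if best is None:
        best = [float("inf"), None]          # [best cost, best node]
    c = cost(node)
    if c < best[0]:
        best[0], best[1] = c, node
    for child in children(node):
        if lower_bound(child) < best[0]:     # otherwise: prune the whole subtree
            branch_and_bound(child, children, cost, lower_bound, best)
    return best[1]

# Toy problem: over subsets of {1, 2, 3}, minimize (sum - 4)^2.
# Since adding items only increases the sum, once sum >= 4 the current cost
# is a valid lower bound for all descendants.
items = [1, 2, 3]
children = lambda s: [s + (i,) for i in items if not s or i > s[-1]]
cost = lambda s: (sum(s) - 4) ** 2
lb = lambda s: cost(s) if sum(s) >= 4 else 0
print(branch_and_bound((), children, cost, lb))   # (1, 3)
```

The set-enumeration children function guarantees every subset is visited exactly once, which is the property [8] exploits.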
In [9], a genetic algorithm is proposed to find the optimal anonymization. Every possible anonymization is coded and represented by a chromosome. Then, based on the Genitor algorithm [11], it tries to find the optimal solution, that is, the chromosome with the best evaluation value. For the evaluation it uses the weighted certainty penalty criterion [5]. Also, the generalizations must be consistent with the restrictions set out in the valid-generalization notion described in section 4.
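A compact sketch of a Genitor-style steady-state GA follows. The rank-based parent selection and replace-the-worst policy are the essential Genitor ingredients; the toy fitness (counting 1-bits) and the operators are illustrative stand-ins for the anonymization-specific evaluation of [9]:

```python
import random

def genitor(pop, fitness, crossover, mutate, generations=200, seed=0):
    """Steady-state GA in the spirit of Genitor [11]: parents are chosen by
    linear rank (not raw fitness) and each child replaces the current worst
    individual, so the population quality never degrades."""
    rng = random.Random(seed)
    pop = sorted(pop, key=fitness)                 # ascending: pop[-1] is best
    n = len(pop)
    weights = [i + 1 for i in range(n)]            # linear rank bias toward the best
    for _ in range(generations):
        p1, p2 = rng.choices(pop, weights=weights, k=2)
        child = mutate(crossover(p1, p2, rng), rng)
        if fitness(child) > fitness(pop[0]):       # replace the worst individual
            pop[0] = child
            pop.sort(key=fitness)
    return pop[-1]

# Toy problem: maximize the number of 1-bits in a 12-bit chromosome.
fitness = lambda c: sum(c)
crossover = lambda a, b, rng: [a[i] if rng.random() < 0.5 else b[i] for i in range(len(a))]
def mutate(c, rng):
    c = list(c)
    c[rng.randrange(len(c))] ^= 1                  # flip one random bit
    return c

rng0 = random.Random(0)
pop = [[rng0.randint(0, 1) for _ in range(12)] for _ in range(10)]
best = genitor(pop, fitness, crossover, mutate)
print(sum(best))
```

Because the worst individual is always the one replaced, the best chromosome found so far can never be lost, which is why Genitor exerts strong selective pressure.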
BSGI is an algorithm for the implementation of l-diversity anonymization. It was influenced by Anatomy [12]: it first "bucketizes" the tuples according to their SA values, then recursively "selects" tuples from the biggest buckets and groups them into an equivalence class, and finally "incorporates" the residual tuples into proper equivalence classes. This technique also preserves the unique distinct l-diversity model, in which each equivalence class has to contain exactly l distinct SA values. To ensure that the equivalence classes BSGI creates are as many as possible, a method called Max-l is performed: the tuples are selected from the biggest buckets, and at the beginning of each selection-step iteration the buckets are sorted according to their sizes. In summary, the selection of records and creation of equivalence classes proceeds as follows:
step 1: The tuples of the dataset are bucketized according to their SA values into buckets Bi.
step 2: The buckets Bi are sorted according to their sizes.
step 3: One tuple is randomly selected from the first (biggest) bucket B1 and creates an equivalence class e.
step 4: From each of the next l-1 buckets Bi, the tuple that minimizes the information loss according to the NCP metric is selected and incorporated into e.
step 5: While there are at least l non-empty buckets, steps 2 to 4 are repeated.
step 6: All residual tuples are incorporated.
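The bucketize/select steps above can be sketched as follows. For brevity the sketch takes the first tuple of each bucket instead of the NCP-minimizing one, so it only approximates steps 3 and 4:

```python
from collections import defaultdict

def bsgi_select(tuples, sa, l):
    """Sketch of BSGI's bucketize/select loop: bucketize by sensitive attribute,
    then repeatedly draw one tuple from each of the l largest buckets to form an
    equivalence class; leftover tuples go to the incorporate step."""
    buckets = defaultdict(list)
    for t in tuples:                                  # step 1: bucketize by SA value
        buckets[t[sa]].append(t)
    classes, residual = [], []
    while True:
        order = sorted(buckets.values(), key=len, reverse=True)   # step 2: sort by size
        order = [b for b in order if b]
        if len(order) < l:                            # step 5 exhausted
            residual = [t for b in order for t in b]  # step 6: residual tuples
            break
        classes.append([b.pop(0) for b in order[:l]]) # steps 3-4: one tuple per bucket
    return classes, residual

data = [{"disease": d} for d in ["flu", "flu", "flu", "asthma", "asthma", "cancer"]]
classes, residual = bsgi_select(data, "disease", l=2)
print(len(classes), len(residual))   # 3 0
```

Every class drawn this way contains exactly l distinct SA values, which is the distinct l-diversity property BSGI preserves.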
3. A neural-network-based k-anonymity and l-diversity algorithm
BSGI, which was inspired by Anatomy [12], implements l-diversity by first "bucketizing" the tuples according to their SA values and then greedily grouping them into equivalence classes depending on the similarity of their QI attributes. As mentioned in section 2, it randomly selects a tuple from the largest bucket and tries to find l-1 other tuples in the l-1 next largest buckets. If some "better" tuples belong to the other |D|-l buckets, this technique introduces a limitation with possible information loss.
In our algorithm, tuples are first grouped according to their QI similarity by clustering the data set using Kohonen networks, more precisely Kohonen Self-Organising Feature Maps (SOMs). Then, the algorithm bucketizes the tuples according to their SA value within each group. In the next step, in each group it selects a tuple from the smallest bucket and searches for a similar tuple in the l-1 largest buckets of the same group to create an equivalence class. By first grouping the tuples according to their similarity, the probability of creating more uniform classes is significantly increased. This leads to better generalization with less information loss. In addition, by taking care of the rare tuples, the probability of suppressing rare and valuable tuples is minimized. By doing so, the proposed algorithm satisfies the "utility-based anonymization" principle stated in [5], so that crucial information is protected from being suppressed. Also, weights given to tuples improve clustering and give the ability to control the generalization's depth.
The algorithm thus exploits the benefits of neural networks for clustering the tuples. After the Kohonen network has clustered the data set, the tuples in each group are bucketized according to their SA value. The algorithm uses three labels for each tuple: one named QIG represents the group a tuple belongs to, another named SAL represents the bucket it belongs to, and a third, named SALR, represents the ranking of the bucket a tuple belongs to. These labels assist the third step of the algorithm, in which the equivalence classes are created. First, a tuple is selected from the smallest bucket. Then the algorithm searches each of the l-1 biggest buckets for the nearest neighbour and creates an equivalence class. This search takes place within the group the selected tuple belongs to. At the end of the third step, if a proper tuple could not be found in the same group, the algorithm searches the next most similar group.
Finally, the total certainty penalty NCP(T) introduced in section 2 and the discernibility metric C_DM defined in section 5 are computed for the evaluation of
the algorithm.
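The clustering stage can be illustrated with a minimal SOM written from scratch (a sketch, not the exact network configuration used in our experiments): the index of each tuple's best-matching unit plays the role of the QIG label.

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal Kohonen SOM: for each input, find the best-matching unit (BMU)
    and pull units near the BMU on the grid toward the input, with a learning
    rate and Gaussian neighbourhood that both decay over the epochs."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)
        sigma = sigma0 * (1 - e / epochs) + 1e-3
        for x in data:
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            g = np.exp(-dist2 / (2 * sigma ** 2))
            weights += lr * g[:, None] * (x - weights)
    return weights

def assign_groups(data, weights):
    """Label each tuple with the index of its BMU (the QIG label in the text)."""
    return np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in data])

# Two artificial QI clusters: tuples from different clusters should land on
# different map units, i.e. receive different QIG labels.
data = np.vstack([np.random.default_rng(1).normal(0.2, 0.02, (20, 2)),
                  np.random.default_rng(2).normal(0.8, 0.02, (20, 2))])
groups = assign_groups(data, train_som(data))
print(groups[:20], groups[20:])
```

In the full algorithm the SA bucketization and equivalence-class construction described above would then run separately inside each QIG group.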
4. Coding
Domain Hierarchy
The generalization process for the categorical attributes adopts the model presented in [9]. It is based on the domain generalization hierarchy [10] and extends it by imposing the valid-generalization restriction.
The domain ordering must be supplied by the user. This ordering should correspond
to the order in which the leaves are output by the preorder traversal of the hierarchy.
According to [9], "a generalization A is represented by a set of nodes SA in the taxonomy tree and it is valid if it satisfies the property that the path from every leaf node Y to the root encounters exactly one node P in SA. The value represented by the leaf node Y is generalized in A to the value represented by the node P."
Each value domain is denoted by the least value belonging to its generalization interval. Moreover, values inside a value domain must be ordered. This technique then imposes a total ordering over the set of all attribute domains such that the values in the i-th attribute domain (Σi) all precede the values in any subsequent domain (Σj, for j>i). The least value from each value domain is omitted. So, the empty set {} represents the most general anonymization, in which the induced equivalence classes consist of only a single equivalence class of identical tuples. Adding a new value to an existing anonymization specializes the data, while removing a value generalizes it.
Chromosomes
Each chromosome is formed by concatenating the bit strings corresponding to each potentially identifying column. If the attribute takes numeric values, then the length of the string that refers to this attribute is proportional to the granularity at which the generalization intervals are defined. A string representing a numeric attribute is formed according to the intervals of the generalization: the bit string is made up of one bit for each potential end point, in value order. A value of 1 for a bit implies that the corresponding value is used as an interval end point in the generalization [9]. For example, if the potential generalization intervals for an attribute are
[0-20] (20-40] (40-60] (60-80] (80-100]
then the chromosome 100111 provides that the values 0, 60, 80 and 100 are end points, so the generalized intervals are [0,60], (60,80] and (80,100].
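The decoding of such a numeric bit string can be sketched as follows (an illustrative helper, not code from [9]); note that the kept end points 0, 60, 80 and 100 induce three generalized intervals:

```python
def decode_numeric(bits, endpoints):
    """Decode a numeric-attribute bit string: bit i set to 1 means endpoints[i]
    is kept as an interval end point; consecutive kept end points delimit the
    generalized intervals (first interval left-closed, the rest half-open)."""
    kept = [e for b, e in zip(bits, endpoints) if b == "1"]
    return list(zip(kept, kept[1:]))

# The example from the text: potential end points 0, 20, 40, 60, 80, 100.
print(decode_numeric("100111", [0, 20, 40, 60, 80, 100]))
# [(0, 60), (60, 80), (80, 100)]
```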
For a categorical attribute with D distinct values which are generalized according to the taxonomy tree T, the number of bits needed for this attribute is D-1. The leaf nodes representing the distinct values are arranged in the order resulting from an in-order traversal of T. A value of 1 is assigned to a chromosome bit lying between two leaf nodes to indicate that those two leaf nodes are separated in the generalization. Because some of the newly generated chromosomes may not be valid, an additional step added to the Genitor algorithm modifies them into valid ones.
5. Performance evaluation of the proposed algorithm and its comparison
with existing algorithms
The discernibility metric assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it. This can be stated mathematically as follows:
C_DM(g, k) = Σ_{E : |E| ≥ k} |E|^2 + Σ_{E : |E| < k} |D|·|E|    (3.1)
where |D| is the size of the input dataset and E ranges over the equivalence classes of tuples in D induced by the anonymization g.
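Equation (3.1) translates directly into code; the class sizes below are hypothetical:

```python
def discernibility(class_sizes, dataset_size, k):
    """Discernibility metric C_DM (3.1): a retained class (size >= k) costs
    |E|^2, while a suppressed class (size < k) costs |D| * |E|."""
    return sum(s * s if s >= k else dataset_size * s for s in class_sizes)

# Hypothetical anonymization of a 10-tuple dataset into classes of sizes 4, 4, 2:
print(discernibility([4, 4, 2], dataset_size=10, k=3))  # 4^2 + 4^2 + 10*2 = 52
```

Larger equivalence classes are penalized quadratically, so the metric rewards anonymizations whose classes stay close to the minimum size k.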
The classification metric assigns no penalty to an unsuppressed tuple if it belongs to the majority class within its induced equivalence class, while all the other tuples are penalized a value of 1. More precisely:
C_CM(g, k) = Σ_{E : |E| ≥ k} |minority(E)| + Σ_{E : |E| < k} |E|    (3.2)
where E is an equivalence class and the minority function accepts an equivalence class as argument and returns those of its records which are in the minority class with respect to the class label. The first sum penalizes the records of retained classes that disagree with the majority label, while the second one penalizes suppressed tuples.
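Equation (3.2) can likewise be computed from the class labels of each equivalence class; the labels below are hypothetical:

```python
from collections import Counter

def classification_metric(classes, k):
    """Classification metric C_CM (3.2): in each retained class (size >= k),
    every tuple outside the majority label is penalized 1; every tuple of a
    suppressed class (size < k) is penalized 1."""
    penalty = 0
    for labels in classes:
        if len(labels) >= k:
            penalty += len(labels) - Counter(labels).most_common(1)[0][1]
        else:
            penalty += len(labels)
    return penalty

# Hypothetical equivalence classes carrying class labels, with k = 3:
classes = [[">50K", ">50K", "<=50K"],     # 1 minority tuple
           ["<=50K", "<=50K", "<=50K"],   # 0 minority tuples
           [">50K"]]                      # suppressed class: 1 tuple
print(classification_metric(classes, k=3))  # 2
```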
6. Conclusions
Table 2 summarizes the above algorithms according to their effectiveness. To be effective, an anonymization algorithm has to be fast enough to be practical. It must also be aware of the information loss it causes, so that the anonymized table remains useful. The anonymization and management of medical data must be treated with great care: the information such data sets include is very sensitive, so it has to be protected, and at the same time it is crucial for human health. Therefore, algorithms cannot be indifferent to rare attribute values and have to distinguish the more important values from the less important ones. Our algorithm is practical while taking care of these aspects.
Utility-Based Anonymization: k-anonymity. Numeric attributes: age, education. Categorical attributes: work class, marital-status, occupation, race, gender, native-country. Search/analysis: greedy; bottom-up clustering and top-down partitioning.
Data Privacy Through Optimal k-Anonymization: k-anonymity. Numeric attributes: age, education. Categorical attributes: work class, marital-status, occupation, race, gender, native-country. Search/analysis: exhaustive search; heuristic depth-first tree search.
Transforming Data to Satisfy Privacy Constraints: k-anonymity. Numeric attributes: age, education. Categorical attributes: work class, marital-status, occupation, race, gender, native-country. Search/analysis: genetic.
BSGI: l-diversity. Numeric attributes: age, final-weight, education, hours-per-week. Categorical attributes: marital-status, race, gender. Search/analysis: greedy clustering.
TABLE 1. Categorization of the Algorithms According to their Characteristics
Utility-Based Anonymization: complexity O(log2k·|T|^2) (bottom-up) and O(|T|^2) (top-down); average time 200 sec (bottom-up) and 60 sec (top-down). Platform: 512 MB RAM, 2.0 GHz Pentium IV, Microsoft Windows XP. Discernibility metric: 2x10^4 (k=25), 4x10^4 (k=100). Certainty metric: 17x10^5 (k=25), 15x10^6 (k=100).
Data Privacy Through Optimal k-Anonymization: average time 5400 sec. Platform: 2.8 GHz Intel Xeon (only one processor used), Linux OS (kernel 2.4.20). Discernibility metric: 15x10^6 (k=25), 18x10^6 (k=100). Classification metric: 5320 (k=25), 5460 (k=100).
Transforming Data to Satisfy Privacy Constraints: average time 18 hours (15060 records). Platform: 1 GB RAM, 1 GHz Pentium III, IBM 6868 Intellistation.
BSGI: complexity O(|T|^2); average time 10-20 sec. Platform: 1 GB RAM, 2.8 GHz Pentium D, Microsoft Windows Server 2003. Discernibility metric: 10x10^4 (l=4), 12x10^4 (l=7). Certainty metric: 2x10^3 (l=4), 3x10^3 (l=7).
TABLE 2. General Characteristics According to the Effectiveness of the Algorithms
References
[1] A. Gkoulalas-Divanis and V. S. Verykios, An Overview of Privacy Preserving Data Mining. ACM Crossroads, 15(4), Article 6, June 2009.
[2] Yu Liu, D. L., Chi Wang, Jianhua Feng, Qiao Deng, Yang Ye, BSGI: An Effective Algorithm towards Stronger l-Diversity. Turin, Italy, 2008, pp. 19-32.
[3] A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam, l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 3, March 2007.
[4] UCI, Irvine Machine Learning Repository.
[5] Jian Xu, Wei Wang, Jian Pei, Xiaoyuan Wang, Baile Shi, Ada Wai-Chee Fu, Utility-Based Anonymization Using Local Recoding. 2006.
[6] R. Bayardo and R. Agrawal, Data Privacy through Optimal k-Anonymization. Proceedings of the 21st International Conference on Data Engineering, 2005.
[7] G. I. Webb, OPUS: An Efficient Admissible Algorithm for Unordered Search. 1995.
[8] R. Rymon, Search Through Systematic Set Enumeration. 1992.
[9] V. S. Iyengar, Transforming Data to Satisfy Privacy Constraints. 2002.
[10] L. Sweeney, Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. 2002.
[11] D. Whitley, The Genitor Algorithm and Selective Pressure: Why Rank-Based Allocation of Reproductive Trials Is Best. In Proceedings of the Third International Conference on Genetic Algorithms, 1989, pp. 116-121.
[12] X. Xiao and Y. Tao, Anatomy: Simple and Effective Privacy Preservation. VLDB, 2006, pp. 139-150.