
Privacy Preserving Data Mining: a Data Quality Approach

Igor Nai Fovino and Marcelo Masera EUR 23070 EN - 2008

The Institute for the Protection and Security of the Citizen provides research-based, systems-oriented support to EU policies so as to protect the citizen against economic and technological risk. The Institute maintains and develops its expertise and networks in information, communication, space and engineering technologies in support of its mission. The strong cross-fertilisation between its nuclear and non-nuclear activities strengthens the expertise it can bring to the benefit of customers in both domains.

European Commission
Joint Research Centre
Institute for the Protection and Security of the Citizen

Contact information
Address: Via E. Fermi, I-21020 Ispra (VA), Italy
E-mail: igor.nai@jrc.it
Tel: +39 0332 786541
Fax: +39 0332 789576
http://ipsc.jrc.ec.europa.eu
http://www.jrc.ec.europa.eu

Legal Notice
Neither the European Commission nor any person acting on behalf of the Commission is responsible for the use which might be made of this publication.

Europe Direct is a service to help you find answers to your questions about the European Union.

Freephone number (*): 00 800 6 7 8 9 10 11

(*) Certain mobile telephone operators do not allow access to 00 800 numbers, or these calls may be billed.

A great deal of additional information on the European Union is available on the Internet. It can be accessed through the Europa server (http://europa.eu).

JRC 42700
EUR 23070 EN
ISSN 1018-5593

Luxembourg: Office for Official Publications of the European Communities

© European Communities, 2008
Reproduction is authorised provided the source is acknowledged.

Printed in Italy

Privacy Preserving Data Mining: A Data Quality Approach

Igor Nai Fovino, Marcelo Masera

16th January 2008


Contents

Introduction

1 Data Quality Concepts

2 The Data Quality Scheme
  2.0.1 An IQM Example

3 Data Quality Based Algorithms
  3.1 ConMine Algorithms
  3.2 Problem Formulation
  3.3 Rule hiding algorithms based on data deletion
  3.4 Rule hiding algorithms based on data fuzzification
  3.5 Considerations on Codmine Association Rule Algorithms
  3.6 The Data Quality Strategy (First Part)
  3.7 Data Quality Distortion Based Algorithm
  3.8 Data Quality Based Blocking Algorithm
  3.9 The Data Quality Strategy (Second Part)
  3.10 Pre-Hiding Phase
  3.11 Multi Rules Zero Impact Hiding
  3.12 Multi Rule Low Impact Algorithm

4 Conclusions


Introduction

Different approaches have been introduced by researchers in the field of privacy preserving data mining. These approaches have in general the advantage of requiring a minimal amount of input (usually the database, the information to protect and few other parameters), and then little effort is required from the user in order to apply them. However, some tests performed by Bertino, Nai and Parasiliti [10, 97] show that problems actually occur when applying PPDM algorithms in contexts in which the meaning of the sanitized data has a critical relevance. More specifically, these tests (see Appendix A for more details) show that the sanitization often introduces false information or completely changes the meaning of the information. These phenomena are due to the fact that current approaches to PPDM algorithms do not take into account two relevant aspects:

• Relevance of data: not all the information stored in the database has the same level of relevance, and not all the information can be dealt with in the same way. An example may clarify this concept. Consider a categorical database¹ and assume we wish to hide the rule AB → CD. A privacy-preserving association mining algorithm would retrieve all transactions supporting this rule and then change some items (e.g. item C) in these transactions in order to change the confidence of the rule. Clearly, if item C has low relevance, the sanitization will not affect much the quality of the data. If, however, item C represents some critical information, the sanitization can result in severe damage to the data.

• Structure of the database: information stored in a database is strongly influenced by the relationships between the different data items. These relationships are not always explicit. Consider as an example a Health Care database storing patients' records. The medicines taken by patients are part of this database. Consider two different medicines, say A and B,

and assume that the following two constraints are defined: "The maximum amount of medicine A per day must be 10ml. If the patient also takes

¹We define a categorical database as a database in which attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. An example is the Market Basket database.


medicine B, the maximum amount of A must be 5ml." If a PPDM algorithm modifies the data concerning the amount of medicine A without taking into account these constraints, the effects on the health of the patient could be potentially catastrophic.

Analyzing these observations, we argue that the general problem of current PPDM algorithms is related to Data Quality. In [70], a well known work in the data quality context, Orr notes that data quality and data privacy are in some way often related. More precisely, a high quality of data often corresponds to a low level of privacy. It is not simple to explain this correlation. In this introduction we try to give an intuitive explanation of this fact, in order to give an idea of our intuition.

Any "sanitization" of a target database has an effect on the data quality of the database. It is obvious to deduce that, in order to preserve the quality of the database, it is necessary to take into consideration some function Π(DB) representing the relevance and the use of the data contained in the DB. Now, taking another little step in the discussion, it is evident that the function Π(DB) has a different output for different databases². This implies that a PPDM algorithm trying to preserve the data quality needs to know specific information about the particular database to be protected before executing the sanitization. We believe then that data quality is particularly important in the case of critical databases, that is, databases containing data that can be critical for the life of society (e.g. medical databases, electrical control system databases, military databases, etc.). For this reason data quality must be considered as the central point of every type of data sanitization. In what follows we present a more detailed definition of DQ, we introduce a schema able to represent the Π function, and we present two algorithms based on the DQ concept.

²We use here the term "different" in its wide sense, i.e. different meaning of data, different use of data, different database architecture and schema, etc.

Chapter 1

Data Quality Concepts

Traditionally, DQ is a measure of the consistency between the data views presented by an information system and the same data in the real world [70]. This definition is strongly related to the classical definition of an information system as a "model of a finite subset of the real world" [56]. More in detail, Levitin and Redman [59] claim that DQ is the instrument by which it is possible to evaluate if data models are well defined and data values accurate. The main problem with DQ is that its evaluation is relative [94], in that it usually depends on the context in which data are used. DQ can be assumed as a complex characteristic of the data itself. In the scientific literature DQ is considered a multi-dimensional concept that in some environments involves both objective and subjective parameters [8, 104, 105]. Table 1.1 lists the most used DQ parameters [104].

Accuracy       Format            Comparability       Precision           Clarity
Reliability    Interpretability  Conciseness         Flexibility         Usefulness
Timeliness     Content           Freedom from bias   Understandability   Quantitativeness
Relevance      Efficiency        Informativeness     Currency            Sufficiency
Completeness   Importance        Level of Detail     Consistency         Usableness

Table 1.1: Data Quality parameters

In the context of PPDM, DQ has a similar meaning, with however an important difference. The difference is that in the case of PPDM the real world is the original database, and we want to measure how close the sanitized database is to the original one with respect to some relevant properties. In other words, we are interested in assessing whether, given a target database, the sanitization phase will compromise the quality of the mining results that can be obtained from the sanitized database. In order to better understand how to measure DQ, we introduce a formalization of the sanitization phase. In a PPDM process, a given database DB is modified in order to obtain a new database DB′.


Definition 1

Let DB be the database to be sanitized. Let Γ_DB be the set of all the aggregate information contained in DB. Let Ξ_DB ⊆ Γ_DB be the set of the sensitive information to hide. A transformation ξ : D → D, where D is the set of possible instances of a DB schema, is a perfect sanitization if Γ_ξ(DB) = Γ_DB − Ξ_DB.

In other words, the ideal sanitization is the one that completely removes the sensitive high level information, while preserving at the same time the other information. This would be possible in the case in which constraints and relationships between the data, and between the information contained in the database, do not exist; or, roughly speaking, assuming the information to be a simple aggregation of data, if the intersection between the different pieces of information is always empty. However, this hypothetical scenario, due to the complexity of modern databases, is not possible. In fact, as explained in the introduction of this section, we must take into account not only the high level information contained in the database, but we must also consider the relationships between different information or different data, that is, the Π function we mentioned in the previous section. In order to identify the possible parameters that can be used to assess DQ in the context of PPDM algorithms, we performed a set of tests. More in detail, we performed the following operations:

1. We built a set of different types of databases, with different data structures and containing information completely different from the meaning and the usefulness point of view.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damage (or loss of quality) to the informative asset of the database:

• Ambiguous transformation: this is the case in which the transformation introduces some uncertainty in the non sensitive information, which can then be misinterpreted. It is, for example, the case of aggregation algorithms and perturbation algorithms. It can even be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation: the sanitized database results incomplete. More specifically, some values of the attributes contained in the database are marked as "Blank". For this reason information may result incomplete and cannot be used. This is typically the effect of the Blocking PPDM algorithm class.

• Meaningless transformation: in this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit Constraints Violation: every database is characterized by some implicit constraints derived from the external world that are not directly represented in the database in terms of structure and relations (the case of medicines A and B presented in the previous section is an example). Transformations that do not consider these constraints risk compromising the database, introducing inconsistencies in it.

The results of these experiments highlight the fact that, in the context of privacy preserving data mining, only a small portion of the parameters shown in Table 1.1 is of interest. More in detail, the relevant parameters are those allowing us to capture a possible variation of meaning, or those indirectly related to the meaning of the information stored in a target database. Therefore we identify as the most appropriate DQ parameters for PPDM algorithms the following dimensions:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. In an informal way, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. Therefore a tuple t is complete if, after the sanitization, every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of those parameters, for use in the remainder of the discussion.

Let OD be the original database and SD be the sanitized database resulting from the application of the PPDM algorithm. Without loss of generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let od_i (sd_i) be the i-th tuple in OD (SD); then od_ik (sd_ik) denotes the k-th attribute of od_i (sd_i). Moreover, let n be the total number of attributes of interest; we assume that attributes in positions 1, ..., m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let sd_j be a tuple of SD. We say that sd_j is Accurate if ∄ od_i ∈ OD such that ((od_ik = sd_jk) ∀ k = 1..m) ∧ (∃ f ∈ {m+1, ..., n} : (od_if ≠ sd_jf) ∧ (sd_jf ≠ NULL)).

Definition 3

A tuple sd_j is Complete if (∃ od_i ∈ OD such that (od_ik = sd_jk) ∀ k = 1..m) ∧ (∄ f ∈ {m+1, ..., n} such that sd_jf = NULL).

By a "NULL" value we mean a value without meaning in a well defined context.

Let C be the set of the constraints defined on database OD; in what follows we denote with c_ij the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easy to extend the measure to complex constraints.

Definition 4

A tuple sd_k is Consistent if ∄ c_ij ∈ C such that c_ij(sd_ki) = false, i = 1..n.

Starting from these definitions it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non accurate items (i.e. the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non complete items (i.e. the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name                Short Explanation                            Expression
Accuracy Lack       The proportion of non accurate items in SD   λ_SD = |SD_nacc| / |SD|
Completeness Lack   The proportion of non complete items in SD   ϑ_SD = |SD_nc| / |SD|
Consistency Lack    The number of constraint violations in SD    ϖ_SD = N_c

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SD_nacc denotes the set of non accurate items, SD_nc denotes the set of non complete items and N_c denotes the number of constraint violations.
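To make these metrics concrete, the following is a minimal sketch of how λ_SD, ϑ_SD and ϖ_SD could be computed (Python; the dict-based tuple representation, the use of None for NULL, and every identifier are our assumptions, not taken from the report):

def dq_lacks(od, sd, key, constraints):
    """od, sd: lists of dicts (the tuples of the relation);
    key: names of the primary key attributes (positions 1..m);
    constraints: single-tuple predicates (the c_ij of Definition 4)."""
    index = {tuple(t[k] for k in key): t for t in od}
    non_key = [a for a in sd[0] if a not in key]

    nacc = nc = nviol = 0
    for s in sd:
        o = index.get(tuple(s[k] for k in key))
        # Definition 2: not accurate if a key-matching original tuple
        # exists whose non-key values differ from the non-NULL sanitized ones
        if o is not None and any(
                s[a] is not None and s[a] != o[a] for a in non_key):
            nacc += 1
        # Definition 3: complete requires a key match and no NULL attribute
        if o is None or any(s[a] is None for a in non_key):
            nc += 1
        # Definition 4: count every violated constraint
        nviol += sum(1 for c in constraints if not c(s))

    # lambda_SD, theta_SD and varpi_SD of Table 1.2
    return nacc / len(sd), nc / len(sd), nviol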

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sanitized database. These parameters are however not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are aware neither of the relevance of the information stored in the database nor of the relationships among these pieces of information.

It is then necessary to provide a formal description that allows us to highlight the aggregate information of interest for a target database, and the relevance of the DQ properties for each aggregate information (for some aggregate information not all the DQ properties have the same relevance). Moreover, for each aggregate information, it is necessary to provide a structure describing the relevance (in terms of the consistency, completeness and accuracy level required) at the attribute level, and the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG), used to represent the attributes involved in an aggregate information and their constraints, and of the Aggregation Information Schema (AIS). Moreover, we give a definition for an Aggregate Information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = ⟨Name, AW, AV, CW, CV, CSV, Slink⟩,

where

• Name is the attribute id

• AW is the accuracy weight for the target attribute

• AV is the accuracy value

• CW is the completeness weight for the target attribute

• CV is the completeness value

• CSV is the consistency value

• Slink is the list of simple constraints defined on the attribute

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have a different relevance in an aggregate information. For this reason a set of weights is associated with every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e. whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = ⟨Name, Constraint, CW, Clink, CSV⟩,

where

• Name is the constraint id

• Constraint describes the constraint using some logic expression

• CW is the weight of the constraint. It represents the relevance of this constraint in the AIS

• CSV is the number of violations of the constraint

• Clink is the list of complex constraints defined on the SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC = ⟨Name, Operator, CW, CSV, SCClink⟩, where

• Name is the complex constraint id

• Operator is the "merging" operator by which the simple constraints are used to build the complex one

• CW is the weight of the complex constraint

• CSV is the number of violations

• SCClink is the list of all the SCCs that are related to the CCC

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• a set of nodes NA, where each node is an Attribute Class

• a set of nodes SCC, where each node describes a Simple Constraint Class

• a set of nodes CCC, where each node describes a Complex Constraint Class

• a set of directed edges L_Nj,Nk, with L_Nj,Nk ∈ (NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)

As in the case of attributes and constraints, even at this level some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, ϖ, W_AIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS, and W_AIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information for which one is interested to measure the quality.

Definition 10

With respect to D, we define the ASSET (Aggregate Information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.
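To fix ideas before the example, the structures of Definitions 5-10 could be encoded, for instance, as in the following sketch (Python dataclasses; field types, defaults and class names are our assumptions, and the DMG's edge structure is flattened here into plain lists, while the field meanings mirror the definitions):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SimpleConstraint:                 # Definition 6
    name: str
    constraint: Callable[[dict], bool]  # the logic expression, over a tuple
    cw: float                           # relevance of a violation
    csv: int = 0                        # number of violations
    clink: List["ComplexConstraint"] = field(default_factory=list)

@dataclass
class ComplexConstraint:                # Definition 7
    name: str
    operator: Callable[[List[bool]], bool]  # the "merging" operator
    cw: float                           # weight of the complex constraint
    csv: int = 0                        # number of violations
    scc_link: List[SimpleConstraint] = field(default_factory=list)

@dataclass
class AttributeClass:                   # Definition 5
    name: str
    aw: float                           # accuracy weight
    cw: float                           # completeness weight
    av: float = 0.0                     # accuracy value
    cv: float = 0.0                     # completeness value
    csv: int = 0                        # consistency value
    slink: List[SimpleConstraint] = field(default_factory=list)

@dataclass
class AIS:                              # Definitions 8 and 9 (the DMG is
    name: str                           # simplified to the node lists below)
    attributes: List[AttributeClass]
    simple_constraints: List[SimpleConstraint]
    complex_constraints: List[ComplexConstraint]
    weight: float                       # W_AIS, relevance of the AIS

ASSET = List[AIS]                       # Definition 10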

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet and, more generally, public networks in order to monitor and remotely manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist a lot of problems related to system security and to privacy preservation.

[Figure 2.1: Example of DMG — a graph whose nodes are data attributes and constraints, annotated with accuracy, completeness, operator and constraint-weight labels.]

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. In fact, this new type of meter is able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behaviors of the people living in this house (i.e. from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, to calculate, for example, the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e. wrong power consumption estimation and, consequently, blackouts).

[Figure 2.2: Example ASSET. To every AIS contained in the ASSET, the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG, the three local DQ parameters are associated.]

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter and to the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter

• Local Code: it is the code of the city/region in which the meter is located

• City: it contains the name of the city in which the home hosting the meter is located

• Network: it is the code of the electric network

• Subnetwork: it is the code of the electric subnetwork

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter

• Average Energy Consumption: it contains the average energy used by the target home during a day

• Date: it contains the date of registration

• Energy Consumption per Hour: it is a vector of attributes containing the average energy consumption per hour

• Max Energy Consumption per Hour: it is a vector of attributes containing the maximum energy consumption per hour

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house

As we have already underlined, from this information it is possible to infer several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated to the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
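Using the dataclasses sketched after the definitions of Chapter 2, this constraint could be written, for instance, as follows (the weight value is illustrative):

local_code_range = SimpleConstraint(
    name="local_code_range",
    constraint=lambda t: 1 <= t["local_code"] <= 1000,
    cw=0.9,  # near 1: a violation would reveal the sanitization
)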


[Figure 2.3: Example of two DMG schemas (AIS Loc and AIS Cons) associated to a National Electric Power Grid database. AIS Loc links Local Code (value 1-1000), City, Network Number and Subnetwork Code (value 0-50), with constraints such as "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons links Energy Meter ID, Date, Average Energy Consumption, Max Energy Consumption and the per-hour attributes (En 1 am ... En 12 pm, Max 1 am ... Max 12 pm), with constraints such as "average energy used equal to" and "max equal to".]

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists also over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to think about a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence while avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy Consumption per Hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, there exists a relation between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
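As an illustration, the constraints just described could be checked as in the following sketch (attribute names and the tolerance are our assumptions):

def avg_consistent(row, tol=1e-6):
    """Complex constraint: the hourly values may be permuted or modified,
    as long as their mean matches the stored daily average."""
    hourly = row["energy_per_hour"]
    return abs(sum(hourly) / len(hourly) - row["avg_consumption"]) <= tol

def max_bounds_avg(row):
    """Simple constraints: each hourly maximum cannot be smaller than
    the corresponding hourly average."""
    return all(m >= a for m, a in zip(row["max_per_hour"],
                                      row["energy_per_hour"]))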

By the use of these two schemas we have described some characteristics of the information stored in the database related to its meaning and its usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist that are designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason there exists a large number of DM algorithms and application examples that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can easily be synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can easily be converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution in the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations


about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i_1, ..., i_n} be a set of items which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t, and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to rules that are both frequent and strong. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset R_h of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in R_h.
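Under this formulation, support and confidence can be computed directly; a minimal sketch follows (Python, modeling each transaction as the set of its supported items instead of a bitmap; all names are illustrative):

def support(db, itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in db) / len(db)

def confidence(db, antecedent, consequent):
    # conf(X => Y) = supp(X ∪ Y) / supp(X)
    return support(db, antecedent | consequent) / support(db, antecedent)

# example: four transactions over the items {0, 1, 2}
db = [{0, 1}, {0, 1, 2}, {0, 2}, {1, 2}]
print(support(db, {0, 1}))       # 0.5
print(confidence(db, {0}, {1}))  # 0.666...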

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D; the number |D| of transactions in D; a set Rh of rules to hide; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'_lr do {
2.    t.num_items = |I| - |lr ∩ t.values_of_items|
    }
3.  Sort(T'_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.    t = T'_lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    N_lr = N_lr + 1
7.    conf(r) = N_r / N_lr
8.    Remove t from T'_lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2

INPUT: the source database D; the size |D| of the database; a set Rh of rules to hide; the min_supp threshold; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    N_r = N_r - 1
8.    supp(r) = N_r / |D|
9.    conf(r) = N_r / N_lr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D; the size |D| of the database; a set Rh of rules to hide; the min_supp threshold; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    N_r = N_r - 1
8.    Recompute N_lr
9.    supp(r) = N_r / |D|
10.   conf(r) = N_r / N_lr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support value, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) × 100) / maxsup(l_r) and maxconf(r) = (maxsup(r) × 100) / minsup(l_r), where l_r denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A)

• A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
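These bounds can be computed directly from the fuzzified database; the following is a sketch (assuming each transaction is a dict mapping items to 0, 1 or "?", and omitting division-by-zero guards):

def support_bounds(db, itemset):
    # minsup counts transactions that certainly support the itemset;
    # maxsup counts those that certainly or possibly support it (percentages)
    certain = sum(all(t[i] == 1 for i in itemset) for t in db)
    possible = sum(all(t[i] in (1, "?") for i in itemset) for t in db)
    return 100.0 * certain / len(db), 100.0 * possible / len(db)

def confidence_bounds(db, antecedent, consequent):
    # minconf(r) = minsup(r) * 100 / maxsup(lr)
    # maxconf(r) = maxsup(r) * 100 / minsup(lr)
    min_r, max_r = support_bounds(db, antecedent | consequent)
    min_a, max_a = support_bounds(db, antecedent)
    return min_r * 100 / max_a, max_r * 100 / min_a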


Algorithm 4

INPUT: the source database D; the size |D| of the database; a set L of large itemsets; the set Lh of large itemsets to hide; the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {T_Z | Z ∈ Lh and ∀ t ∈ D: t ∈ T_Z ⟹ t supports Z}
Foreach Z in Lh {
3.  Sort(T_Z) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp) {
4.    t = pop_from(T_Z)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
7.      if (t ∈ T_X) delete(t, T_X)
      }
8.    delete(a, t, D)
    }
}
End

Algorithm 5

INPUT: the source database D; the size |D| of the database; a set L of large itemsets; the set Lh of large itemsets to hide; the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2.  i = 0
3.  j = 0
4.  ⟨T_Z⟩ = sequence of transactions supporting Z
5.  ⟨Z⟩ = sequence of items in Z
    Repeat until (supp(Z) < min_supp) {
6.    a = ⟨Z⟩[i]
7.    t = ⟨T_Z⟩[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
9.      if (t ∈ T_X) then delete(t, T_X)
      }
10.   delete(a, t, D)
11.   i = (i+1) modulo size(Z)
12.   j = j+1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's with "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D; a set L of large itemsets; the set Lh of large itemsets to hide; MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2.  T_Z = {t ∈ D | t supports Z}
3.  Sort the transactions in T_Z in ascending order of transaction size
    Repeat until (minsup(Z) < MST - SM) {
4.    Place a "?" mark for the item with the largest minimum support of Z in the next transaction in T_Z
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to give an idea of the problem, we report that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non sensitive rules contained in the original database were erased after the sanitization (ghost rules). This behavior in a critical database could have a relevant impact. For example, as usual, let us consider the Health Database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient. In


Algorithm 7

INPUT: the source database D; a set Rh of rules to hide; MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM) {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a "?" mark for the place of j in t
7.    Recompute the minsup(r)
8.    Recompute the minconf(r)
9.    Recompute the minconf of other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D; a set Rh of rules to hide; MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T'_lr count the number of items of lr in t
3.  sort the transactions in T'_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT - SM) {
4.    Choose the first transaction t ∈ T'_lr
5.    Place a "?" mark in t for the items in lr that are not supported by t
6.    Recompute the maxsup(lr)
7.    Recompute the minconf(r)
8.    Recompute the minconf of other affected rules
9.    Remove t from T'_lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point in which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications in the case of very complex information could be onerous in terms of performance. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call this type of item Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection, and we use one of these Zero Impact Items in order to hide the target rule.
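The resulting selection strategy can be sketched as follows (Python, reusing the AIS dataclasses sketched in Chapter 2; the damage formula is a simplified stand-in for the Item Impact Rank algorithm of Figure 3.8, not the report's exact computation):

def appears_in_asset(asset, item):
    # an item is zero impact when no DMG of the ASSET mentions it
    return any(item in (a.name for a in ais.attributes) for ais in asset)

def dq_damage(asset, item, n_steps=1):
    # simulated DQ loss of modifying `item` n_steps times: weighted
    # accuracy cost plus the weights of the constraints touching it
    damage = 0.0
    for ais in asset:
        for a in ais.attributes:
            if a.name == item:
                damage += ais.weight * (a.aw * n_steps +
                                        sum(c.cw for c in a.slink))
    return damage

def select_item(asset, rule_items):
    # zero impact items first; otherwise rank by simulated DQ damage
    zero = [i for i in rule_items if not appears_in_asset(asset, i)]
    if zero:
        return zero[0]  # no ASSET inspection needed
    return min(rule_items, key=lambda i: dq_damage(asset, i))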


3.7 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this it is possible to use a well known algorithm, named the APRIORI algorithm [5], that, given a database, extracts all the rules contained in it that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item exists in the rule that can be assumed to be non relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non zero impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is a little difference between the case in which the item selected is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates

¹A Market Basket database is a database in which each transaction represents the products acquired by the customers.

²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Figure 3.6: Flow chart of the DQDB algorithm — identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, select an item from it, otherwise inspect the Asset for each item in the rule, evaluate the DQ impact (accuracy, consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item value from 1 to 0 until the rule is hidden.]

the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness, we try to give an idea of its complexity:

• The search of all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N·O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero impact items discovered

1.  Begin
2.    Zitem = ∅
3.    Rsup = Rh
4.    Isup = list of the items contained in Rsup
5.    While (Isup ≠ ∅) {
6.      res = 0
7.      Select item from Isup
8.      res = Search(Asset, item)
9.      if (res == 0) then Zitem = Zitem + item
10.     Isup = Isup - item
11.   }
12.   return(Zitem)
13. End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1.  Begin
2.    Qitem = ∅
3.    Confa = Sup(Rh) / Sup(ant(Rh))
4.    Confp = Confa
5.    Nstep_if_ant = 0
6.    Nstep_if_post = 0
7.    While (Confa > Th) do {
8.      Nstep_if_ant++
9.      Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
10.   }
11.   While (Confp > Th) do {
12.     Nstep_if_post++
13.     Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
14.   }
15.   For each item in Rh do {
16.     if (item ∈ ant(Rh)) then N = Nstep_if_ant
17.     else N = Nstep_if_post
18.     For each AIS ∈ Asset do {
19.       node = recover_item(AIS, item)
20.       Accur_Cost = node.accuracy_Weight * N
21.       Constr_Cost = Constr_surf(N, node)
22.       item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
23.     }
24.   }
25.   sort_by_impact(Items)
26.   Return(Items)
27. End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity of O(N_C), where N_C is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items|·O(N_C). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset|·O(lg(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

Begin
  Zitem = DQDB_Zero_Impact(Asset, Rh)
  if (Zitem ≠ ∅) then item = random_sel(Zitem)
  else
  {
    Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
    item = best(Impact_set)
  }
  Tr = {t ∈ D | t fully supports Rh}
  While (Rh.Conf > Threshold) do
    sanitize(Tr, item)
End

Figure 3.9: The Distortion-Based Sanitization Algorithm
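A minimal Python sketch of this distortion step is given below; it assumes, purely for illustration, a transactional database encoded as a list of item sets, and the helper names are ours, not part of the algorithm above.

def confidence(db, ant, cons):
    """conf(Rh) = sup(ant | cons) / sup(ant) over a list of item sets."""
    ant_sup = sum(1 for t in db if ant <= t)
    full_sup = sum(1 for t in db if ant | cons <= t)
    return full_sup / ant_sup if ant_sup else 0.0

def distortion_hide(db, ant, cons, item, threshold):
    """Drop `item` (1 -> 0) from transactions fully supporting the rule
    until the confidence of ant -> cons falls below the threshold."""
    for t in [t for t in db if ant | cons <= t]:
        if confidence(db, ant, cons) <= threshold:
            break
        t.discard(item)  # the distortion step: 1 becomes 0

db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"}]
distortion_hide(db, {"A", "B"}, {"C"}, "C", threshold=0.5)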

3.8 Data Quality Based Blocking Algorithm

In the blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How to interpret this value?". The final effect is then the same as that of the Distortion-Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on data quality. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and to the data quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Flow chart omitted. Its steps are: identify a sensitive rule; identify the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Completeness, Consistency), rank the items by quality impact and select the best item; then repeatedly select a transaction fully supporting the sensitive rule and convert its value to the meaningless symbol until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the meaningless symbol instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
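Under the same illustrative assumptions as the previous sketch, the blocking variant differs only in the write operation and in the fact that confidence can only be counted over values that are still certain. Here transactions are dictionaries mapping items to "1", "0" or the unknown symbol; the choice of "?" as that symbol, like the helper names, is ours.

def blocking_hide(db, ant, cons, item, threshold):
    """Mask `item` in transactions fully supporting the rule until the
    confidence computed over certain ('1') values falls below the threshold."""
    rule = ant | cons
    def sup(items):
        return sum(1 for t in db if all(t.get(i) == "1" for i in items))
    while sup(ant) and sup(rule) / sup(ant) > threshold:
        t = next(t for t in db if all(t.get(i) == "1" for i in rule))
        t[item] = "?"  # the blocking step: the value becomes uninterpretable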

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).

4. We reapply the process for all the rules contained in the set of sensitive rules.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

Begin
  Qitem = ∅
  Confa = Sup(Rh) / Sup(ant(Rh))
  Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
  Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
  For each item in Rh do
  {
    if (item ∈ ant(Rh)) then N = Nstep_ant
    else N = Nstep_post
    For each AIS ∈ Asset do
    {
      node = recover_item(AIS, item)
      Complet_Cost = node.Complet_Weight ∗ N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost ∗ AIS.Complet_weight) + (Constr_Cost ∗ AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking-Based Algorithm

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, we propose here an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. Given the sensitive set, we identify the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. All the rules supported by the chosen item are simultaneously sanitized, and they are removed from the sensitive set as soon as they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Rules Zero Impact Hiding

• Multi Rules Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

Begin
  Rsup = Rs
  while (Rsup ≠ ∅) do
  {
    select r from Rsup
    for (i = 0; i ≤ transaction_max_size; i++) do
      if (item[i] ∈ r) then
      {
        item[i].count++
        item[i].list = item[i].list ∪ r
      }
    Rsup = Rsup − r
  }
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = ∅
  while (Rsup ≠ ∅) and (Isup ≠ ∅) do
  {
    select r from Rsup
    for each (item ∈ r) do
    {
      if (item ∈ Isup) then
      {
        if (item ∉ Asset) then Zitem = Zitem + item
        Isup = Isup − item
      }
    }
    Rsup = Rsup − r
  }
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check, for every item in the item list, whether the item appears in the rule. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| ∗ |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the zero-impact items for all the sensitive rules.
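A Python sketch of this classification step follows; the data representation (sensitive rules given as a mapping from rule identifiers to their item sets) is an illustrative assumption of ours.

from collections import defaultdict

def classify_occurrences(sensitive_rules, items):
    """Count, for every item, how many sensitive rules it supports,
    and remember which ones (cf. Figure 3.12)."""
    count = defaultdict(int)
    rules_of = defaultdict(list)
    for rule_id, rule_items in sensitive_rules.items():
        for item in items:
            if item in rule_items:
                count[item] += 1
                rules_of[item].append(rule_id)
    # items supporting the most rules come first
    ranked = sorted(count, key=count.get, reverse=True)
    return ranked, rules_of

rules = {"R1": {"A", "B", "C"}, "R2": {"B", "C"}, "R3": {"C", "D"}}
print(classify_occurrences(rules, ["A", "B", "C", "D"])[0])  # ['C', 'B', 'A', 'D']

The nested loops make the |sensitive set| ∗ |Item List| cost mentioned above explicit.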

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule of the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present in what follows the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do
  {
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
      {
        Rss = Rss − r
        Rs = Rs − r
      }
    }
  }
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
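The following Python sketch mirrors Figure 3.14 under the same illustrative encodings used so far (rules as pairs of antecedent/consequent item sets, a pluggable sanitize primitive); the names are ours.

def support(db, items):
    return sum(1 for t in db if items <= t)

def confidence(db, ant, cons):
    return support(db, ant | cons) / support(db, ant) if support(db, ant) else 0.0

def multi_rule_zero_impact_hide(db, rules, zero_items, minsup, minconf, sanitize):
    """Burst sanitization over the zero-impact items; `rules` maps a rule id
    to (antecedent, consequent) and is pruned as rules become hidden."""
    for item in zero_items:
        rss = {r for r, (a, c) in rules.items() if item in a | c}
        tss = [t for t in db if any(rules[r][0] | rules[r][1] <= t for r in rss)]
        while tss and rss:
            sanitize(tss.pop(), item)
            for r in list(rss):
                a, c = rules[r]
                if support(db, a | c) < minsup or confidence(db, a, c) < minconf:
                    rss.discard(r)
                    rules.pop(r)  # hidden: drop from the sensitive set
    return rules  # rules left over need the low impact phase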

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.
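For instance, under the encodings of the previous sketches, the two primitives can be passed interchangeably to the hiding procedures (illustrative only; note the set encoding for distortion and the dict encoding for blocking):

def distortion(t, item):
    t.discard(item)      # set encoding: write a 0

def blocking(t, item):
    t[item] = "?"        # dict encoding: write the unknown symbol

# e.g. multi_rule_zero_impact_hide(db, rules, zitems, minsup, minconf, sanitize=distortion)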

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
  {
    for each rule ∈ item.list do
    {
      N = compute_step(item, thresholds)
      For each AIS ∈ Asset do
      {
        node = recover_item(AIS, item)
        item.list[rule].cost = item.list[rule].cost + Compute_Cost(node, N)
      }
    }
    item.maxcost = maxcost(item)
  }
  Sort Iml by max cost and number of supported rules
  While (Rs ≠ ∅) do
  {
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
      {
        Rss = Rss − r
        Rs = Rs − r
      }
    }
    Iml = Iml − Iml[1]
  }
End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
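Gluing the previous sketches together gives a rough Python counterpart of Figure 3.16. As a stated simplification of ours, the low-impact phase below ranks the remaining items by occurrence count instead of the full DQ cost of Figure 3.15:

def sanitize_database(db, rules, asset_items, minsup, minconf, sanitize):
    items = sorted({i for t in db for i in t})
    ranked, _ = classify_occurrences({r: a | c for r, (a, c) in rules.items()}, items)
    zitems = [i for i in ranked if i not in asset_items]   # zero-impact first
    if zitems:
        rules = multi_rule_zero_impact_hide(db, rules, zitems, minsup, minconf, sanitize)
    if rules:                                              # low impact phase
        remaining = [i for i in ranked if i in asset_items]
        rules = multi_rule_zero_impact_hide(db, rules, remaining, minsup, minconf, sanitize)
    return db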

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion- and blocking-based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim, and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino, and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. In 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. In Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid, and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz, eds. North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty, and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton, and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proc. Tenth ACM Symposium on Theory of Computing, pp. 114-118, 1978. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau, and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino, and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. In 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. In ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. In Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and pp. 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. In Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf

[97] University of Milan - Computer Technology Institute - Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin, and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. In 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations, while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national


Chapter 1

Data Quality Concepts

Traditionally DQ is a measure of the consistency between the data views pre-sented by an information system and the same data in the real-world [70] Thisdefinition is strongly related with the classical definition of information systemas a ldquomodel of a finite subset of the real worldrdquo [56] More in detail Levitinand Redman [59] claim that DQ is the instrument by which it is possible toevaluate if data models are well defined and data values accurate The mainproblem with DQ is that its evaluation is relative [94] in that it usually de-pends from the context in which data are used DQ can be assumed as acomplex characteristic of the data itself In the scientific literature DQ is con-sidered a multi-dimensional concept that in some environments involves bothobjective and subjective parameters [8104105] Table 1 lists of the most usedDQ parameters [104]

Accuracy Format Comparability Precision ClarityReliability Interpretability Conciseness Flexybility UsefulnessTimeliness Content Freedom from bias Understandability QuantitativenessRelevance Efficiency Informativeness Currency SufficiencyCompleteness Importance Level of Detail consistence Usableness

Table 11 Data Quality parameters

In the context of PPDM DQ has a similar meaning with however an impor-tant difference The difference is that in the case of PPDM the real world is theoriginal database and we want to measure how closely the sanitized databaseis to the original one with respect to some relevant properties In other wordswe are interested in assessing whether given a target database the sanitizationphase will compromise the quality of the mining results that can be obtainedfrom the sanitized database In order to understand better how to measure DQwe introduce a formalization of the sanitization phase In a PPDM process agiven database DB is modified in order to obtain a new database DB prime

7

8 CHAPTER 1 DATA QUALITY CONCEPTS

Definition 1

Let DB be the database to be sanitized Let ΓDB be the set of all the aggregateinformation contained in DB Let ΞDB sube ΓDB be the set of the sensitive in-formation to hide A transformation ξ D rarr D where D is the set of possibleinstances of a DB schema is a perfect sanitization if Γξ(DB) = ΓDB minus ΞDB

In other words the ideal sanitization is the one that completely removes thesensitive high level information while preserving at the same time the otherinformation This would be possible in the case in which constraints and rela-tionship between the data and between the information contained in the data-base do not exist or roughly speaking assuming the information to be a simpleaggregation of data if the intersection between the different information is al-ways equal to empty However this hypothetical scenario due to the complexityof modern databases in not possible In fact as explained in the introductionof this section we must take into account not only the high level informationcontained in the database but we must also consider the relationships betweendifferent information or different data that is the Π function we mentioned inthe previous section In order to identify the possible parameters that can beused to assess the DQ in the context of PPDM Algorithms we performed a setof tests More in detail we perform the following operations

1 We built a set of different types of database with different data structuresand containing information completely different from the meaning and theusefulness point of view

2 We identified a set of information judged relevant for every type of data-base

3 We applied a set of PPDM algorithms to every database in order to protectthe relevant information

Observing the results we identified the following four categories of damages(or loss of quality) to the informative asset of the database

bull Ambiguous transformation This is the case in which the transforma-tion introduces some uncertainty in the non sensitive information thatcan be then misinterpreted It is for example the case of aggregation algo-rithms and Perturbation algorithms It can be viewed even as a precisionlack when for example some numerical data are standardized in order tohide some information

bull Incomplete transformation The sanitized database results incom-plete More specifically some values of the attributes contained in thedatabase are marked as ldquoBlankrdquo For this reason information may resultincomplete and cannot be used This is typically the effect of the BlockingPPDM algorithm class

bull Meaningless transformation In this case the sanitized database con-tains information without meaning That happen in many cases when

9

perturbation or heuristic-based algorithms are applied without taking intoaccount the meaning of the information stored into the database

bull Implicit Constraints Violation Every database is characterized bysome implicit constraints derived from the external world that are notdirectly represented in the database in terms of structure and relation (theexample of medicines A and B presented in the previous section could bean example) Transformations that do not consider these constraints riskto compromise the database introducing inconsistencies in the database

The results of these experiments magnify the fact that in the context ofprivacy preserving data mining only a little portion of the parameters showedin table 11 are of interest More in details are relevant the parameters allowingto capture a possible variation of meaning or that are indirectly related withthe meaning of the information stored in a target database Therefore weidentify as most appropriate DQ parameters for PPDM Algorithms the followingdimensions

bull Accuracy it measures the proximity of a sanitized value arsquo to the originalvalue a In an informal way we say that a tuple t is accurate if thesanitization has not modified the attributes contained in the tuple t

bull Completeness it evaluates the percentage of data from the original data-base that are missing from the sanitized database Therefore a tuple t iscomplete if after the sanitization every attribute is not empty

bull Consistency it is related to the semantic constraints holding on the dataand it measures how many of these constraints are still satisfied after thesanitization

We now present the formal definitions of those parameters for use in theremainder of the discussion

Let OD be the original database and SD be the sanitized database resultingfrom the application of the PPDM algorithm Without loosing generality andin order to make simpler the following definitions we assume that OD (and con-sequently SD) be composed by a single relation We also adopt the positionalnotation to denote attributes in relations Thus let odi (sdi) be the i-th tuplein OD (SD) then odik (sdik) denotes the kth attribute of odi (sdi) Moreoverlet n be the total number of the attributes of interest we assume that attributesin positions 1 m (m le n) are the primary key attributes of the relation

Definition 2

Let sdj be a tuple of SD We say that sdj is Accurate if notexistodi isin OD suchthat ((odik = sdjk)forallk = 1m and exist(odif 6= sdjf ) (sdjf 6= NULL) f = m +1 n))Definition 3

A sdj is Complete if (existodi isin OD such that (odik = sdjk)forallk = 1m) and

10 CHAPTER 1 DATA QUALITY CONCEPTS

(notexist(sdjf = NULL) f = m + 1 n)

Where we intend for ldquoNULLrdquo value a value without meaning in a well definedcontext

Let C be the set of the constraints defined on database OD in what followswe denote with cij the jth constraint on attribute i We assume here constraintson a single attribute but it is easily possible to extend the measure to complexconstraints

Definition 4

A tuple sdk is Consistent if notexistcij isin C such that cij(sdki) = false i = 1nStarting from these definitions it is possible to introduce three metrics (reportedin Table 12) that allow us to measure the lack of accuracy completeness andconsistency More specifically we define the lack of accuracy as the proportionof non accurate items (ie the amount of items modified during the sanitiza-tion) with respect to the number of items contained in the sanitized databaseSimilarly the lack of completeness is defined as the proportion of non completeitems (ie the items substituted with a NULL value during the sanitization)with respect to the total number of items contained in the sanitized databaseThe consistency lack is simply defined as the number of constraints violationin SD due to the sanitization phase

Name Short Explanation Expression

Accuracy Lack The proportion of non accurate items in SD λSD = |SDnacc||SD|

Completeness Lack The proportion of non complete items in SD ϑSD = |SDnc||SD|

Consistency Lack The number of constraint violations in SD $SD = Nc

Table 12 The three parameters of interest in PPDM DQ evaluation In theexpressions SDnacc denoted the set of not accurate items SDnc denotes the setof not complete items and NC denotes the number of constraint violations

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sani-tized database These parameters are however not sufficient to help us in theconstruction of a new PPDM algorithm based on the preservation of DQ Asexplained in the introduction of this report the main problem of actual PPDMalgorithms is that they are not aware of the relevance of the information storedin the database nor of the relationships among these information

It is necessary to provide then a formal description that allow us to magnifythe aggregate information of interest for a target database and the relevance ofDQ properties for each aggregate information (for some aggregate informationnot all the DQ properties have the same relevance) Moreover for each aggre-gate information it is necessary to provide a structure in order to describe therelevance (in term of consistency completeness and accuracy level required) atthe attribute level and the constraints the attributes must satisfy The Infor-mation Quality Model (IQM) proposed here addresses this requirement

In the following we give a formal definition of Data Model Graph (DMG)(used to represent the attributes involved in an aggregate information and theirconstraints) and Aggregation Information Schema (AIS) Moreover we give adefinition for an Aggregate information Schema Set (ASSET) Before giving thedefinition of DMG AIS and ASSET we introduce some preliminary concepts

Definition 5

An Attribute Class is defined as the tuple ATC =lt nameAWAVCWCVCSV Slink gt

where

bull Name is the attribute id

bull AW is the accuracy weight for the target attribute

bull AV is the accuracy value

bull CW is the completeness weigh for the target attribute

bull CV is the completeness value

11

12 CHAPTER 2 THE DATA QUALITY SCHEME

bull CSV is the consistency value

bull Slink is list of simple constraints

An attribute class represents an attribute of a target database schema in-volved in the construction of a certain aggregate information for which we wantto evaluate the data quality The attributes however have a different relevancein an aggregate information For this reason a set of weights is associated toevery attribute class specifying for example if for a certain attribute a lack ofaccuracy must be considered as relevant damage or not

The attributes are the bricks of our structure However a simple list of at-tributes in not sufficient In fact in order to evaluate the consistency propertywe need to take in consideration also the relationships between the attributesThe relationships can be represented by logical constraints and the validationof these constraints give us a measure of the consistency of an aggregate infor-mation As in the case of the attributes also for the constraints we need tospecify their relevance (ie a violation of a constraint cause a negligible or asevere damage) Moreover there exists constraints involving at the same timemore then two attributes In what follows the definitions of simple constraintand complex constraint are showed

Definition 6

A Simple Constraint Class is defined as the tuple SCC =lt nameConstr CWClink CSV gt

where

bull Name is the constraint id

bull Constraint describes the constraint using some logic expression

bull CW is the weigh of the constraint It represents the relevance of thisconstraint in the AIS

bull CSV is the number of violations to the constraint

bull Clink it is the list of complex constraints defined on SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC =lt nameOperator

CWCSV SCC link gt where

bull Name is the Complex Constraint id

bull Operator is the ldquoMergingrdquo operator by which the simple constraints areused to build the complex one

bull CW is the weigh of the complex constraint

bull CSV is the number of violations

13

bull SCC link is the list of all the SCC that are related to the CCC

We have now the bricks and the glue of our structure A DMG is a collectionof attributes and constraints allowing one to describe an aggregate informationMore formally let D be a database

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features

bull A set of nodes NA where each node is an Attribute Class

bull A set of nodes SCC where each node describes a Simple Constraint Class

bull A set of nodes CCC where each node describes a Complex Constraint Class

bull A set of direct edges LNjNk LNjNk isin ((NAXSCC) cup (SCCXCCC) cup(SCCXNA) cup (CCCXNA))

As is the case of attributes and constraints even at this level there exists ag-gregate information more relevant than other We define then an AIS as follows

Definition 9

An AIS φ is defined as a tuple lt γ ξ λ ϑ$WAIS gt where γ is a name ξ

is a DMG λ is the accuracy of AIS ϑ is the completeness of AIS $ is theconsistency of AIS and WAIS represent the relevance of AIS in the databaseIn a database there are usually more than an aggregate information for whichone is interested to measure the quality

Definition 9

With respect to D we define ASSET (Aggregate information Schema Set) asthe collection of all the relevant AIS of the database

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.
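To make the structure just defined more concrete, the following is a minimal sketch of one possible in-memory representation of an ASSET and its components. It is only illustrative: the class and field names (and the use of Python itself) are our assumptions, not part of the formal definitions above.

    # Illustrative sketch of the DQ scheme; names are assumptions, not the report's API.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class AttributeClass:
        name: str
        aw: float   # accuracy weight
        av: float   # accuracy value
        cw: float   # completeness weight
        cv: float   # completeness value
        csv: int    # consistency value
        slink: List['SimpleConstraintClass'] = field(default_factory=list)

    @dataclass
    class SimpleConstraintClass:
        name: str
        constraint: Callable[..., bool]  # logic expression over attribute values
        cw: float                        # weight: relevance of a violation
        csv: int = 0                     # number of violations
        clink: List['ComplexConstraintClass'] = field(default_factory=list)

    @dataclass
    class ComplexConstraintClass:
        name: str
        operator: str                    # "merging" operator over simple constraints
        cw: float
        csv: int = 0
        scc_link: List[SimpleConstraintClass] = field(default_factory=list)

    @dataclass
    class DMG:
        attributes: List[AttributeClass]
        simple_constraints: List[SimpleConstraintClass]
        complex_constraints: List[ComplexConstraintClass]

    @dataclass
    class AIS:
        name: str          # gamma
        dmg: DMG           # xi
        accuracy: float    # lambda
        completeness: float
        consistency: float
        weight: float      # relevance of the AIS in the database

    # An ASSET is simply the collection of all the relevant AIS of the database.
    ASSET = List[AIS]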

As is probably obvious, the design of a good IQM schema and, more in general, of a good ASSET is fundamental in order to characterize the target database in a correct way.

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting today to use the Internet and, more generally, public networks in order to monitor and remotely manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist a lot of problems related to system security and to privacy preservation.


Figure 2.1: Example of DMG. (The graph links Data Attribute and Constraint nodes inside a DMG; the labels Accuracy, Completeness, Operator and Constraint Weight mark the parameters attached to them.)


More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. Such new meters are able to collect a large amount of information related to home energy consumption, and they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve behavioral patterns of the people living in the house (e.g., from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence, and so on, is highly relevant for the analysts who want to correctly analyze the power grid, for example to calculate the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e., wrong power consumption estimation and, consequently, blackouts).


Figure 2.2: Example ASSET. To every AIS contained in the ASSET the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.


An IQM Example

The electric meters database contains several pieces of information related to energy consumption: average electric frequency, fault events, etc. It is not our purpose to describe the whole database here. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the table related to the localization information of the electric meter and the table related to the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is located.

• City: it contains the name of the city in which the home hosting the meter is located.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour.

• Max energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house monitored by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated with the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
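As an illustration, reusing the SimpleConstraintClass sketched in Chapter 2 (the function name, the constraint name, and the weight value are hypothetical):

    # Hypothetical encoding of the "local code in [1, 1000]" simple constraint.
    def local_code_in_range(local_code: int) -> bool:
        # A value outside this range is meaningless and would reveal to a
        # malicious user that the tuple has been sanitized.
        return 1 <= local_code <= 1000

    local_code_constraint = SimpleConstraintClass(
        name="local_code_range",
        constraint=local_code_in_range,
        cw=0.9,  # weight near 1: a violation is considered severe damage
    )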


Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. (The AIS Loc DMG links the Energy Counter Id, Local Code, City, Network Number and Subnetwork Code, with the constraints "value in 1-1000", "value in 0-50", "they correspond to the same city" and "the subnet is contained in the correct network". The AIS Cons DMG links the Energy Counter Id, the Date, the hourly consumption attributes En 1 am ... En 12 pm and Max 1 am ... Max 12 pm, the Average energy consumption and the Max energy consumption, with the constraints "average energy used equal to" and "max equal to".)

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists even over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assume a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of the information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e., "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation with the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
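A sketch of how the three consistency checks just described could be expressed, assuming the table fields listed above (the function names and the tolerance parameter are illustrative additions):

    # Hypothetical checks for the AIS Cons constraints described above.
    from typing import Sequence

    def avg_consistent(hourly: Sequence[float], declared_avg: float,
                       tol: float = 1e-6) -> bool:
        # Complex constraint: hourly values may be permuted or modified as long
        # as their mean still matches the declared average consumption.
        return abs(sum(hourly) / len(hourly) - declared_avg) <= tol

    def max_consistent(hourly_max: Sequence[float], declared_max: float,
                       tol: float = 1e-6) -> bool:
        # Complex constraint: the declared daily maximum must match the hourly maxima.
        return abs(max(hourly_max) - declared_max) <= tol

    def max_not_below_avg(hourly_max: Sequence[float],
                          hourly_avg: Sequence[float]) -> bool:
        # Simple constraints: each hourly maximum cannot be smaller than the
        # corresponding hourly average.
        return all(mx >= av for mx, av in zip(hourly_max, hourly_avg))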

By the use of these two schemas we have described some characteristics of the information stored in the database, related to its meaning and its usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of clustering algorithms, of classification algorithms, and of association rule algorithms.

In this report we focus on association rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association rule data mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association rule data mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a big range of different types of databases with different characteristics.

• Association rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.



3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or in an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple < TID, values_of_items, size >, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value in values_of_items is 0. Size is the number of 1 values which appear in values_of_items, that is, the number of items supported by the transaction.

An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to rules that are both frequent and strong. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf, correspondingly.
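As a concrete illustration of these definitions, consider the following minimal sketch (the toy database and the function names are our own; supports are expressed as fractions rather than percentages):

    # Bitmap-encoded transactions: (TID, values_of_items, size).
    db = [
        (1, [1, 1, 0, 1], 3),
        (2, [1, 1, 1, 0], 3),
        (3, [0, 1, 0, 1], 2),
        (4, [1, 1, 0, 0], 2),
    ]

    def supports(t, itemset):
        # True if transaction t supports every item index in itemset.
        return all(t[1][i] == 1 for i in itemset)

    def support(db, itemset):
        # Fraction of transactions containing all the items of the itemset.
        return sum(supports(t, itemset) for t in db) / len(db)

    def confidence(db, antecedent, consequent):
        # P(antecedent and consequent) / P(antecedent).
        return support(db, antecedent | consequent) / support(db, antecedent)

    # Rule {i0} => {i1}: support 0.75 and confidence 1.0 on this toy database.
    print(support(db, {0, 1}), confidence(db, {0}, {1}))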


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ∪ Y)/Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level. The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T′lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T′lr do {
2.    t.num_items = |I| − |lr ∩ t.values_of_items|
    }
3.  Sort(T′lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.    t = T′lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    Nlr = Nlr + 1
7.    conf(r) = Nr/Nlr
8.    Remove t from T′lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2

INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr − 1
8.    supp(r) = Nr/|D|
9.    conf(r) = Nr/Nlr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr − 1
8.    Recompute Nlr
9.    supp(r) = Nr/|D|
10.   conf(r) = Nr/Nlr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum confidence of a rule is, consequently, the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r,

minconf(r) = (minsup(r) ∗ 100)/maxsup(lr)  and  maxconf(r) = (maxsup(r) ∗ 100)/minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
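Translating these definitions directly (a minimal sketch; the encoding of "?" as None, the toy database, and the use of fractions instead of percentages are our assumptions):

    # Min/max support with unknown values: '?' is encoded here as None.
    def min_support(db, itemset):
        # Fraction of transactions that certainly support every item in itemset.
        return sum(all(t[1][i] == 1 for i in itemset) for t in db) / len(db)

    def max_support(db, itemset):
        # Fraction of transactions that support or could support the itemset.
        return sum(all(t[1][i] in (1, None) for i in itemset) for t in db) / len(db)

    def min_confidence(db, antecedent, consequent):
        return min_support(db, antecedent | consequent) / max_support(db, antecedent)

    def max_confidence(db, antecedent, consequent):
        return max_support(db, antecedent | consequent) / min_support(db, antecedent)

    # Toy database: transaction 1 has a '?' for item 1, transaction 3 for item 2.
    db = [
        (1, [1, None, 0, 1], 3),
        (2, [1, 1, 1, 0], 3),
        (3, [0, 1, None, 1], 2),
        (4, [1, 1, 0, 0], 2),
    ]
    print(min_support(db, {0, 1}))  # 0.5  (only t2 and t4 certainly support it)
    print(max_support(db, {0, 1}))  # 0.75 (t1 could also support it)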


Algorithm 4

INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⟹ t supports Z}
Foreach Z in Lh {
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp) {
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
7.      if (t ∈ TX) delete(t, TX)
      }
8.    delete(a, t, D)
    }
}
End

Algorithm 5

INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2.  i = 0
3.  j = 0
4.  <TZ> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp) {
6.    a = <Z>[i]
7.    t = <TZ>[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
9.      if (t ∈ TX) then delete(t, TX)
      }
10.   delete(a, t, D)
11.   i = (i+1) modulo size(Z)
12.   j = j+1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent by unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST − SM) {
4.    Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem, during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider a health database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, could choose a wrong therapy for a patient.


Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM) {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a "?" mark in the place of j in t
7.    Recompute minsup(r)
8.    Recompute minconf(r)
9.    Recompute the minconf of the other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T′lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T′lr count the number of items of lr in t
3.  sort the transactions in T′lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM) {
4.    Choose the first transaction t ∈ T′lr
5.    Place a "?" mark in t for the items in lr that are not supported by t
6.    Recompute maxsup(lr)
7.    Recompute minconf(r)
8.    Recompute the minconf of the other affected rules
9.    Remove t from T′lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating other unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way on any type of categorical database, without knowing anything about the use of the database, the relationships between the data, etc. More specifically, we note that the key point of this incorrect behavior lies in the selection of the item to be modified: in all these algorithms the items are selected randomly, without considering their relevance and their impact on the database. We therefore introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms,


and, more specifically, in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us achieve this goal. The general idea we suggest here is relatively intuitive. A target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.
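A compact sketch of this selection logic, reusing the illustrative classes from Chapter 2 (the function names and the damage formula inside simulate_damage are our assumptions, not the report's implementation):

    def simulate_damage(asset, item_name):
        # Illustrative stand-in for the ASSET inspection: sum the accuracy
        # weight of the matching attribute and the weights of its constraints,
        # scaled by the relevance of each AIS in which the attribute appears.
        damage = 0.0
        for ais in asset:
            for attr in ais.dmg.attributes:
                if attr.name == item_name:
                    damage += ais.weight * (attr.aw + sum(c.cw for c in attr.slink))
        return damage

    def select_item(rule_items, asset):
        """Select the item of a sensitive rule whose modification hurts DQ least."""
        covered = {attr.name for ais in asset for attr in ais.dmg.attributes}
        zero_impact = [i for i in rule_items if i not in covered]
        if zero_impact:
            return zero_impact[0]  # zero-impact item: no ASSET inspection needed
        return min(rule_items, key=lambda i: simulate_damage(asset, i))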


3.7 Data Quality Distortion Based Algorithm

In distortion-based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by distorting some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming we work with categorical databases like the classical market basket database (1), the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this it is possible to use the well-known APRIORI algorithm [5], which, given a database, extracts all the contained rules having a confidence over a threshold level (2); details on this algorithm are given in Appendix B. The output of this phase is a set of rules, from which a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm); then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

(1) A market basket database is a database in which each transaction represents the products acquired by a customer.

(2) This threshold represents the confidence over which a rule can be identified as relevant and sure.


Figure 3.6: Flow Chart of the DQDB Algorithm. (The chart reads: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, an item is selected from it, otherwise each item of the rule is inspected on the ASSET to evaluate its DQ impact (accuracy, consistency), the items are ranked by quality impact and the best one is selected; finally, transactions fully supporting the sensitive rule are selected and the item value is converted from 1 to 0 until the rule is hidden.)

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. For the sake of completeness, we try to give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N ∗ O(lg(M)), where N is the size of the rule and M is the number of different items contained in the ASSET.


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the (possibly empty) set Zitem of Zero Impact items discovered

1.  Begin
2.    Zitem = ∅
3.    Rsup = Rh
4.    Isup = list of the items contained in Rsup
5.    While (Isup ≠ ∅) {
6.      res = 0
7.      Select item from Isup
8.      res = Search(Asset, item)
9.      if (res == 0) then Zitem = Zitem + item
10.     Isup = Isup − item
      }
11.   return(Zitem)
12. End

Figure 3.7: Zero Impact Algorithm

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1.  Begin
2.    Qitem = ∅
3.    Confa = Sup(Rh)/Sup(ant(Rh))
4.    Confp = Confa
5.    Nstep_if_ant = 0
6.    Nstep_if_post = 0
7.    While (Confa > Th) do {
8.      Nstep_if_ant++
9.      Confa = (Sup(Rh) − Nstep_if_ant)/(Sup(ant(Rh)) − Nstep_if_ant)
      }
10.   While (Confp > Th) do {
11.     Nstep_if_post++
12.     Confp = (Sup(Rh) − Nstep_if_post)/Sup(ant(Rh))
      }
13.   For each item in Rh do {
14.     if (item ∈ ant(Rh)) then N = Nstep_if_ant
15.     else N = Nstep_if_post
16.     For each AIS ∈ Asset do {
17.       node = recover_item(AIS, item)
18.       Accur_Cost = node.accuracy_weight ∗ N
19.       Constr_Cost = Constr_surf(N, node)
20.       item.impact = item.impact + (Accur_Cost ∗ AIS.Accur_weight) + (Constr_Cost ∗ AIS.Constr_weight)
        }
      }
21.   sort_by_impact(Items)
22.   return(Items)
23. End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)
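As a worked complement to the two counting loops of Figure 3.8, the step counts can also be written in closed form under the same formulas (a sketch; we abbreviate $S_r = Sup(R_h)$, $S_a = Sup(ant(R_h))$ and write $Th$ for the threshold):

\[
\frac{S_r - N_{post}}{S_a} < Th \iff N_{post} > S_r - Th \cdot S_a
\]

for an item of the consequent (only the rule support decreases), and

\[
\frac{S_r - N_{ant}}{S_a - N_{ant}} < Th \iff N_{ant} > \frac{S_r - Th \cdot S_a}{1 - Th}, \qquad Th < 1, \; N_{ant} < S_a,
\]

for an item of the antecedent (both supports decrease). The loops stop at the smallest integers satisfying these inequalities.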


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration has, in the worst case, a complexity in O(NC), where NC is the total number of constraints contained in the ASSET; this operation is repeated for each item in the rule Rh and then takes |Items| ∗ O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| ∗ O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1.  Begin
2.    Zitem = DQDB_Zero_Impact(Asset, Rh)
3.    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4.    else {
5.      Impact_set = Item_Impact_Rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6.      item = best(Impact_set)
      }
7.    Tr = {t ∈ D | t fully supports Rh}
8.    While (Rh.Conf > Threshold) do
9.      sanitize(Tr, item)
10. End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion-based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on data quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the number-of-steps evaluation (which is of course different) and to the data quality evaluation. In fact, in this case it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

Figure 3.10: Flow Chart of the BBDB Algorithm. (The chart is identical to the one of Figure 3.6, except that the DQ impact is evaluated in terms of completeness and consistency and the value of the selected item is converted to "?".)

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the computational point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1.  Begin
2.    Qitem = ∅
3.    Confa = Sup(Rh)/Sup(ant(Rh))
4.    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5.    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6.    For each item in Rh do {
7.      if (item ∈ ant(Rh)) then N = Nstep_ant
8.      else N = Nstep_post
9.      For each AIS ∈ Asset do {
10.       node = recover_item(AIS, item)
11.       Complet_Cost = node.Complet_weight ∗ N
12.       Constr_Cost = Constr_surf(N, node)
13.       item.impact = item.impact + (Complet_Cost ∗ AIS.Complet_weight) + (Constr_Cost ∗ AIS.Constr_weight)
        }
      }
14.   sort_by_impact(Items)
15.   return(Items)
16. End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms and is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. Given the sensitive set, we identify the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set once hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1.  Begin
2.    Rsup = Rs
3.    while (Rsup ≠ ∅) do {
4.      select r from Rsup
5.      for (i = 0; i ≤ transaction_max_size; i++) do
6.        if (item[i] ∈ r) then {
7.          item[i].count++
8.          item[i].list = item[i].list ∪ r
          }
9.      Rsup = Rsup − r
      }
10.   sort(item)
11.   return(item)
12. End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact items

1.  Begin
2.    Rsup = Rs
3.    Isup = list of all the items
4.    Zitem = ∅
5.    while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6.      select r from Rsup
7.      for each (item ∈ r) do {
8.        if (item ∈ Isup) then {
9.          if (item ∉ Asset) then Zitem = Zitem + item
10.         Isup = Isup − item
          }
        }
11.     Rsup = Rsup − r
      }
12.   return(Zitem)
13. End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding itself. More in detail, in this phase the items are classified according to the number of supported rules; the Zero Impact Items are then classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier of the rule is inserted into the list of the rules supported by that item (this information is used in a subsequent step in order to optimize the computation). Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |item list|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches, in a burst, the Zero Impact Items for all the sensitive rules.
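For illustration, the classification step of Figure 3.12 can be condensed with standard library counters (a sketch; rules are modelled simply as sets of item identifiers):

    from collections import Counter, defaultdict

    def classify_items(sensitive_rules):
        """Count, for every item, how many sensitive rules it supports."""
        occurrences = Counter()
        supported = defaultdict(list)  # item -> rules containing it
        for rule in sensitive_rules:
            for item in rule:
                occurrences[item] += 1
                supported[item].append(rule)
        # Items sorted by decreasing number of supported rules.
        ranking = [item for item, _ in occurrences.most_common()]
        return ranking, supported

    rules = [{"a", "b"}, {"b", "c"}, {"b", "d"}, {"a", "c"}]
    ranking, supported = classify_items(rules)
    print(ranking[0])  # 'b': it supports three of the four rules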

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by each of these items;

• the items, supporting a rule of the sensitive rules set, that have the important property of being without impact on the DQ of the database.

Using this information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• a sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• a partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we now present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact items, the database D
OUTPUT: a partially sanitized database PS

1.  Begin
2.    Izm = {item ∈ Itemlist | item ∈ Zitem}
3.    for (i = 0; i < |Izm|; i++) do {
4.      Rss = {r ∈ Rs | Izm[i] ∈ r}
5.      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
6.      while (Tss ≠ ∅) and (Rss ≠ ∅) do {
7.        select a transaction t from Tss
8.        Tss = Tss − t
9.        Sanitize(t, Izm[i])
10.       Recompute_Sup_Conf(Rss)
11.       forall r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
12.         Rss = Rss − r
13.         Rs = Rs − r
          }
        }
      }
14. End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve.

Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the ASSET description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. It then tries to hide the rules by modifying the Zero Impact Items and, finally, if other rules remain to be hidden, the Low Impact Algorithm is invoked.

INPUT: the ASSET schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1.  Begin
2.    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3.    for each item ∈ Iml do {
4.      for each rule ∈ item.list do {
5.        N = compute_step(item, thresholds)
6.        For each AIS ∈ Asset do {
7.          node = recover_item(AIS, item)
8.          item.rulecost = item.rulecost + ComputeCost(node, N)
          }
        }
9.      item.maxcost = maxcost(item)
      }
10.   Sort Iml by max cost and number of rules supported
11.   While (Rs ≠ ∅) do {
12.     Rss = {r ∈ Rs | Iml[1] ∈ r}
13.     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
14.     while (Tss ≠ ∅) and (Rss ≠ ∅) do {
15.       select a transaction t from Tss
16.       Tss = Tss − t
17.       Sanitize(t, Iml[1])
18.       Recompute_Sup_Conf(Rss)
19.       forall r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
20.         Rss = Rss − r
21.         Rs = Rs − r
          }
        }
22.     Iml = Iml − Iml[1]
      }
23. End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the ASSET, the thresholds
OUTPUT: a sanitized database PS

1.  Begin
2.    Item = Item_Occ_Class(Rs, Itemlist)
3.    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4.    if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
5.    if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
6.  End

Figure 3.16: The General Algorithm
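Putting the pieces together, the following is a hedged sketch of how the general algorithm of Figure 3.16 could be orchestrated. It reuses the supports/support/confidence helpers sketched in Section 3.2, models rules as (antecedent, consequent) pairs of item indices, and treats the sanitization primitive as a parameter (0 for distortion, None for the "?" of blocking); all names are illustrative, not the report's implementation.

    from collections import Counter

    def sanitize_database(db, rules, asset_items, min_sup, min_conf, blank):
        """Sketch of Figure 3.16: rules are (antecedent, consequent) frozensets
        of item indices; asset_items is the set of item indices appearing in
        some DMG of the ASSET; blank is 0 (distortion) or None (blocking)."""
        occurrences = Counter(i for a, c in rules for i in a | c)
        # Zero-impact items first; within each group, the items supporting the
        # largest number of sensitive rules come first.
        order = sorted(occurrences,
                       key=lambda i: (i in asset_items, -occurrences[i]))
        pending = set(rules)
        for item in order:
            for a, c in [r for r in pending if item in r[0] | r[1]]:
                while (a, c) in pending:
                    t = next((t for t in db if supports(t, a | c)), None)
                    if t is None:
                        break
                    t[1][item] = blank  # sanitize the chosen item
                    if (support(db, a | c) < min_sup
                            or confidence(db, a, c) < min_conf):
                        pending.discard((a, c))
        return db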

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide a set of sensitive rules simultaneously. On the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.


Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principles of Database Systems pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference on Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955


[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support In Proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532, Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, Vol. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (Jim Gray, Series Editor). Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19 (2), pp. 191-201, 1995. Elsevier Science Publishers B. V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, Vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48 (3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, Vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, Vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October), pp. 379-423 and 623-656, 1948.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions On Knowledge And Data Engineering, Vol. 4, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, Vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf, 2003.

[97] University of Milan, Computer Technology Institute and Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008. 56 pp., 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.



• Relevance of data: not all the information stored in the database has the same level of relevance, and not all the information can be dealt with in the same way. An example may clarify this concept. Consider a categorical database¹ and assume we wish to hide the rule A,B ⇒ C,D. A privacy-preserving association mining algorithm would retrieve all transactions supporting this rule and then change some items (e.g., item C) in these transactions in order to change the confidence of the rule. Clearly, if item C has low relevance, the sanitization will not affect much the quality of the data. If, however, item C represents some critical information, the sanitization can result in severe damage to the data.

• Structure of the database: the information stored in a database is strongly influenced by the relationships between the different data items. These relationships are not always explicit. Consider as an example a Health Care database storing patients' records, of which the medicines taken by the patients are part. Consider two different medicines, say A and B, and assume that the following two constraints are defined: "The maximum amount of medicine A per day must be 10ml. If the patient also takes medicine B, the maximum amount of A must be 5ml". If a PPDM algorithm modifies the data concerning the amount of medicine A without taking into account these constraints, the effects on the health of the patient could be potentially catastrophic.

¹ We define as categorical database a database in which attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. An example is the Market Basket database.

Analyzing these observations, we argue that the general problem of current PPDM algorithms is related to Data Quality. In [70], a well known work in the data quality context, Orr notes that data quality and data privacy are often related in some way. More precisely, a high quality of data often corresponds to a low level of privacy. It is not simple to explain this correlation; in this introduction we try to give an intuitive explanation of it.

Any "sanitization" of a target database has an effect on the data quality of the database. It is natural to deduce that, in order to preserve the quality of the database, it is necessary to take into consideration some function Π(DB) representing the relevance and the use of the data contained in DB. Taking another small step in the discussion, it is evident that the function Π(DB) has a different output for different databases². This implies that a PPDM algorithm trying to preserve data quality needs to know specific information about the particular database to be protected before executing the sanitization. We believe, then, that data quality is particularly important in the case of critical databases, that is, databases containing data that can be critical for the life of society (e.g., medical databases, electrical control system databases, military databases, etc.). For this reason data quality must be considered as the central point of every type of data sanitization. In what follows we present a more detailed definition of DQ, we introduce a schema able to represent the Π function, and we present two algorithms based on the DQ concept.

² We use here the term "different" in its wide sense, i.e., different meaning of data, different use of data, different database architecture and schema, etc.

Chapter 1

Data Quality Concepts

Traditionally, DQ is a measure of the consistency between the data views presented by an information system and the same data in the real world [70]. This definition is strongly related to the classical definition of an information system as a "model of a finite subset of the real world" [56]. In more detail, Levitin and Redman [59] claim that DQ is the instrument by which it is possible to evaluate whether data models are well defined and data values accurate. The main problem with DQ is that its evaluation is relative [94], in that it usually depends on the context in which the data are used. DQ can be seen as a complex characteristic of the data itself. In the scientific literature, DQ is considered a multi-dimensional concept that in some environments involves both objective and subjective parameters [8, 104, 105]. Table 1.1 lists the most used DQ parameters [104].

Accuracy       Format            Comparability      Precision          Clarity
Reliability    Interpretability  Conciseness        Flexibility        Usefulness
Timeliness     Content           Freedom from bias  Understandability  Quantitativeness
Relevance      Efficiency        Informativeness    Currency           Sufficiency
Completeness   Importance        Level of Detail    Consistency        Usableness

Table 1.1: Data Quality parameters

In the context of PPDM, DQ has a similar meaning, with, however, an important difference: in the case of PPDM the real world is the original database, and we want to measure how close the sanitized database is to the original one with respect to some relevant properties. In other words, we are interested in assessing whether, given a target database, the sanitization phase will compromise the quality of the mining results that can be obtained from the sanitized database. In order to better understand how to measure DQ, we introduce a formalization of the sanitization phase. In a PPDM process, a given database DB is modified in order to obtain a new database DB′.


Definition 1

Let DB be the database to be sanitized. Let Γ_DB be the set of all the aggregate information contained in DB, and let Ξ_DB ⊆ Γ_DB be the set of the sensitive information to hide. A transformation ξ : D → D, where D is the set of possible instances of a DB schema, is a perfect sanitization if Γ_ξ(DB) = Γ_DB − Ξ_DB.

In other words, the ideal sanitization is the one that completely removes the sensitive high level information while preserving, at the same time, all the other information. This would be possible only if no constraints and relationships existed between the data and between the information contained in the database, or, roughly speaking, assuming the information to be a simple aggregation of data, if the intersection between the different pieces of information were always empty. However, due to the complexity of modern databases, this hypothetical scenario is not possible. In fact, as explained in the introduction, we must take into account not only the high level information contained in the database, but also the relationships between different information or different data, that is, the Π function mentioned in the previous section. In order to identify the possible parameters that can be used to assess DQ in the context of PPDM algorithms, we performed a set of tests. In more detail, we performed the following operations:

1. We built a set of different types of databases, with different data structures and containing information completely different from the point of view of meaning and usefulness.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damage (or loss of quality) to the informative asset of the database:

• Ambiguous transformation: this is the case in which the transformation introduces some uncertainty in the non sensitive information, which can then be misinterpreted. It is, for example, the case of aggregation algorithms and perturbation algorithms. It can even be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation: the sanitized database results incomplete. More specifically, some values of the attributes contained in the database are marked as "blank". For this reason information may result incomplete and cannot be used. This is typically the effect of the blocking PPDM algorithm class.

• Meaningless transformation: in this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit Constraints Violation: every database is characterized by some implicit constraints derived from the external world that are not directly represented in the database in terms of structure and relations (the example of medicines A and B presented in the previous section is one such case). Transformations that do not consider these constraints risk compromising the database, introducing inconsistencies.

The results of these experiments highlight the fact that, in the context of privacy preserving data mining, only a small portion of the parameters shown in Table 1.1 are of interest. More precisely, the relevant parameters are those that allow capturing a possible variation of meaning, or that are indirectly related to the meaning of the information stored in a target database. Therefore, we identify as the most appropriate DQ parameters for PPDM algorithms the following dimensions:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. Informally, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. Therefore, a tuple t is complete if, after the sanitization, every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of these parameters, for use in the remainder of the discussion.

Let OD be the original database and SD the sanitized database resulting from the application of the PPDM algorithm. Without losing generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let od_i (sd_i) be the i-th tuple in OD (SD); then od_ik (sd_ik) denotes the k-th attribute of od_i (sd_i). Moreover, let n be the total number of attributes of interest; we assume that the attributes in positions 1, ..., m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let sd_j be a tuple of SD. We say that sd_j is Accurate if ¬∃ od_i ∈ OD such that ((od_ik = sd_jk) ∀k = 1, ..., m) ∧ (∃(od_if ≠ sd_jf), (sd_jf ≠ NULL), f = m + 1, ..., n).

Definition 3

A tuple sd_j is Complete if (∃ od_i ∈ OD such that (od_ik = sd_jk) ∀k = 1, ..., m) ∧ (¬∃(sd_jf = NULL), f = m + 1, ..., n),

where by "NULL" value we mean a value without meaning in a well defined context.

Let C be the set of the constraints defined on database OD; in what follows we denote by c_ij the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easily possible to extend the measure to complex constraints.

Definition 4

A tuple sd_k is Consistent if ¬∃ c_ij ∈ C such that c_ij(sd_ki) = false, i = 1, ..., n.

Starting from these definitions, it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non accurate items (i.e., the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non complete items (i.e., the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name                 Short Explanation                              Expression
Accuracy Lack        The proportion of non accurate items in SD     λ_SD = |SD_nacc| / |SD|
Completeness Lack    The proportion of non complete items in SD     ϑ_SD = |SD_nc| / |SD|
Consistency Lack     The number of constraint violations in SD      ϖ_SD = N_c

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SD_nacc denotes the set of non accurate items, SD_nc denotes the set of non complete items and N_c denotes the number of constraint violations.
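To make these metrics concrete, the following is a minimal Python sketch of how they could be computed for a single-relation database; it is not the implementation used in the report, and the tuple layout, the key length m and all function names are our assumptions.

def accuracy_lack(od, sd, m):
    # lambda_SD: proportion of sanitized tuples that are not accurate.
    # Tuples are plain Python tuples; positions 0..m-1 are the key,
    # and None plays the role of NULL.
    index = {t[:m]: t for t in od}
    def accurate(s):
        o = index.get(s[:m])
        if o is None:
            return True  # no original tuple with the same key: Definition 2 holds
        return not any(s[f] is not None and s[f] != o[f]
                       for f in range(m, len(s)))
    return sum(not accurate(s) for s in sd) / len(sd)

def completeness_lack(sd, m):
    # theta_SD: proportion of sanitized tuples with a NULL non-key attribute.
    return sum(any(s[f] is None for f in range(m, len(s))) for s in sd) / len(sd)

def consistency_lack(sd, constraints):
    # varpi_SD: number of constraint violations; each constraint is a
    # predicate over a whole tuple, returning True when it is satisfied.
    return sum(1 for s in sd for c in constraints if not c(s))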

Chapter 2

The Data Quality Scheme

In the previous chapter we presented a way to measure DQ in the sanitized database. These parameters are, however, not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are aware neither of the relevance of the information stored in the database nor of the relationships among these pieces of information.

It is then necessary to provide a formal description that allows us to single out the aggregate information of interest for a target database and the relevance of the DQ properties for each aggregate information (for some aggregate information, not all the DQ properties have the same relevance). Moreover, for each aggregate information, it is necessary to provide a structure describing the relevance (in terms of the consistency, completeness and accuracy level required) at the attribute level, and the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG), used to represent the attributes involved in an aggregate information and their constraints, and of the Aggregate Information Schema (AIS). Moreover, we give a definition of the Aggregate Information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = ⟨name, AW, AV, CW, CV, CSV, Slink⟩,

where

• Name is the attribute id;

• AW is the accuracy weight for the target attribute;

• AV is the accuracy value;

• CW is the completeness weight for the target attribute;

• CV is the completeness value;

• CSV is the consistency value;

• Slink is the list of simple constraints.

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have different relevance within an aggregate information. For this reason, a set of weights is associated with every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered a relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e., whether a violation of a constraint causes a negligible or a severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows, the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = ⟨name, Constraint, CW, Clink, CSV⟩,

where

• Name is the constraint id;

• Constraint describes the constraint using some logic expression;

• CW is the weight of the constraint; it represents the relevance of this constraint in the AIS;

• CSV is the number of violations of the constraint;

• Clink is the list of complex constraints defined on SCC.

Definition 7

A Complex Constraint Class is defined as the tuple CCC = ⟨name, Operator, CW, CSV, SCC_link⟩, where

• Name is the complex constraint id;

• Operator is the "merging" operator by which the simple constraints are used to build the complex one;

• CW is the weight of the complex constraint;

• CSV is the number of violations;

• SCC_link is the list of all the SCCs that are related to the CCC.

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• a set of nodes NA, where each node is an Attribute Class;

• a set of nodes SCC, where each node describes a Simple Constraint Class;

• a set of nodes CCC, where each node describes a Complex Constraint Class;

• a set of direct edges L_{Nj,Nk}, with L_{Nj,Nk} ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).

As is the case for attributes and constraints, even at this level some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, ϖ, W_AIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS and W_AIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define ASSET (Aggregate information Schema Set) as the collection of all the relevant AISs of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more generally of a good ASSET, is fundamental to correctly characterize the target database.
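As an illustration of how an IQM schema could be represented in practice, the following Python dataclasses mirror Definitions 5-10; the field names follow the tuples above, while the concrete types are our assumptions.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttributeClass:                 # Definition 5
    name: str
    aw: float                         # accuracy weight
    av: float                         # accuracy value
    cw: float                         # completeness weight
    cv: float                         # completeness value
    csv: int                          # consistency value
    slink: List["SimpleConstraintClass"] = field(default_factory=list)

@dataclass
class SimpleConstraintClass:          # Definition 6
    name: str
    constraint: Callable[..., bool]   # the logic expression
    cw: float                         # constraint weight (relevance in the AIS)
    csv: int = 0                      # number of violations
    clink: List["ComplexConstraintClass"] = field(default_factory=list)

@dataclass
class ComplexConstraintClass:         # Definition 7
    name: str
    operator: str                     # "merging" operator over simple constraints
    cw: float
    csv: int = 0
    scc_link: List[SimpleConstraintClass] = field(default_factory=list)

@dataclass
class DMG:                            # Definition 8; the edges are implicit in
    attributes: List[AttributeClass] = field(default_factory=list)  # slink,
    simple_constraints: List[SimpleConstraintClass] = field(default_factory=list)
    complex_constraints: List[ComplexConstraintClass] = field(default_factory=list)

@dataclass
class AIS:                            # Definition 9: ⟨γ, ξ, λ, ϑ, ϖ, W_AIS⟩
    gamma: str                        # name
    xi: DMG
    accuracy: float                   # λ
    completeness: float               # ϑ
    consistency: float                # ϖ
    weight: float                     # W_AIS

Asset = List[AIS]                     # Definition 10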

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, in order to remotely monitor and manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist many problems related to system security and to privacy preservation.

Figure 2.1: Example of DMG. (The figure shows a DMG whose nodes are Data Attributes and Constraints, annotated with Accuracy, Completeness, Operator and Constraint Weight labels.)

More specifically, we take as an example a very current privacy problem caused by the new generation of home electricity meters. These new meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behaviors of the people living in the house (e.g., from 9 am to 1 pm the house is empty, from 1 pm to 2:30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on have a high relevance for the analysts who want to correctly analyze the power grid, to calculate, for example, the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e., a wrong power consumption estimation and, consequently, a blackout).

Figure 2.2: Example of ASSET. To every AIS contained in the ASSET, the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG, the three local DQ parameters are associated.

An IQM Example

The electric meter database contains various kinds of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe the whole database here. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the table storing the localization information of the electric meters and the table storing the per day consumption registered by a target electric meter.


More specifically, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter;

• Local Code: it is the code of the city/region in which the meter is;

• City: it contains the name of the city in which the home hosting the meter is;

• Network: it is the code of the electric network;

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter;

• Average Energy Consumption: it contains the average energy used by the target home during a day;

• Date: it contains the date of registration;

• Energy Consumption per Hour: it is a vector of attributes containing the average energy consumption per hour;

• Max Energy Consumption per Hour: it is a vector of attributes containing the maximum energy consumption per hour;

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.
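For concreteness, a record of each table might look as follows; all the values are invented for illustration, and the key names are ours (they are reused in the constraint sketch later in this section).

localization_row = {
    "energy_meter_id": "EM-00421",
    "local_code": 312,                    # constrained to the range 1-1000
    "city": "Ispra",
    "network": 7,
    "subnetwork": 21,                     # constrained to the range 0-50
}

per_day_row = {
    "energy_meter_id": "EM-00421",
    "date": "2007-11-05",
    "avg_consumption_kwh": 0.9,           # average over the whole day
    "consumption_per_hour": [0.9] * 24,   # average consumption, hour by hour
    "max_consumption_per_hour": [1.2] * 24,
    "max_consumption_kwh": 1.2,           # maximum over the whole day
}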

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason, it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated with the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

AIS Loc is mainly related to the data contained in the localization table. As can be seen from Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).

Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. (AIS Loc relates Local Code, with the constraint "value 1-1000", City, Network Number and Subnetwork code, with the constraint "value 0-50", together with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons relates Energy Counter Id, Date, Average energy consumption with the constraint "average energy used equal to" over the hourly values En 1 am ... En 12 pm, and Max energy consumption with the constraint "max equal to" over the hourly maxima Max 1 am ... Max 12 pm.)

Moreover, due to the relevance of the information about the regional localization of the electric meter, the weights associated with the completeness and the accuracy of the local code also have a high value.

A similar constraint exists on the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to adopt a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e., "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case, the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason, their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy Consumption per Hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, there exists a relation between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
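The constraints just described can be stated directly as predicates over a record of the consumption table; the following is an illustrative sketch (function names and the numeric tolerance are our assumptions), reusing the record layout shown earlier.

def avg_consistent(row, tol=1e-6):
    # Complex constraint: the declared daily average must stay consistent
    # with the 24 hourly consumption values, whatever permutation or
    # modification the sanitization applies to them.
    hourly = row["consumption_per_hour"]
    return abs(row["avg_consumption_kwh"] - sum(hourly) / len(hourly)) <= tol

def max_consistent(row):
    # Complex constraint: the declared daily maximum must equal the
    # largest of the hourly maxima.
    return row["max_consumption_kwh"] == max(row["max_consumption_per_hour"])

def hourly_max_not_below_avg(row):
    # Simple constraints: each hourly maximum cannot be smaller than the
    # corresponding hourly (average) consumption.
    return all(m >= a for m, a in zip(row["max_consumption_per_hour"],
                                      row["consumption_per_hour"]))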

By the use of these two schemas, we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, there exist algorithms designed to limit the malicious use of Clustering algorithms, of Classification algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about twenty years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important, because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf, respectively.


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.
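As a reference point for the hiding algorithms described below, this small Python sketch computes the support and confidence of a rule over a bitmap-style transactional database; the toy data and all names are ours.

# Items are column positions; a transaction supports item i when t[i] == 1.
D = [
    (1, 1, 1, 0),   # supports a, b, c
    (1, 1, 0, 0),   # supports a, b
    (1, 1, 1, 1),   # supports a, b, c, d
    (0, 1, 1, 0),   # supports b, c
]

def supp(itemset, db):
    # Fraction of transactions supporting every item in the itemset.
    return sum(all(t[i] == 1 for i in itemset) for t in db) / len(db)

def conf(antecedent, consequent, db):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return supp(antecedent | consequent, db) / supp(antecedent, db)

a, b, c, d = 0, 1, 2, 3
print(supp({a, b}, D))        # 0.75
print(conf({a, b}, {c}, D))   # 0.5 / 0.75 = 0.666...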

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X)

of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.
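As a small worked example of the second strategy (the numbers are purely illustrative): if Supp(X ⇒ Y) = 40% and Supp(X) = 50%, then Conf(X ⇒ Y) = 40/50 = 80%. Completing the missing items of X in transactions that partially support X (and do not fully support Y) raises Supp(X) while leaving Supp(X ⇒ Y) unchanged; bringing Supp(X) to 60% lowers the confidence to 40/60 ≈ 67%, and the process is iterated until the confidence falls below MCT.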

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.

INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
  1. T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
     foreach transaction t in T'_lr do
  2.   t.num_items = |I| − |lr ∩ t.values_of_items|
  3. Sort(T'_lr) in descending order of number of items of lr contained
     Repeat until (conf(r) < min_conf)
  4.   t = T'_lr[1]
  5.   set_all_ones(t.values_of_items, lr)
  6.   N_lr = N_lr + 1
  7.   conf(r) = N_r / N_lr
  8.   Remove t from T'_lr
  9. Remove r from Rh
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease
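Read as executable code, Algorithm 1 can be rendered in Python roughly as follows; this is our own sketch of the pseudocode above (the rule representation and helper logic are our choices), not the Codmine implementation.

# A rule is a pair (lr, rr) of antecedent/consequent item-position sets;
# transactions are mutable lists of 0/1 values.
def hide_by_confidence_decrease(db, rules, min_conf):
    for lr, rr in rules:
        n_r = sum(all(t[i] for i in lr | rr) for t in db)   # rule support count
        n_lr = sum(all(t[i] for i in lr) for t in db)       # antecedent support count
        # T'_lr: transactions partially (not fully) supporting lr
        # and not fully supporting rr
        t_lr = [t for t in db
                if any(t[i] for i in lr) and not all(t[i] for i in lr)
                and not all(t[i] for i in rr)]
        # Transactions already containing most of lr need the fewest changes.
        t_lr.sort(key=lambda t: sum(t[i] for i in lr), reverse=True)
        while t_lr and n_lr and n_r / n_lr >= min_conf:
            t = t_lr.pop(0)
            for i in lr:        # set_all_ones: complete lr in t
                t[i] = 1
            n_lr += 1           # t now fully supports lr; n_r is unchanged
    return db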


Algorithm 2
INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
  1. Tr = {t ∈ D | t fully supports r}
     foreach transaction t in Tr do
  2.   t.num_items = count(t)
  3. Sort(Tr) in ascending order of transaction size
     Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
  4.   t = Tr[1]
  5.   j = choose_item(rr)
  6.   set_to_zero(j, t.values_of_items)
  7.   N_r = N_r − 1
  8.   supp(r) = N_r / |D|
  9.   conf(r) = N_r / N_lr
 10.   Remove t from Tr
 11. Remove r from Rh
End

Algorithm 3
INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
  1. Tr = {t ∈ D | t fully supports r}
     foreach transaction t in Tr do
  2.   t.num_items = count(t)
  3. Sort(Tr) in decreasing order of transaction size
     Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
  4.   t = Tr[1]
  5.   j = choose_item(r)
  6.   set_to_zero(j, t.values_of_items)
  7.   N_r = N_r − 1
  8.   Recompute N_lr
  9.   supp(r) = N_r / |D|
 10.   conf(r) = N_r / N_lr
 11.   Remove t from Tr
 12.   Remove r from Rh
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is, analogously, the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r,

minconf(r) = (minsup(r) × 100) / maxsup(lr)   and   maxconf(r) = (maxsup(r) × 100) / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
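Under this encoding, the support and confidence bounds can be computed as in the following sketch, where '?' is the uncertainty symbol and the function names are ours.

# A transaction is a list whose entries are 1, 0 or '?'.
def minsup(itemset, db):
    # Percentage of transactions that certainly support the itemset.
    return 100 * sum(all(t[i] == 1 for i in itemset) for t in db) / len(db)

def maxsup(itemset, db):
    # Percentage of transactions that support or could support the itemset.
    return 100 * sum(all(t[i] in (1, '?') for i in itemset) for t in db) / len(db)

def minconf(lr, rr, db):
    return minsup(lr | rr, db) * 100 / maxsup(lr, db)

def maxconf(lr, rr, db):
    return maxsup(lr | rr, db) * 100 / minsup(lr, db)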


Algorithm 4
INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
  1. Sort(Lh) in descending order of size and support
  2. Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⟹ t supports Z}
Foreach Z in Lh
  3. Sort(TZ) in ascending order of transaction size
     Repeat until (supp(Z) < min_supp)
  4.   t = pop_from(TZ)
  5.   a = maximal support item in Z
  6.   aS = {X ∈ Lh | a ∈ X}
       Foreach X in aS
  7.     if (t ∈ TX) delete(t, TX)
  8.   delete(a, t, D)
End

Algorithm 5
INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
  1. Sort(Lh) in descending order of support and size
Foreach Z in Lh do
  2. i = 0
  3. j = 0
  4. ⟨TZ⟩ = sequence of transactions supporting Z
  5. ⟨Z⟩ = sequence of items in Z
     Repeat until (supp(Z) < min_supp)
  6.   a = ⟨Z⟩[i]
  7.   t = ⟨TZ⟩[j]
  8.   aS = {X ∈ Lh | a ∈ X}
       Foreach X in aS
  9.     if (t ∈ TX) then delete(t, TX)
 10.   delete(a, t, D)
 11.   i = (i + 1) modulo size(Z)
 12.   j = j + 1
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or in the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM.
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh.

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
    Foreach Z in Lh:
2.    TZ = {t ∈ D | t supports Z}
3.    Sort the transactions in TZ in ascending order of transaction size
      Repeat until (minsup(Z) < MST − SM):
4.      Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.      Update the supports of the affected itemsets
6.      Update the database D
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction.
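The effect of the "?" marks on the support interval can be made concrete with a small sketch. We assume here that each transaction stores 1, 0 or '?' per item; the representation and the function name are ours:

    def support_interval(D, itemset):
        """(minsup, maxsup) of an itemset over a blocked database: D maps
        transaction ids to dicts item -> 1, 0 or '?'."""
        n = len(D)
        certain = sum(1 for t in D.values()
                      if all(t.get(i) == 1 for i in itemset))          # only 1s count
        possible = sum(1 for t in D.values()
                       if all(t.get(i) in (1, '?') for i in itemset))  # 1s and ?s count
        return certain / n, possible / n

Each "?" placed by Algorithm 6 moves a transaction from the "certain" count to the "possible" count only, which is exactly how the minimum support is pushed below MST − SM while the maximum support is left untouched.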

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results of the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there a need to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. To illustrate the problem: during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor that, on the basis of these rules, can choose a wrong therapy for a patient.


INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
    Foreach r in Rh do:
1.    Tr = {t ∈ D | t fully supports r}
2.    For each t in Tr count the number of items in t
3.    Sort the transactions in Tr in ascending order of the number of items supported
      Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM):
4.      Choose the first transaction t ∈ Tr
5.      Choose the item j in rr with the highest minsup
6.      Place a "?" mark in the place of j in t
7.      Recompute minsup(r)
8.      Recompute minconf(r)
9.      Recompute the minconf of other affected rules
10.     Remove t from Tr
11.   Remove r from Rh
End

INPUT: the source database D, a set Rh of rules to hide, MCT and SM.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
    Foreach r in Rh do:
1.    T′lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.    For each t in T′lr count the number of items of lr in t
3.    Sort the transactions in T′lr in descending order of the number of items of lr supported
      Repeat until (minconf(r) < MCT − SM):
4.      Choose the first transaction t ∈ T′lr
5.      Place a "?" mark in t for the items in lr that are not supported by t
6.      Recompute maxsup(lr)
7.      Recompute minconf(r)
8.      Recompute the minconf of other affected rules
9.      Remove t from T′lr
10.   Remove r from Rh
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease.

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, with potentially unpleasant consequences. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, knowing nothing about the use of the database, the relationships between data, and so on. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We therefore introduce an approach to select the proper set of items, minimizing the impact on the database.
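Both side effects can be quantified by mining the original and the sanitized databases and comparing the resulting rule sets. A sketch, under the assumption that rules are represented as hashable objects (e.g., (antecedent, consequent) tuples) and have been extracted with the same MST/MCT thresholds, for instance via Apriori:

    def sanitization_side_effects(rules_original, rules_sanitized, sensitive):
        """Compare the rule sets mined from the original and the sanitized DB."""
        # Ghost rules: non-sensitive rules lost because of the sanitization.
        ghost = (set(rules_original) - set(sensitive)) - set(rules_sanitized)
        # Artifactual rules: rules that appear only after the sanitization.
        artifactual = set(rules_sanitized) - set(rules_original)
        return ghost, artifactual

Ideally both sets are empty; their sizes give the coarse DQ impact parameters mentioned above.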

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database to be modified. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms,


and more specifically in the family of Perturbation Algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that the Asset schema related to a target database is available, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified;

2. The items involved in s are identified;

3. By inspecting the Asset schema, the DQ damage is computed, simulating the modification of these items;

4. An item ranking is built considering the results of the previous operation;

5. The item with the lowest impact is selected;

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous, in terms of performance, in the case of very complex information. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database Asset. We call items of this type Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because in the Asset description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection and we use one of these Zero Impact Items in order to hide the target rule.
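Operationally, the Zero Impact test reduces to a set difference between the rule's items and the attributes mentioned in the Asset. A one-function sketch (names ours), assuming the Asset is summarized by the set of attributes appearing in its DMGs:

    def zero_impact_items(rule_items, asset_attributes):
        # Items of the rule appearing in no DMG of the Asset: modifying
        # them leaves the measured DQ of the database untouched.
        return set(rule_items) - set(asset_attributes)

For example, zero_impact_items({'A', 'B', 'C'}, {'A', 'C', 'D'}) returns {'B'}: item B can be modified without any Asset inspection.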


3.7 Data Quality Distortion Based Algorithm

In Distortion Based Algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by distorting some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming to work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As shown, the first operation required is the identification of the rule to hide. For this purpose it is possible to use the well known APRIORI algorithm [5], which, given a database, extracts all the contained rules having a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules, from which a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to bring the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule; a small sketch of this step-count computation follows the list below). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The Sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden;

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them; if this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect;

¹A Market Basket database is a database in which each transaction represents the products acquired by a customer.

²This threshold represents the confidence over which a rule can be identified as relevant and sure.


[Flow chart omitted. It depicts the DQDB procedure: identify a sensitive rule; identify the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Accuracy, Consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item's value from 1 to 0, until the rule is hidden.]

Figure 3.6: Flow Chart of the DQDB Algorithm.


3. Using the selected item, it modifies the selected transactions until the rule is hidden.
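The step count used in the impact simulation of point 2 follows directly from the confidence definition conf(r) = sup(r)/sup(ant(r)): removing a consequent item from a supporting transaction decreases sup(r) only, while removing an antecedent item decreases sup(r) and sup(ant(r)) together. A sketch, assuming absolute support counts (function name ours):

    def steps_to_hide(sup_r, sup_ant, threshold, in_antecedent):
        """Smallest N of distortion steps bringing the confidence below threshold.
        Consequent item: conf = (sup_r - N) / sup_ant.
        Antecedent item: conf = (sup_r - N) / (sup_ant - N)."""
        n = 0
        conf = sup_r / sup_ant
        while conf >= threshold:
            n += 1
            if in_antecedent:
                if sup_ant - n <= 0:
                    break                  # rule support exhausted
                conf = (sup_r - n) / (sup_ant - n)
            else:
                conf = (sup_r - n) / sup_ant
        return n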

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we now give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden.
OUTPUT: the set (possibly empty) Zitem of Zero Impact Items discovered.

1   Begin
2     Zitem = ∅
3     Rsup = Rh
4     Isup = list of the items contained in Rsup
5     While (Isup ≠ ∅)
6     {
7       res = 0
8       Select item from Isup
9       res = Search(Asset, item)
10      if (res == 0) then Zitem = Zitem + item
11      Isup = Isup − item
12    }
13    return(Zitem)
14  End

Figure 3.7: Zero Impact Algorithm.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the Threshold Th.
OUTPUT: the set Qitem ordered by Quality Impact.

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Confp = Confa
5     Nstep_if_ant = 0
6     Nstep_if_post = 0
7     While (Confa > Th) do
8     {
9       Nstep_if_ant++
10      Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
11    }
12    While (Confp > Th) do
13    {
14      Nstep_if_post++
15      Confp = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
16    }
17    For each item in Rh do
18    {
19      if (item ∈ ant(Rh)) then N = Nstep_if_ant
20      else N = Nstep_if_post
21      For each AIS ∈ Asset do
22      {
23        node = recover_item(AIS, item)
24        Accur_Cost = node.accuracy_weight * N
25        Constr_Cost = Constr_surf(N, node)
26        item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
27      }
28    }
29    sort_by_impact(Items)
30    Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (recall that ant(Rh) denotes the antecedent component of a rule).


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset, we can use the MergeSort algorithm, whose complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D.
OUTPUT: the sanitized database.

1   Begin
2     Zitem = DQDB_Zero_Impact(Asset, Rh)
3     if (Zitem ≠ ∅) then item = random_sel(Zitem)
4     else
5     {
6       Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7       item = best(Impact_set)
8     }
9     Tr = {t ∈ D | t fully supports Rh}
10    While (Rh.Conf > Threshold) do
11      sanitize(Tr, item)
12  End

Figure 3.9: The Distortion Based Sanitization Algorithm.

3.8 Data Quality Based Blocking Algorithm

In Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way, we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on Data Quality. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and the Data Quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy loss, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Item Impact Rank Algorithm.

[Flow chart omitted. It depicts the BBDB procedure: identify a sensitive rule; identify the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Completeness, Consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item's value to "?", until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm.

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
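Reusing the blocked-database representation from the earlier sketch (dicts mapping item → 1, 0 or '?'), the blocking sanitization step can be sketched as follows; minconf is assumed to be a callable returning the current minimum confidence of the rule, and all names are ours:

    def blocking_sanitize(D, rule_items, item, minconf, mct, sm):
        """Block `item` in transactions fully supporting the rule until the
        rule's minimum confidence drops below MCT - SM (cf. Figure 3.10)."""
        Tr = [t for t in D if all(t.get(i) == 1 for i in rule_items)]
        while Tr and minconf(D) >= mct - sm:
            Tr.pop()[item] = '?'    # the meaningless symbol, instead of 0
        return D

The only difference with respect to the distortion step is the written value, which is why only the DQ cost model (completeness instead of accuracy) changes.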

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized; D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules;

2. We apply our algorithm (DQDB or BBDB);

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh);


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the Threshold Th.
OUTPUT: the set Qitem ordered by Quality Impact.

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5     Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6     For each item in Rh do
7     {
8       if (item ∈ ant(Rh)) then N = Nstep_ant
9       else N = Nstep_post
10      For each AIS ∈ Asset do
11      {
12        node = recover_item(AIS, item)
13        Complet_Cost = node.completeness_weight * N
14        Constr_Cost = Constr_surf(N, node)
15        item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16      }
17    }
18    sort_by_impact(Items)
19    Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm.

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement of our strategy to address (but not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules);

2. Given the sensitive set, we identify the Global Set of Zero Impact Items;

3. We order the Zero Impact Set by the previously counted occurrences;

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules;

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set once they are hidden;

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed;

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen;

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase;

• Multi Zero Impact Hiding;

• Multi Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the single algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rule set Rs; the list of the database items.
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences.

1   Begin
2     Rsup = Rs
3     while (Rsup ≠ ∅) do
4     {
5       select r from Rsup
6       for (i = 0; i ≤ transaction_max_size; i++) do
7         if (item[i] ∈ r) then
8         {
9           item[i].count++
10          item[i].list = item[i].list ∪ r
11        }
12      Rsup = Rsup − r
13    }
14    sort(item)
15    Return(item)
    End

Figure 3.12: The Item Occurrences classification algorithm.

INPUT: the sensitive rule set Rs; the list of the database items.
OUTPUT: the list of the Zero Impact Items.

1   Begin
2     Rsup = Rs
3     Isup = list of all the items
4     Zitem = ∅
5     while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6     {
7       select r from Rsup
8       for each (item ∈ r) do
9       {
10        if (item ∈ Isup) then
11          if (item ∉ Asset) then Zitem = Zitem + item
12        Isup = Isup − item
13      }
14      Rsup = Rsup − r
15    }
16    return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm.


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to:

|sensitive set| * |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rule set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
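In Python, the occurrence classification of Figure 3.12 is naturally expressed with a Counter; this is a minimal sketch (names ours), with rules represented as sets of items:

    from collections import Counter

    def classify_items(sensitive_rules):
        """For every item, count how many sensitive rules contain it and keep
        the list of those rules (cf. Figure 3.12)."""
        occurrences = Counter()
        supported = {}
        for idx, rule in enumerate(sensitive_rules):
            for item in rule:
                occurrences[item] += 1
                supported.setdefault(item, []).append(idx)
        # Most shared items first, as the multi-rule strategy requires.
        return occurrences.most_common(), supported

The per-item rule lists returned here are exactly the optimization information mentioned above: they let the hiding phases retrieve, in one lookup, all the rules affected by a chosen item.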

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• The items supporting a rule in the sensitive rule set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set);

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion);


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set;

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present here the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rule set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D.
OUTPUT: a partially sanitized database PS.

1   Begin
2     Izm = {item ∈ Itemlist | item ∈ Zitem}
3     for (i = 0; i < |Izm|; i++) do
4     {
5       Rss = {r ∈ Rs | Izm[i] ∈ r}
6       Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7       while (Tss ≠ ∅) and (Rss ≠ ∅) do
8       {
9         select a transaction t from Tss
10        Tss = Tss − t
11        Sanitize(t, Izm[i])
12        Recompute_Sup_Conf(Rss)
13        ∀ r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
14        {
15          Rss = Rss − r
16          Rs = Rs − r
17        }
      }
    }
    End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm.

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rule set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rule set; each item carries the list of the supported rules;

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed; the maximum estimated Data Quality impact is associated with every item;

3. The items are sorted by maximum impact and number of supported rules;

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion);

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set;

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.
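Before the full listings, the black-box design can be illustrated with a short sketch: the hiding loop takes the sanitization step as a parameter, so blocking and distortion are interchangeable. Rules are modeled as item tuples and transactions as sets for brevity; all names are ours:

    def multi_rule_hide(D, rules, order, sanitize, hidden):
        """Burst-hide all rules following the precomputed item order.
        sanitize(t, item): the black-box step (blocking or distortion);
        hidden(D, r): checks the support/confidence thresholds for rule r."""
        remaining = set(rules)
        for item in order:
            targets = {r for r in remaining if item in r}
            if not targets:
                continue
            Tss = [t for t in D if any(set(r) <= t for r in targets)]
            while targets and Tss:
                sanitize(Tss.pop(), item)
                done = {r for r in targets if hidden(D, r)}
                targets -= done
                remaining -= done
            if not remaining:
                break
        return D

Passing a different sanitize callable changes the hiding technique without touching the DQ-driven item ordering, which is exactly the independence property claimed above.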

INPUT: the Asset Schema A associated to the target database; the sensitive rule set Rs to be hidden; the thresholds Th; the list of items.
OUTPUT: the sanitized database SDB.

1   Begin
2     Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3     for each item ∈ Iml do
4     {
5       for each rule ∈ item.list do
6       {
7         N = compute_step(item, thresholds)
8         For each AIS ∈ Asset do
9         {
10          node = recover_item(AIS, item)
11          item.list.rule_cost = item.list.rule_cost + Compute_Cost(node, N)
12        }
13        item.max_cost = max_cost(item)
14      }
15    }
      Sort Iml by max cost and number of rules supported
16    While (Rs ≠ ∅) do
17    {
18      Rss = {r ∈ Rs | Iml[1] ∈ r}
19      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
20      while (Tss ≠ ∅) and (Rss ≠ ∅) do
21      {
22        select a transaction t from Tss
23        Tss = Tss − t
24        Sanitize(t, Iml[1])
25        Recompute_Sup_Conf(Rss)
26        ∀ r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
27        {
28          Rss = Rss − r
29          Rs = Rs − r
30        }
31      }
32      Iml = Iml − Iml[1]
33    }
34  End

Figure 3.15: Low Data Quality Impact Algorithm.


INPUT: the sensitive rule set Rs; the database D; the Asset; the thresholds.
OUTPUT: a sanitized database PS.

1   Begin
2     Item = Item_Occ_Class(Rs, Itemlist)
3     Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4     if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5     if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
    End

Figure 3.16: The General Algorithm.

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated DQ-based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, year 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, year 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, year 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, year 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162 (1985).

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, year 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, year 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, year 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, year 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, year 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, year 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, year 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, year 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, year 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, year 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, year 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, year 1983 (July). IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, year 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), year 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz ed., North-Holland, year 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, year 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, year 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, year 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, year 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, year 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, year 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, year 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, year 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, year 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, year 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, year 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, year 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, year 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, year 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, year 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, year 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, year 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, year 1990.

[56] W. Kent. Data and reality. North Holland, New York, year 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, year 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, year 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, year 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, year 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", year 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, year 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, year 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, year 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, year 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, year 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, year 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, year 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, year 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, year 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, year 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, year 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, year 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, year 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, year 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, year 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, year 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, year 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October), 1948, pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, year 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, year 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, year 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, year 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, year 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, year 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, year 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, year 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, year 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, World Scientific Publishing Co., Armidale, Australia, 1994.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, year 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, year 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, year 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, year 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen.
Title: Privacy Preserving Data Mining, a Data Quality Approach.
Authors: Igor Nai Fovino and Marcelo Masera.
Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm.
EUR – Scientific and Technical Research series – ISSN 1018-5593.

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.


2

Contents

Introduction 5

1 Data Quality Concepts 7

2 The Data Quality Scheme 11

201 An IQM Example 13

3 Data Quality Based Algorithms 19

31 ConMine Algorithms 2032 Problem Formulation 2033 Rule hiding algorithms based on data deletion 2134 Rule hiding algorithms based on data fuzzification 2135 Considerations on Codmine Association Rule Algorithms 2536 The Data Quality Strategy (First Part) 2637 Data Quality Distortion Based Algorithm 2838 Data Quality Based Blocking Algorithm 3139 The Data Quality Strategy (Second Part) 32310 Pre-Hiding Phase 35311 Multi Rules Zero Impact Hiding 36312 Multi Rule Low Impact algorithm 38

4 Conclusions 41

3

4 CONTENTS

Introduction

IntroductionDifferent approaches have been introduced by researchers in the field of pri-

vacy preserving data mining These approaches have in general the advantageto require a minimum amount of input (usually the database the information toprotect and few other parameters) and then a low effort is required to the userin order to apply them However some tests performed by Bertino Nai andParasiliti [1097] show that actually some problems occur when applying PPDMalgorithms in contexts in which the meaning of the sanitized data has a criticalrelevance More specifically these tests (see Appendix A for more details) showthat the sanitization often introduces false information or completely changesthe meaning of the information These phenomenas are due to the fact thatcurrent approaches to PPDM algorithms do not take into account two relevantaspects

bull Relevance of data not all the information stored in the database hasthe same level of relevance and not all the information can be dealt inthe same way An example may clarify this concept Consider a categoricdatabase1 and assume we wish to hide the rule AB rarr CD A privacy-preserving association mining algorithm would retrieve all transactionssupporting this rule and then change some items (eg item C) in thesetransactions in order to change the confidence of the rule Clearly if theitem C has low relevance the sanitization will not affect much the qualityof the data If however item C represents some critical information thesanitization can result in severe damages to the data

bull Structure of the database information stored in a database is stronglyinfluenced by the relationships between the different data items Theserelationships are not always explicit Consider as an example an HealthCare database storing patientrsquos records The medicines taken by patientsare part of this database Consider two different medicines say A and B

and assume that the following two constraints are defined ldquoThe maximumamount of medicine A per day must be 10ml If the patient also takes

1We define a categoric database a database in which attributes have a finite (but eventuallylarge) number of distinct values with no ordering among the values An example is the MarketBasket database

5

6 CONTENTS

medicine B the max amount of A must be 5mlrdquo If a PPDM algorithmmodifies the data concerning to the amount of medicine A without takinginto account these constraints the effects on the health of the patientcould be potentially catastrophic

Analyzing these observations we argue that the general problem of currentPPDM algorithms is related to Data Quality In [70] a well known work in thedata quality context Orr notes that data quality and data privacy are in someway often related More precisely an high quality of data often corresponds alow level of privacy It is not simple to explain this correlation We try in thisintroduction to give an intuitive explanation of this fact in order to give an ideaof our intuition

Any ldquoSanitizationrdquo on a target database has an effect on the data qualityof the database It is obvious to deduce that in order to preserve the qualityof the database it is necessary to take into consideration some function Π(DB)representing the relevance and the use of the data contained in the DB Nowby taking another little step in the discussion it is evident that the functionΠ(DB) has a different output for different databases2 This implies that aPPDM algorithm trying to preserve the data quality needs to know specificinformation about the particular database to be protected before executing thesanitizationWe believe then that the data quality is particularly important for the case ofcritical database that is database containing data that can be critical for thelife of the society (eg medical database Electrical control system databasemilitary Database etc) For this reason data quality must be considered as thecentral point of every type of data sanitizationIn what follows we present a more detailed definition of DQ We introduce aschema able to represent the Π function and we present two algorithms basedon DQ concept

2We use here the term ldquodifferentrdquo in its wide sense ie different meaning of date differentuse of data different database architecture and schema etc

Chapter 1

Data Quality Concepts

Traditionally DQ is a measure of the consistency between the data views pre-sented by an information system and the same data in the real-world [70] Thisdefinition is strongly related with the classical definition of information systemas a ldquomodel of a finite subset of the real worldrdquo [56] More in detail Levitinand Redman [59] claim that DQ is the instrument by which it is possible toevaluate if data models are well defined and data values accurate The mainproblem with DQ is that its evaluation is relative [94] in that it usually de-pends from the context in which data are used DQ can be assumed as acomplex characteristic of the data itself In the scientific literature DQ is con-sidered a multi-dimensional concept that in some environments involves bothobjective and subjective parameters [8104105] Table 1 lists of the most usedDQ parameters [104]

Accuracy Format Comparability Precision ClarityReliability Interpretability Conciseness Flexybility UsefulnessTimeliness Content Freedom from bias Understandability QuantitativenessRelevance Efficiency Informativeness Currency SufficiencyCompleteness Importance Level of Detail consistence Usableness

Table 11 Data Quality parameters

In the context of PPDM DQ has a similar meaning with however an impor-tant difference The difference is that in the case of PPDM the real world is theoriginal database and we want to measure how closely the sanitized databaseis to the original one with respect to some relevant properties In other wordswe are interested in assessing whether given a target database the sanitizationphase will compromise the quality of the mining results that can be obtainedfrom the sanitized database In order to understand better how to measure DQwe introduce a formalization of the sanitization phase In a PPDM process agiven database DB is modified in order to obtain a new database DB prime

7

8 CHAPTER 1 DATA QUALITY CONCEPTS

Definition 1

Let DB be the database to be sanitized Let ΓDB be the set of all the aggregateinformation contained in DB Let ΞDB sube ΓDB be the set of the sensitive in-formation to hide A transformation ξ D rarr D where D is the set of possibleinstances of a DB schema is a perfect sanitization if Γξ(DB) = ΓDB minus ΞDB

In other words the ideal sanitization is the one that completely removes thesensitive high level information while preserving at the same time the otherinformation This would be possible in the case in which constraints and rela-tionship between the data and between the information contained in the data-base do not exist or roughly speaking assuming the information to be a simpleaggregation of data if the intersection between the different information is al-ways equal to empty However this hypothetical scenario due to the complexityof modern databases in not possible In fact as explained in the introductionof this section we must take into account not only the high level informationcontained in the database but we must also consider the relationships betweendifferent information or different data that is the Π function we mentioned inthe previous section In order to identify the possible parameters that can beused to assess the DQ in the context of PPDM Algorithms we performed a setof tests More in detail we perform the following operations

1. We built a set of databases of different types, with different data structures, containing information completely different from the meaning and usefulness points of view.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damages (or loss of quality) to the informative asset of the database:

• Ambiguous transformation. This is the case in which the transformation introduces some uncertainty in the non sensitive information, which can then be misinterpreted. It is, for example, the case of aggregation algorithms and perturbation algorithms. It can also be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation. The sanitized database is incomplete. More specifically, some values of the attributes contained in the database are marked as "Blank". For this reason information may be incomplete and cannot be used. This is typically the effect of the blocking PPDM algorithm class.

• Meaningless transformation. In this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit Constraints Violation. Every database is characterized by some implicit constraints derived from the external world that are not directly represented in the database in terms of structure and relations (the example of medicines A and B presented in the previous section could be an example). Transformations that do not consider these constraints risk compromising the database by introducing inconsistencies.

The results of these experiments highlight the fact that, in the context of privacy preserving data mining, only a small portion of the parameters shown in Table 1.1 are of interest. More specifically, the relevant parameters are those allowing us to capture a possible variation of meaning, or that are indirectly related to the meaning of the information stored in a target database. Therefore, we identify the following dimensions as the most appropriate DQ parameters for PPDM algorithms:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. In an informal way, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. Therefore a tuple t is complete if, after the sanitization, every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of these parameters, for use in the remainder of the discussion.

Let OD be the original database and SD be the sanitized database resulting from the application of the PPDM algorithm. Without loss of generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let od_i (sd_i) be the i-th tuple in OD (SD); then od_ik (sd_ik) denotes the k-th attribute of od_i (sd_i). Moreover, let n be the total number of the attributes of interest; we assume that the attributes in positions 1, …, m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let sd_j be a tuple of SD. We say that sd_j is Accurate if ¬∃ od_i ∈ OD such that ((od_ik = sd_jk) ∀k = 1, …, m) ∧ (∃(od_if ≠ sd_jf) with (sd_jf ≠ NULL), f = m+1, …, n).

Definition 3

A tuple sd_j is Complete if (∃ od_i ∈ OD such that (od_ik = sd_jk) ∀k = 1, …, m) ∧ (¬∃(sd_jf = NULL), f = m+1, …, n),

where by "NULL" value we mean a value without meaning in a well defined context.

Let C be the set of the constraints defined on database OD; in what follows we denote with c_ij the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easily possible to extend the measure to complex constraints.

Definition 4

A tuple sd_k is Consistent if ¬∃ c_ij ∈ C such that c_ij(sd_ki) = false, i = 1, …, n.

Starting from these definitions it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non accurate items (i.e. the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non complete items (i.e. the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name              | Short Explanation                          | Expression
Accuracy Lack     | The proportion of non accurate items in SD | λ_SD = |SD_nacc| / |SD|
Completeness Lack | The proportion of non complete items in SD | ϑ_SD = |SD_nc| / |SD|
Consistency Lack  | The number of constraint violations in SD  | $_SD = N_c

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SD_nacc denotes the set of non accurate items, SD_nc denotes the set of non complete items and N_c denotes the number of constraint violations.
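To make these metrics concrete, the following minimal sketch computes them over a toy representation; the encoding (tuples as Python lists whose first m fields form the primary key, None standing for NULL, constraints as single-attribute predicates) is our illustrative assumption, not part of the original formulation.

    def accuracy_lack(od, sd, m):
        # lambda_SD: proportion of non accurate tuples in SD (Definition 2)
        originals = {tuple(t[:m]): t for t in od}
        def accurate(t):
            o = originals.get(tuple(t[:m]))
            if o is None:
                return True  # no original with the same key: Definition 2 holds vacuously
            return all(t[f] is None or t[f] == o[f] for f in range(m, len(t)))
        return sum(1 for t in sd if not accurate(t)) / len(sd)

    def completeness_lack(sd, m):
        # theta_SD: proportion of tuples with a NULL non-key field (Definition 3)
        return sum(1 for t in sd if any(t[f] is None for f in range(m, len(t)))) / len(sd)

    def consistency_lack(sd, constraints):
        # number of constraint violations in SD (Definition 4);
        # constraints is a list of (attribute_index, predicate) pairs
        return sum(1 for t in sd for i, check in constraints if not check(t[i]))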

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sanitized database. These parameters are, however, not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are not aware of the relevance of the information stored in the database, nor of the relationships among these pieces of information.

It is then necessary to provide a formal description that allows us to highlight the aggregate information of interest for a target database and the relevance of the DQ properties for each piece of aggregate information (for some aggregate information not all the DQ properties have the same relevance). Moreover, for each piece of aggregate information, it is necessary to provide a structure describing the relevance (in terms of the consistency, completeness and accuracy level required) at the attribute level, together with the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG) (used to represent the attributes involved in an aggregate information and their constraints) and of the Aggregation Information Schema (AIS). Moreover, we give a definition for an Aggregate information Schema Set (ASSET). Before giving the definition of DMG, AIS and ASSET we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = ⟨name, AW, AV, CW, CV, CSV, Slink⟩,

where:

• Name is the attribute id;
• AW is the accuracy weight for the target attribute;
• AV is the accuracy value;
• CW is the completeness weight for the target attribute;
• CV is the completeness value;
• CSV is the consistency value;
• Slink is the list of simple constraints.

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have a different relevance in an aggregate information. For this reason a set of weights is associated to every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered as relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e. whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows, the definitions of simple constraint and complex constraint are shown.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = ⟨name, Constraint, CW, Clink, CSV⟩,

where:

• Name is the constraint id;
• Constraint describes the constraint using some logic expression;
• CW is the weight of the constraint; it represents the relevance of this constraint in the AIS;
• CSV is the number of violations of the constraint;
• Clink is the list of complex constraints defined on the SCC.

Definition 7

A Complex Constraint Class is defined as the tuple CCC = ⟨name, Operator, CW, CSV, SCC_link⟩, where:

• Name is the complex constraint id;
• Operator is the "merging" operator by which the simple constraints are used to build the complex one;
• CW is the weight of the complex constraint;
• CSV is the number of violations;
• SCC_link is the list of all the SCCs that are related to the CCC.

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• A set of nodes NA, where each node is an Attribute Class;
• A set of nodes SCC, where each node describes a Simple Constraint Class;
• A set of nodes CCC, where each node describes a Complex Constraint Class;
• A set of directed edges L_Nj,Nk, with L_Nj,Nk ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).

As in the case of attributes and constraints, even at this level some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, $, W_AIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, $ is the consistency of the AIS and W_AIS represents the relevance of the AIS in the database. In a database there is usually more than one piece of aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.
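A possible encoding of these structures, written as Python dataclasses, is sketched below; the field names mirror Definitions 5-10, while the class layout itself is only an illustrative assumption.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class SimpleConstraint:          # SCC (Definition 6)
        name: str
        constraint: Callable         # logic expression over an attribute value
        cw: float                    # constraint weight (relevance of a violation)
        csv: int = 0                 # number of violations
        clink: List["ComplexConstraint"] = field(default_factory=list)

    @dataclass
    class ComplexConstraint:         # CCC (Definition 7)
        name: str
        operator: str                # "merging" operator over simple constraints
        cw: float                    # weight of the complex constraint
        csv: int = 0                 # number of violations
        scc_link: List[SimpleConstraint] = field(default_factory=list)

    @dataclass
    class AttributeClass:            # ATC (Definition 5)
        name: str
        aw: float                    # accuracy weight
        av: float                    # accuracy value
        cw: float                    # completeness weight
        cv: float                    # completeness value
        csv: float                   # consistency value
        slink: List[SimpleConstraint] = field(default_factory=list)

    @dataclass
    class DMG:                       # Definition 8: node sets of the oriented graph
        attributes: List[AttributeClass]
        simple_constraints: List[SimpleConstraint]
        complex_constraints: List[ComplexConstraint]

    @dataclass
    class AIS:                       # Definition 9
        name: str
        dmg: DMG
        accuracy: float
        completeness: float
        consistency: float
        weight: float                # relevance of the AIS in the database

    Asset = List[AIS]                # Definition 10: collection of all relevant AIS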

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, in order to remotely monitor and manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist many problems related to system security and to privacy preservation.

Figure 2.1: Example of DMG (a graph linking a Data Attribute to its Accuracy and Completeness values and to a weighted Constraint through an Operator)

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. Such new meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behavior of the people living in this house (i.e. from 9 a.m. to 1 p.m. the house is empty, from 1 p.m. to 2.30 p.m. someone is at home, from 8 p.m. the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on have a high relevance for the analysts who want to correctly analyze the power grid, to calculate, for example, the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e. wrong power consumption estimation and, consequently, a blackout).

Figure 2.2: Example ASSET. To every AIS contained in the ASSET the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter, and the information related to the per day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table

• Energy Meter ID: it contains the ID of the energy meter;
• Local Code: it is the code of the city/region in which the meter is located;
• City: it contains the name of the city in which the home hosting the meter is located;
• Network: it is the code of the electric network;
• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table

• Energy Meter ID: it contains the ID of the energy meter;
• Average Energy Consumption: it contains the average energy used by the target home during a day;
• Date: it contains the date of registration;
• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour;
• Max Energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour;
• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive pieces of information related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated to the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
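Continuing the illustrative encoding sketched earlier, the local code constraint of AIS Loc could be instantiated as follows; the numeric weights are only indicative of the "high relevance" described in the text, not values from the original report.

    local_code_range = SimpleConstraint(
        name="local_code_range",
        constraint=lambda v: 1 <= v <= 1000,  # values outside the range are meaningless
        cw=0.9,                               # a violation is considered severe damage
    )
    local_code = AttributeClass(
        name="Local Code",
        aw=0.9, av=1.0,   # high accuracy weight (hypothetical value)
        cw=0.9, cv=1.0,   # high completeness weight (hypothetical value)
        csv=1.0,
        slink=[local_code_range],
    )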

Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. The AIS Loc DMG links Local Code (constrained to the range 1-1000), City, Network Number and Subnetwork code (constrained to the range 0-50), with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". The AIS Cons DMG links the Energy Counter Id, the Date, the hourly consumption attributes (En 1 am, …, En 12 pm) and the hourly maxima (Max 1 am, …, Max 12 pm) to the Average energy consumption ("average energy used equal to") and to the Max energy consumption ("max equal to").

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists even over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to think of a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while however keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation with the maximum consumption per hour. Finally, there exists a relation between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to make the database consistent is to maintain at least the individuality of the electric meters.
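As a sketch, the coherence conditions just described can be written as predicates over a per-day consumption record; the 24-element vectors and the tolerance eps are our assumptions, not part of the original schema.

    def avg_coherent(hourly, average_consumption, eps=1e-6):
        # complex constraint: hourly values may be permuted or modified,
        # but their mean must stay consistent with the declared daily average
        return abs(sum(hourly) / len(hourly) - average_consumption) <= eps

    def max_coherent(max_hourly, max_consumption):
        # complex constraint: the declared daily maximum must match the hourly maxima
        return max(max_hourly) == max_consumption

    def hourly_bounds(hourly, max_hourly):
        # simple constraints: the maximum consumption per hour cannot be
        # smaller than the average consumption for the same hour
        return all(mx >= avg for avg, mx in zip(hourly, max_hourly))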

By the use of these two schemas we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to counter the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a large range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution in the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, …, in} be a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or in an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.
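Restating the two measures compactly (this follows directly from the description above; the set notation is ours):

\[
\mathrm{Supp}(X \Rightarrow Y) = \frac{|\{\, t \in D : X \cup Y \subseteq t \,\}|}{|D|},
\qquad
\mathrm{Conf}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}.
\]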


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds (MST or MCT), which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T′_lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T′_lr do {
2.    t.num_items = |I| − |lr ∩ t.values_of_items|
    }
3.  Sort(T′_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.    t = T′_lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    N_lr = N_lr + 1
7.    conf(r) = N_r / N_lr
8.    Remove t from T′_lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2
INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    N_r = N_r − 1
8.    supp(r) = N_r / |D|
9.    conf(r) = N_r / N_lr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3
INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    N_r = N_r − 1
8.    Recompute N_lr
9.    supp(r) = N_r / |D|
10.   conf(r) = N_r / N_lr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r,

minconf(r) = (minsup(r) × 100) / maxsup(lr)   and   maxconf(r) = (maxsup(r) × 100) / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;
• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);
• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
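A minimal sketch of how these quantities can be computed is given below; the three-valued encoding of a transaction as a dict mapping each item to 0, 1 or None (None standing for the "?" symbol) is our assumption, not the original implementation.

    def min_sup(itemset, db):
        # fraction of transactions that certainly support the itemset
        return sum(1 for t in db if all(t[i] == 1 for i in itemset)) / len(db)

    def max_sup(itemset, db):
        # fraction of transactions that support or could support the itemset
        return sum(1 for t in db if all(t[i] != 0 for i in itemset)) / len(db)

    def min_conf(antecedent, consequent, db):
        return min_sup(antecedent | consequent, db) * 100 / max_sup(antecedent, db)

    def max_conf(antecedent, consequent, db):
        return max_sup(antecedent | consequent, db) * 100 / min_sup(antecedent, db)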


Algorithm 4
INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D, t ∈ TZ ⟹ t supports Z}
Foreach Z in Lh {
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp) {
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
7.      if (t ∈ TX) delete(t, TX)
      }
8.    delete(a, t, D)
    }
}
End

Algorithm 5
INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2.  i = 0
3.  j = 0
4.  ⟨TZ⟩ = sequence of transactions supporting Z
5.  ⟨Z⟩ = sequence of items in Z
    Repeat until (supp(Z) < min_supp) {
6.    a = ⟨Z⟩[i]
7.    t = ⟨TZ⟩[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
9.      if (t ∈ TX) then delete(t, TX)
      }
10.   delete(a, t, D)
11.   i = (i + 1) modulo size(Z)
12.   j = j + 1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one instead increases the maximum support value of the antecedent of the rule to hide, by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST − SM) {
4.    Place a ? mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem, we note that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non sensitive rules contained in the original database were erased by the sanitization (ghost rules). This behavior, in a critical database, could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, could choose a wrong therapy for a patient.


Algorithm 7
INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM) {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a ? mark for the place of j in t
7.    Recompute the minsup(r)
8.    Recompute the minconf(r)
9.    Recompute the minconf of other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8
INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T′_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T′_lr count the number of items of lr in t
3.  sort the transactions in T′_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM) {
4.    Choose the first transaction t ∈ T′_lr
5.    Place a ? mark in t for the items in lr that are not supported by t
6.    Recompute the maxsup(lr)
7.    Recompute the minconf(r)
8.    Recompute the minconf of other affected rules
9.    Remove t from T′_lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, knowing nothing about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms,


and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified;

2. The items involved in s are identified;

3. By inspecting the Asset schema, the DQ damage is computed by simulating the modification of these items;

4. An item ranking is built considering the results of the previous operation;

5. The item with the lowest impact is selected;

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database Asset. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the Asset description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection, and we use one of these Zero Impact Items in order to hide the target rule.
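The improved selection step can be summarized by the sketch below; search_asset (returning whether an item appears in some DMG of the Asset) and dq_impact (the simulated DQ damage) stand for the Asset inspection primitives described above and are assumed, not defined, here.

    def select_item(rule_items, asset):
        # prefer Zero Impact items: items not contained in any DMG of the Asset
        zero_impact = [i for i in rule_items if not search_asset(asset, i)]
        if zero_impact:
            return zero_impact[0]  # no Asset inspection needed
        # otherwise simulate the DQ damage of each item and pick the lowest
        return min(rule_items, key=lambda i: dq_impact(asset, i))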


3.7 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well known APRIORI algorithm [5], which, given a database, extracts all the contained rules that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non zero impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The Sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden;

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect;

¹A Market Basket Database is a database in which each transaction represents the products acquired by a customer.
²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


Figure 3.6: Flow chart of the DQDB Algorithm (identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, an item is selected from it, otherwise each item in the rule is inspected in the Asset, its DQ impact (Accuracy, Consistency) is evaluated and the items are ranked by quality impact; a transaction fully supporting the sensitive rule is then selected and its value converted from 1 to 0 until the rule is hidden)

3. Using the selected item, it modifies the selected transactions until the rule is hidden.
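The number of transactions n that must be modified can also be derived in closed form from the confidence definition (the derivation is ours, assuming Th < 1): if the selected item belongs to the consequent, only Supp(Rh) decreases, while if it belongs to the antecedent, both supports decrease, giving respectively

\[
\frac{Supp(R_h) - n}{Supp(ant(R_h))} < Th \;\Longrightarrow\; n > Supp(R_h) - Th \cdot Supp(ant(R_h)),
\]
\[
\frac{Supp(R_h) - n}{Supp(ant(R_h)) - n} < Th \;\Longrightarrow\; n > \frac{Supp(R_h) - Th \cdot Supp(ant(R_h))}{1 - Th}.
\]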

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea about its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero impact items discovered

1  Begin
2    Zitem = ∅
3    Rsup = Rh
4    Isup = list of the items contained in Rsup
5    While (Isup ≠ ∅) {
6      res = 0
7      Select item from Isup
8      res = Search(Asset, item)
9      if (res == 0) then Zitem = Zitem + item
10     Isup = Isup − item
11   }
12   return(Zitem)
13 End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Confp = Confa
5    Nstep_if_ant = 0
6    Nstep_if_post = 0
7    While (Confa > Th) do {
8      Nstep_if_ant++
9      Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
10   }
11   While (Confp > Th) do {
12     Nstep_if_post++
13     Confp = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
14   }
15   For each item in Rh do {
16     if (item ∈ ant(Rh)) then N = Nstep_if_ant
17     else N = Nstep_if_post
18     For each AIS ∈ Asset do {
19       node = recover_item(AIS, item)
20       Accur_Cost = node.accuracy_Weight · N
21       Constr_Cost = Constr_surf(N, node)
22       item.impact = item.impact + (Accur_Cost · AIS.Accur_weight) + (Constr_Cost · AIS.Constr_weight)
23     }
24   }
25   sort_by_impact(Items)
26   Return(Items)
27 End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| · O(NC). Finally, in order to sort the itemset, we can use the MergeSort algorithm, and then the complexity is |itemset| · O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the Sanitized Database

1  Begin
2    Zitem = DQDB_Zero_Impact(Asset, Rh)
3    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4    else {
5      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6      item = best(Impact_set)
7    }
8    Tr = {t ∈ D | t fully supports Rh}
9    While (Rh.Conf > Threshold) do
10     sanitize(Tr, item)
11 End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and to the Data Quality evaluation. In fact, in this case it does not make


sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

Figure 3.10: Flow chart of the BBDB Algorithm (the structure is the same as in Figure 3.6, but the DQ impact is evaluated on the pair Completeness-Consistency and the value of the selected item is converted to "?")

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules;

2. We apply our algorithm (DQDB or BBDB);

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh);


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6    For each item in Rh do {
7      if (item ∈ ant(Rh)) then N = Nstep_ant
8      else N = Nstep_post
9      For each AIS ∈ Asset do {
10       node = recover_item(AIS, item)
11       Complet_Cost = node.Complet_Weight · N
12       Constr_Cost = Constr_surf(N, node)
13       item.impact = item.impact + (Complet_Cost · AIS.Complet_weight) + (Constr_Cost · AIS.Constr_weight)
14     }
15   }
16   sort_by_impact(Items)
17   Return(Items)
18 End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that in some sanitizations items are modified that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules);

2. We identify, given the sensitive set, the Global Set of Zero Impact Items;

3. We order the Zero Impact Set by considering the previously counted occurrences;

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules;

5. Simultaneously, all the rules supported by the chosen item are sanitized, and removed from the sensitive set when they result hidden;

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed;

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen;

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase;
• Multi Zero Impact Hiding;
• Multi Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the individual algorithms, giving at the end a more general algorithm.
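A compact sketch of the resulting three-block driver follows; count_occurrences, multi_zero_impact_hiding and multi_low_impact_hiding stand for the algorithms of Figures 3.12-3.15 and are assumed, not defined, here.

    def sanitize_database(db, asset, sensitive_rules, thresholds):
        # Pre-Hiding phase: classify items by the number of sensitive rules they support
        occurrences = count_occurrences(sensitive_rules, db.items)
        # Multi Zero Impact Hiding: hide in a burst every rule supported by Zero Impact items
        remaining = multi_zero_impact_hiding(db, asset, sensitive_rules, occurrences, thresholds)
        # Multi Low Impact Hiding: handle the rules that survived the previous phase
        if remaining:
            multi_low_impact_hiding(db, asset, remaining, occurrences, thresholds)
        return db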


3.10 Pre-Hiding Phase

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do {
4      select r from Rsup
5      for (i = 0; i ≤ transaction_max_size; i++) do
6        if (item[i] ∈ r) then {
7          item[i].count++
8          item[i].list = item[i].list ∪ r
9        }
10     Rsup = Rsup − r
11   }
12   sort(item)
13   Return(item)
14 End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6      select r from Rsup
7      for each (item ∈ r) do {
8        if (item ∈ Isup) then {
9          if (item ∉ Asset) then Zitem = Zitem + item
10         Isup = Isup − item
11       }
12     }
13     Rsup = Rsup − r
14   }
15   return(Zitem)
16 End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by every item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|.

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• The items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set);

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion);


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set;

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact items, the database D
OUTPUT: a partially sanitized database PS

1 Begin
2   Izm = {item ∈ Itemlist | item ∈ Zitem}
3   for (i = 0; i ≠ |Izm|; i + +) do
4   {
5     Rss = {r ∈ Rs | Izm[i] ∈ r}
6     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7     while (Tss ≠ ∅) and (Rss ≠ ∅) do
8     {
9       select a transaction t from Tss
10      Tss = Tss − t
11      Sanitize(t, Izm[i])
12      Recompute_Sup_Conf(Rss)
13      ∀ r ∈ Rss | (r.sup < min_sup) Or (r.conf < min_conf) do
14      {
15        Rss = Rss − r
16        Rs = Rs − r
      }
    }
  }
17 End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is then associated with every item (see the sketch after this list).

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (by blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, we remove it from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all the rules are sanitized.
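A sketch of the impact-ranking step (steps 1-3 above) is given below, under illustrative assumptions: asset maps an item to the list of (accuracy weight, completeness weight) pairs of the DMG nodes in which it appears, and steps_needed[item][rule] plays the role of the compute_step estimate of Figure 3.15; all names are ours.

```python
def rank_items_by_impact(items, supported_rules, asset, steps_needed):
    """Order the candidate items by estimated DQ impact (steps 1-3 of the
    Multi Rule Low Impact strategy)."""
    ranking = []
    for item in items:
        rules_of_item = supported_rules.get(item, [])
        max_cost = 0.0
        for rule in rules_of_item:
            n = steps_needed[item][rule]
            # N modifications, each weighted by the DMG nodes it touches
            cost = sum(n * (aw + cw) for aw, cw in asset.get(item, []))
            max_cost = max(max_cost, cost)
        ranking.append((item, max_cost, len(rules_of_item)))
    # lowest estimated impact first; ties favour items hiding more rules
    ranking.sort(key=lambda e: (e[1], -e[2]))
    return ranking
```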

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. Then it tries to hide the rules by modifying the Zero Impact items and finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.
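The black-box view of the sanitization operator can be made concrete as follows; this is a sketch on the bitmap representation used above, and the BLANK marker is an illustrative choice of unknown-value symbol.

```python
BLANK = "?"   # illustrative unknown-value marker used by blocking

def distortion_sanitize(transaction, item):
    """Distortion: flip the item's bitmap value from 1 to 0."""
    transaction[item] = 0

def blocking_sanitize(transaction, item):
    """Blocking: replace the item's bitmap value with the unknown mark."""
    transaction[item] = BLANK
```

Either function can be passed unchanged as the sanitize parameter of the hiding loop above, which is exactly the interchangeability the strategy relies on; note that with blocking the support and confidence become interval estimates, so the recomputation step would have to be adapted accordingly.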

INPUT: the Asset schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1 Begin
2   Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3   for each item ∈ Iml do
4   {
5     for each rule ∈ item.rulelist do
6     {
7       N = compute_step(item, thresholds)
8       for each AIS ∈ Asset do
9       {
10        node = recover_item(AIS, item)
11        itemlist.rule.cost = itemlist.rule.cost + ComputeCost(node, N)
12      }
13      item.maxcost = maxcost(item)
14    }
  }
15  Sort Iml by max cost and rules supported
16  While (Rs ≠ ∅) do
17  {
18    Rss = {r ∈ Rs | Iml[1] ∈ r}
19    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
20    while (Tss ≠ ∅) and (Rss ≠ ∅) do
21    {
22      select a transaction t from Tss
23      Tss = Tss − t
24      Sanitize(t, Iml[1])
25      Recompute_Sup_Conf(Rss)
26      ∀ r ∈ Rss | (r.sup < min_sup) Or (r.conf < min_conf) do
27      {
28        Rss = Rss − r
29        Rs = Rs − r
30      }
31    }
32    Iml = Iml − Iml[1]
33  }
34 End

Figure 3.15: The Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1 Begin
2   Item = Item_Occ_Class(Rs, Itemlist)
3   Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4   if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5   if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
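Read together, the pieces compose as in Figure 3.16. The hypothetical driver below reuses the sketches above; as a simplification, the low impact phase is approximated by running the same hiding loop over the impact-ranked items rather than a separate procedure.

```python
def sanitize_database(db, rules, items, asset, steps_needed,
                      sanitize, min_sup, min_conf):
    """Illustrative composition of the whole strategy (Figure 3.16)."""
    flat = {rid: x | y for rid, (x, y) in rules.items()}
    zitems = zero_impact_items(flat, items, set(asset))
    if zitems:
        zero_impact_hiding(db, rules, zitems, sanitize, min_sup, min_conf)
    if rules:   # rules without Zero Impact items: low impact phase
        flat = {rid: x | y for rid, (x, y) in rules.items()}
        _, supported = classify_item_occurrences(flat, items)
        ranked = [it for it, _, _ in
                  rank_items_by_impact(items, supported, asset, steps_needed)]
        zero_impact_hiding(db, rules, ranked, sanitize, min_sup, min_conf)
    return db
```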

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem; in fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated data quality based PPDM algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases: a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principle of Database System pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference of Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in Multi Input Multi Output Information Systems Management Science Vol 31 Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955



[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset counting and implication rules for market basket data In Proc of the ACM SIGMOD International Conference on Management of Data year 1997 ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass): Theory and Results Advances in Knowledge Discovery and Data Mining AAAI Press/MIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining: An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM Trans Database Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control J Am Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support in proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Computer Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databases Computer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Reading Mass 1982

[26] V Dhar Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems Information Systems Volume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclosure Control Methods for Microdata Confidentiality Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies pp 113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Holland year 2002

[28] P Domingos and M Pazzani Beyond independence: Conditions for the optimality of the simple Bayesian classifier Proceedings of the Thirteenth International Conference on Machine Learning pp 105-112 San Francisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly 1999

[30] P O Duda and P E Hart Pattern Classification and Scene Analysis Wiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vs data utility: The R-U confidentiality map Tech Rep No 121 National Institute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in vertically partitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizes shapes and densities in noisy high dimensional data In Proceedings of the SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data Mining SIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACM Press


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architecture Advances in Neural Information Processing Systems 2 pp 524-532 Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectures on Physics v I Reading Massachusetts Addison-Wesley Publishing Company year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines Proc Tenth ACM Symposium on Theory of Computing (1978) pp 114-118 ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discovery in Databases: An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturing testing IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEE Press

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE: An efficient clustering algorithm for large databases In Proceedings of the ACM SIGMOD Conference pp 73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK: A robust clustering algorithm for categorical attributes In Proceedings of the 15th ICDE pp 512-521 Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining: Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Jim Gray Series Editor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidate generation In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data Dallas Texas USA May 2000 ACM Press

[48] M A Hanson and R L Brekke Workload management expert system - combining neural networks and rule-based programming in an operational application In Proceedings Instrument Society of America pp 1721-1726 year 1988


[49] J Hartigan and M Wong Algorithm AS136: A k-means clustering algorithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large multimedia databases with noise In Proceedings of the 4th ACM SIGKDD pp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery pp 24-31 year 2002 IEEE Educational Activities Department

[54] G Karypis E Han and V Kumar CHAMELEON: A hierarchical clustering algorithm using dynamic modeling COMPUTER 32 pp 68-75 year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data: An Introduction to Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The EM algorithm for graphical association models with missing data Computational Statistics and Data Analysis 19 (2) pp 191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion Detection In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource: properties implications and prescriptions Sloan Management Review Cambridge Vol 40 Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal of Cryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journal of the American Society for Information Science 48 (3) pp 254-269 year 1997


[62] M Masera I Nai Fovino R Sgnaolin A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure 29th ESReDA Seminar "Systems Analysis for a More Secure World" year 2005

[63] G McLachlan and K Basford Mixture Models: Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruning In Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology PhD thesis Library and Information Studies Rutgers New Brunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system for information downgrading In Proceedings of the 5th Joint Conference on Information Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in Privacy Preserving Data Mining ACM SIGKDD 3rd Workshop on Data Mining Standards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent Itemset Mining Proceedings of the IEEE international conference on Privacy security and data mining pp 43-54 year 2002 Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering by Data Transformation In Proceedings of the 18th Brazilian Symposium on Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41 Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology for compromise of confidential information in statistical databases ACM Trans Database Syst 12 4 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithm for Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strong rules Knowledge Discovery in Databases pp 229-238 AAAI/MIT Press year 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C4.5: Programs for Machine Learning Morgan Kaufmann year 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp 81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Data Mining and Knowledge Discovery vol 4 n 4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Categorical and Numeric Attributes Proc of International Conference on Data Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in Association Rule Mining In Proceedings of the 28th International Conference on Very Large Databases year 2002 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning in probabilistic networks with hidden variables In International Joint Conference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kaufmann

[83] D E Rumelhart G E Hinton and R J Williams Learning internal representations by error propagation Parallel distributed processing: Explorations in the microstructure of cognition Vol 1: Foundations pp 318-362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to preserve confidentiality of business statistics In Proceedings of the 2nd International Workshop on Statistical Database Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm for mining association rules in large databases In Proceedings of the Conference on Very Large Databases Zurich Switzerland September 1995 Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases Comput J 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 (July and October) 1948 pp 379-423 and 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communication University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases: characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK: An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach to Rule Induction from databases IEEE Transactions On Knowledge And Data Engineering vol 3 n 4 August 1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Proceedings of the 21st International Conference on Very Large Data Bases pp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACM Vol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure Problems Research in Official Statistics vol 4 pp 7-22 year 2001

[96] M Trottini Decision models for data disclosure limitation Carnegie Mellon University Available at httpwwwnissorgdgiiTRThesisTrottini-finalpdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University Codmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Mining in Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin Y Theodoridis State-of-the-art in Privacy Preserving Data Mining SIGMOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E Dasseni Association Rule Hiding IEEE Transactions on Knowledge and Data Engineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML: The Snob program In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence pp 37-44 UNE World Scientific Publishing Co Armidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy: An Anthology Cambridge University Press year 1984

[103] G J Walters Human Rights in an Information Age: A Philosophical Analysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in Ontological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95 Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy: what Data Quality Means to Data Consumers Journal of Management Information Systems Vol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure control Lecture Notes in Statistics Vol 155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signature Recognition 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press

European Commission, EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm. EUR – Scientific and Technical Research series – ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 5: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

Contents

Introduction 5

1 Data Quality Concepts 7

2 The Data Quality Scheme 11

201 An IQM Example 13

3 Data Quality Based Algorithms 19

31 ConMine Algorithms 2032 Problem Formulation 2033 Rule hiding algorithms based on data deletion 2134 Rule hiding algorithms based on data fuzzification 2135 Considerations on Codmine Association Rule Algorithms 2536 The Data Quality Strategy (First Part) 2637 Data Quality Distortion Based Algorithm 2838 Data Quality Based Blocking Algorithm 3139 The Data Quality Strategy (Second Part) 32310 Pre-Hiding Phase 35311 Multi Rules Zero Impact Hiding 36312 Multi Rule Low Impact algorithm 38

4 Conclusions 41

3

4 CONTENTS

Introduction

IntroductionDifferent approaches have been introduced by researchers in the field of pri-

vacy preserving data mining These approaches have in general the advantageto require a minimum amount of input (usually the database the information toprotect and few other parameters) and then a low effort is required to the userin order to apply them However some tests performed by Bertino Nai andParasiliti [1097] show that actually some problems occur when applying PPDMalgorithms in contexts in which the meaning of the sanitized data has a criticalrelevance More specifically these tests (see Appendix A for more details) showthat the sanitization often introduces false information or completely changesthe meaning of the information These phenomenas are due to the fact thatcurrent approaches to PPDM algorithms do not take into account two relevantaspects

bull Relevance of data not all the information stored in the database hasthe same level of relevance and not all the information can be dealt inthe same way An example may clarify this concept Consider a categoricdatabase1 and assume we wish to hide the rule AB rarr CD A privacy-preserving association mining algorithm would retrieve all transactionssupporting this rule and then change some items (eg item C) in thesetransactions in order to change the confidence of the rule Clearly if theitem C has low relevance the sanitization will not affect much the qualityof the data If however item C represents some critical information thesanitization can result in severe damages to the data

bull Structure of the database information stored in a database is stronglyinfluenced by the relationships between the different data items Theserelationships are not always explicit Consider as an example an HealthCare database storing patientrsquos records The medicines taken by patientsare part of this database Consider two different medicines say A and B

and assume that the following two constraints are defined ldquoThe maximumamount of medicine A per day must be 10ml If the patient also takes

1We define a categoric database a database in which attributes have a finite (but eventuallylarge) number of distinct values with no ordering among the values An example is the MarketBasket database

5

6 CONTENTS

medicine B the max amount of A must be 5mlrdquo If a PPDM algorithmmodifies the data concerning to the amount of medicine A without takinginto account these constraints the effects on the health of the patientcould be potentially catastrophic

Analyzing these observations we argue that the general problem of currentPPDM algorithms is related to Data Quality In [70] a well known work in thedata quality context Orr notes that data quality and data privacy are in someway often related More precisely an high quality of data often corresponds alow level of privacy It is not simple to explain this correlation We try in thisintroduction to give an intuitive explanation of this fact in order to give an ideaof our intuition

Any ldquoSanitizationrdquo on a target database has an effect on the data qualityof the database It is obvious to deduce that in order to preserve the qualityof the database it is necessary to take into consideration some function Π(DB)representing the relevance and the use of the data contained in the DB Nowby taking another little step in the discussion it is evident that the functionΠ(DB) has a different output for different databases2 This implies that aPPDM algorithm trying to preserve the data quality needs to know specificinformation about the particular database to be protected before executing thesanitizationWe believe then that the data quality is particularly important for the case ofcritical database that is database containing data that can be critical for thelife of the society (eg medical database Electrical control system databasemilitary Database etc) For this reason data quality must be considered as thecentral point of every type of data sanitizationIn what follows we present a more detailed definition of DQ We introduce aschema able to represent the Π function and we present two algorithms basedon DQ concept

2We use here the term ldquodifferentrdquo in its wide sense ie different meaning of date differentuse of data different database architecture and schema etc

Chapter 1

Data Quality Concepts

Traditionally DQ is a measure of the consistency between the data views pre-sented by an information system and the same data in the real-world [70] Thisdefinition is strongly related with the classical definition of information systemas a ldquomodel of a finite subset of the real worldrdquo [56] More in detail Levitinand Redman [59] claim that DQ is the instrument by which it is possible toevaluate if data models are well defined and data values accurate The mainproblem with DQ is that its evaluation is relative [94] in that it usually de-pends from the context in which data are used DQ can be assumed as acomplex characteristic of the data itself In the scientific literature DQ is con-sidered a multi-dimensional concept that in some environments involves bothobjective and subjective parameters [8104105] Table 1 lists of the most usedDQ parameters [104]

Accuracy Format Comparability Precision ClarityReliability Interpretability Conciseness Flexybility UsefulnessTimeliness Content Freedom from bias Understandability QuantitativenessRelevance Efficiency Informativeness Currency SufficiencyCompleteness Importance Level of Detail consistence Usableness

Table 11 Data Quality parameters

In the context of PPDM DQ has a similar meaning with however an impor-tant difference The difference is that in the case of PPDM the real world is theoriginal database and we want to measure how closely the sanitized databaseis to the original one with respect to some relevant properties In other wordswe are interested in assessing whether given a target database the sanitizationphase will compromise the quality of the mining results that can be obtainedfrom the sanitized database In order to understand better how to measure DQwe introduce a formalization of the sanitization phase In a PPDM process agiven database DB is modified in order to obtain a new database DB prime

7

8 CHAPTER 1 DATA QUALITY CONCEPTS

Definition 1

Let DB be the database to be sanitized Let ΓDB be the set of all the aggregateinformation contained in DB Let ΞDB sube ΓDB be the set of the sensitive in-formation to hide A transformation ξ D rarr D where D is the set of possibleinstances of a DB schema is a perfect sanitization if Γξ(DB) = ΓDB minus ΞDB

In other words the ideal sanitization is the one that completely removes thesensitive high level information while preserving at the same time the otherinformation This would be possible in the case in which constraints and rela-tionship between the data and between the information contained in the data-base do not exist or roughly speaking assuming the information to be a simpleaggregation of data if the intersection between the different information is al-ways equal to empty However this hypothetical scenario due to the complexityof modern databases in not possible In fact as explained in the introductionof this section we must take into account not only the high level informationcontained in the database but we must also consider the relationships betweendifferent information or different data that is the Π function we mentioned inthe previous section In order to identify the possible parameters that can beused to assess the DQ in the context of PPDM Algorithms we performed a setof tests More in detail we perform the following operations

1 We built a set of different types of database with different data structuresand containing information completely different from the meaning and theusefulness point of view

2 We identified a set of information judged relevant for every type of data-base

3 We applied a set of PPDM algorithms to every database in order to protectthe relevant information

Observing the results we identified the following four categories of damages(or loss of quality) to the informative asset of the database

bull Ambiguous transformation This is the case in which the transforma-tion introduces some uncertainty in the non sensitive information thatcan be then misinterpreted It is for example the case of aggregation algo-rithms and Perturbation algorithms It can be viewed even as a precisionlack when for example some numerical data are standardized in order tohide some information

bull Incomplete transformation The sanitized database results incom-plete More specifically some values of the attributes contained in thedatabase are marked as ldquoBlankrdquo For this reason information may resultincomplete and cannot be used This is typically the effect of the BlockingPPDM algorithm class

bull Meaningless transformation In this case the sanitized database con-tains information without meaning That happen in many cases when

9

perturbation or heuristic-based algorithms are applied without taking intoaccount the meaning of the information stored into the database

bull Implicit Constraints Violation Every database is characterized bysome implicit constraints derived from the external world that are notdirectly represented in the database in terms of structure and relation (theexample of medicines A and B presented in the previous section could bean example) Transformations that do not consider these constraints riskto compromise the database introducing inconsistencies in the database

The results of these experiments magnify the fact that in the context ofprivacy preserving data mining only a little portion of the parameters showedin table 11 are of interest More in details are relevant the parameters allowingto capture a possible variation of meaning or that are indirectly related withthe meaning of the information stored in a target database Therefore weidentify as most appropriate DQ parameters for PPDM Algorithms the followingdimensions

bull Accuracy it measures the proximity of a sanitized value arsquo to the originalvalue a In an informal way we say that a tuple t is accurate if thesanitization has not modified the attributes contained in the tuple t

bull Completeness it evaluates the percentage of data from the original data-base that are missing from the sanitized database Therefore a tuple t iscomplete if after the sanitization every attribute is not empty

bull Consistency it is related to the semantic constraints holding on the dataand it measures how many of these constraints are still satisfied after thesanitization

We now present the formal definitions of those parameters for use in theremainder of the discussion

Let OD be the original database and SD be the sanitized database resultingfrom the application of the PPDM algorithm Without loosing generality andin order to make simpler the following definitions we assume that OD (and con-sequently SD) be composed by a single relation We also adopt the positionalnotation to denote attributes in relations Thus let odi (sdi) be the i-th tuplein OD (SD) then odik (sdik) denotes the kth attribute of odi (sdi) Moreoverlet n be the total number of the attributes of interest we assume that attributesin positions 1 m (m le n) are the primary key attributes of the relation

Definition 2

Let sdj be a tuple of SD We say that sdj is Accurate if notexistodi isin OD suchthat ((odik = sdjk)forallk = 1m and exist(odif 6= sdjf ) (sdjf 6= NULL) f = m +1 n))Definition 3

A sdj is Complete if (existodi isin OD such that (odik = sdjk)forallk = 1m) and

10 CHAPTER 1 DATA QUALITY CONCEPTS

(notexist(sdjf = NULL) f = m + 1 n)

Where we intend for ldquoNULLrdquo value a value without meaning in a well definedcontext

Let C be the set of the constraints defined on database OD in what followswe denote with cij the jth constraint on attribute i We assume here constraintson a single attribute but it is easily possible to extend the measure to complexconstraints

Definition 4

A tuple sdk is Consistent if notexistcij isin C such that cij(sdki) = false i = 1nStarting from these definitions it is possible to introduce three metrics (reportedin Table 12) that allow us to measure the lack of accuracy completeness andconsistency More specifically we define the lack of accuracy as the proportionof non accurate items (ie the amount of items modified during the sanitiza-tion) with respect to the number of items contained in the sanitized databaseSimilarly the lack of completeness is defined as the proportion of non completeitems (ie the items substituted with a NULL value during the sanitization)with respect to the total number of items contained in the sanitized databaseThe consistency lack is simply defined as the number of constraints violationin SD due to the sanitization phase

Name Short Explanation Expression

Accuracy Lack The proportion of non accurate items in SD λSD = |SDnacc||SD|

Completeness Lack The proportion of non complete items in SD ϑSD = |SDnc||SD|

Consistency Lack The number of constraint violations in SD $SD = Nc

Table 12 The three parameters of interest in PPDM DQ evaluation In theexpressions SDnacc denoted the set of not accurate items SDnc denotes the setof not complete items and NC denotes the number of constraint violations

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sani-tized database These parameters are however not sufficient to help us in theconstruction of a new PPDM algorithm based on the preservation of DQ Asexplained in the introduction of this report the main problem of actual PPDMalgorithms is that they are not aware of the relevance of the information storedin the database nor of the relationships among these information

It is necessary to provide then a formal description that allow us to magnifythe aggregate information of interest for a target database and the relevance ofDQ properties for each aggregate information (for some aggregate informationnot all the DQ properties have the same relevance) Moreover for each aggre-gate information it is necessary to provide a structure in order to describe therelevance (in term of consistency completeness and accuracy level required) atthe attribute level and the constraints the attributes must satisfy The Infor-mation Quality Model (IQM) proposed here addresses this requirement

In the following we give a formal definition of Data Model Graph (DMG)(used to represent the attributes involved in an aggregate information and theirconstraints) and Aggregation Information Schema (AIS) Moreover we give adefinition for an Aggregate information Schema Set (ASSET) Before giving thedefinition of DMG AIS and ASSET we introduce some preliminary concepts

Definition 5

An Attribute Class is defined as the tuple ATC =lt nameAWAVCWCVCSV Slink gt

where

bull Name is the attribute id

bull AW is the accuracy weight for the target attribute

bull AV is the accuracy value

bull CW is the completeness weigh for the target attribute

bull CV is the completeness value

11

12 CHAPTER 2 THE DATA QUALITY SCHEME

bull CSV is the consistency value

bull Slink is list of simple constraints

An attribute class represents an attribute of a target database schema in-volved in the construction of a certain aggregate information for which we wantto evaluate the data quality The attributes however have a different relevancein an aggregate information For this reason a set of weights is associated toevery attribute class specifying for example if for a certain attribute a lack ofaccuracy must be considered as relevant damage or not

The attributes are the bricks of our structure However a simple list of at-tributes in not sufficient In fact in order to evaluate the consistency propertywe need to take in consideration also the relationships between the attributesThe relationships can be represented by logical constraints and the validationof these constraints give us a measure of the consistency of an aggregate infor-mation As in the case of the attributes also for the constraints we need tospecify their relevance (ie a violation of a constraint cause a negligible or asevere damage) Moreover there exists constraints involving at the same timemore then two attributes In what follows the definitions of simple constraintand complex constraint are showed

Definition 6

A Simple Constraint Class is defined as the tuple SCC =lt nameConstr CWClink CSV gt

where

bull Name is the constraint id

bull Constraint describes the constraint using some logic expression

bull CW is the weigh of the constraint It represents the relevance of thisconstraint in the AIS

bull CSV is the number of violations to the constraint

bull Clink it is the list of complex constraints defined on SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC =lt nameOperator

CWCSV SCC link gt where

bull Name is the Complex Constraint id

bull Operator is the ldquoMergingrdquo operator by which the simple constraints areused to build the complex one

bull CW is the weigh of the complex constraint

bull CSV is the number of violations

13

bull SCC link is the list of all the SCC that are related to the CCC

We have now the bricks and the glue of our structure A DMG is a collectionof attributes and constraints allowing one to describe an aggregate informationMore formally let D be a database

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features

bull A set of nodes NA where each node is an Attribute Class

bull A set of nodes SCC where each node describes a Simple Constraint Class

bull A set of nodes CCC where each node describes a Complex Constraint Class

bull A set of direct edges LNjNk LNjNk isin ((NAXSCC) cup (SCCXCCC) cup(SCCXNA) cup (CCCXNA))

As is the case of attributes and constraints even at this level there exists ag-gregate information more relevant than other We define then an AIS as follows

Definition 9

An AIS φ is defined as a tuple lt γ ξ λ ϑ$WAIS gt where γ is a name ξ

is a DMG λ is the accuracy of AIS ϑ is the completeness of AIS $ is theconsistency of AIS and WAIS represent the relevance of AIS in the databaseIn a database there are usually more than an aggregate information for whichone is interested to measure the quality

Definition 9

With respect to D we define ASSET (Aggregate information Schema Set) asthe collection of all the relevant AIS of the database

The DMG completely describes the relations between the different data itemsof a given AIS and the relevance of each of these data respect to the data qualityparameter It is the ldquoroad maprdquo that is used to evaluate the quality of a sanitizedAIS

As it is probably obvious the design of a good IQM schema and more ingeneral of a good ASSET it is fundamental to characterize in a correct way thetarget database

201 An IQM Example

In order to magnify which is the real meaning of a data quality model we givein this section a brief description of a possible context of application and wethen describe an example of IQM Schema

The Example Context

Modern power grid starting today to use Internet and more generally publicnetwork in order to monitor and to manage remotely electric apparels (local

14 CHAPTER 2 THE DATA QUALITY SCHEME

Constraint13

Data Attribute13

Accuracy13

Completeness13

Operator13

Constraint13Weight13

DMG13

Figure 21 Example of DMG

energy stations power grid segments etc) [62] In such a context there exist alot of problems related to system security and to the privacy preservation

More in details we take as example a very actual privacy problem causedby the new generation of home electricity meters In fact such a type of newmeters are able to collect a big number of information related to the homeenergy consumption Moreover they are able to send the collected informationto a central remote database The information stored in such database can beeasily classified as sensitive

For example the electricity meters can register the energy consumption perhour and per day in a target house From the analysis of this information itis easy to retrieve some behavior of the people living in this house (ie fromthe 9 am to 1 pm the house is empty from the 1 pm to 230 pm some-one is at home from the 8 pm the whole family is at home etc) Thistype of information is obviously sensitive and must be protected On the otherhand information like the maximum energy consumption per family the av-erage consumption per day the subnetwork of pertinence and so on have anhigh relevance for the analysts who want to correctly analyze the power gridto calculate for example the amount of power needed in order to guarantee anacceptable service over a certain energy subnetwork

This is a good example of a situation in which the application of a privacypreserving techniques without knowledge about the meaning and the relevance

15

13

ASSET13

AIS13

AIS13AIS13

AIS13

Accuracy13Completeness13

Consistency13

Figure 22 Example ASSET to every AIS contained in the Asset the dimensionsof Accuracy Completeness and Consistency are associated Moreover to everyAIS the proper DMG is associated To every attribute and to every constraintcontained in the DMG the three local DQ parameters are associated

of the data contained in the database could cause a relevant damage (ie wrongpower consumption estimation and consequently blackout)

An IQM Example

The electric meters database contains several information related to energy con-sumption average electric frequency fault events etc It is not our purpose todescribe here the whole database In order to give an example of IQM schemawe describe two tables of this database and two DMG of a possible IQM schema

In our example we consider the tables related to the localization informationof the electric meter and the information related to the per day consumptionregistered by a target electric meter

16 CHAPTER 2 THE DATA QUALITY SCHEME

More in details the two table we consider have the following attributesLocalization Table

bull Energy Meter ID it contains the ID of the energy meter

bull Local Code it is the code of the cityregion in which the meter is

bull City it contains the name of the city in which the home hosting the meteris

bull Network it is the code of the electric network

bull Subnetwork it is the code of the electric subnetwork

Per Day Data consumption table

bull Energy Meter ID it contains the ID of the energy meter

bull Average Energy Consumption it contains the average energy used bythe target home during a day

bull Date It contains the date of registration

bull Energy consumption per hour it is a vector of attributes containing theaverage energy consumption per hour

bull Max Energy consumption per hour it is a vector of attributes contain-ing the maximum energy consumption per hour

bull Max Energy Consumption it contains the maximum energy consump-tion per day measured for a certain house

As we have already underlined from this information it is possible to guessseveral sensitive information related to the behavior of the people living in atarget house controlled by an electric meter For this reason it is necessary toperform a sanitization process in order to guarantee a minimum level of privacy

However a sanitization without knowledge about the relevance and themeaning of data can be extremely dangerous In figure 23 two DMG schemasassociated to the data contained in the previously described tables are showedIn what follows we give a brief description about the two schema

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table Asit is possible to see in figure 23 there exist some constraints associated with thedifferent items contained in this table For example we require the local code bein a range of values between 1 and 1000 (every different code is meaningless andcannot be interpreted moreover a value out of this range reveals to malicioususer that the target tuple is sanitized) The weight of this constraint is thenhigh (near to 1)

17

Value131-100013

Local Code13

City13

Subnetwork code13

Value 0-5013

They13correspond13to the same13

city13

The subnet is13contained in the13correct network13

Network13Number13

Average13energy13

consumption13

Average energy13used equal to13

Energy Counter13Id13

En131_am13

En132_am13

En133_am13

En1310_pm13

En1311_pm13

En1312_pm13

hellip13

Max energy13consumption13

Energy Counter13Id13

Max equal to13

A13I13S13

L13o13c13A13I13S13 13

C13o13n13s13

ASSET13A13I13S13 A13I13S13

A13 I13S13

DMG13DMG13

Max131_am13

Max132_am13

Max133_am13

Max1310_pm13

Max1311_pm13

Max1312_pm13

hellip13

Date13

Figure 23 Example of two DMG schemas associated to a National ElectricPower Grid database

Moreover due to the relevance of the information about the regional local-ization of the electric meter even the weights associated with the completenessand the accuracy of the local code have an high value

A similar constraint exists even over the subnetwork code With regard tothe completeness and the accuracy of this item a too high weight may compro-mise the ability to hide sensitive information In other words considering thethree main localization items local code network number subnetwork numberit is reasonable to think about a decreasing level of accuracy and completenessIn this way we try to guarantee the preservation of information about the areaof pertinence avoiding the exact localization of the electric meter (ie ldquoI knowthat it is in a certain city but I do not know the quarter in which it isrdquo)

18 CHAPTER 2 THE DATA QUALITY SCHEME

AIS Cons

In figure 23 is showed a DMG (AIS Cons) related to the energy consump-tion table In this case the value of the information stored in this table ismainly related to the maximum energy consumption and the average energyconsumption For this reason their accuracy and completeness weights mustbe high (to magnify their relevance in the sanitization phase) However itis even necessary to preserve the coherence between the values stored in theEnergy consumption per hour vector and the average energy consumption perday In order to do this a complex constraint is used In this way we describethe possibility to permute or to modify the values of energy consumption perhour (in the figure En 1 am En 2 am etc) maintaining however the averageenergy consumption consistent

At the same way it is possible to use another complex constraint in orderto express the same concept for the maximum consumption per day in relationwith the maximum consumption per hour Exist finally a relation between themaximum energy consumption per hour and the correspondent average con-sumption per hour In fact the max consumption per hour cannot be smallerthan the average consumption per hour This concept is represented by a setof simple constraints Finally the weight of accuracy and completeness of theEnergy Meter ID are equal to 1 because a condition needed to make consistentthe database is to maintain at least the individuality of the electric meters

By the use of these two schemas we described some characteristics of theinformation stored in the database related with their meaning and their usageIn the following chapters we show some PPDM algorithms realized in order touse these additional schema in order to perform a softer (from the DQ point ofview) sanitization

Chapter 3

Data Quality BasedAlgorithms

A large number of Privacy Preserving Data Mining algorithms have been pro-posed The goal of these algorithms is to prevent privacy breaches arising fromthe use of a particular Data Mining Technique Therefore algorithms existdesigned to limit the malicious use of Clustering Algorithms of ClassificationAlgorithms of Association Rule algorithms

In this report we focus on Association Rule Algorithms We choose to studya new solution for this family of algorithms for the following motivations

bull Association Rule Data Mining techniques has been investigated for about20 years For this reason exists a large number of DM algorithms andapplication examples exist that we take as a starting point to study howto contrast the malicious use of this technique

bull Association Rule Data Mining can be applied in a very intuitive way tocategoric databases A categoric database can be easily synthesized bythe use of automatic tools allowing us to quickly obtain for our tests a bigrange of different type of databases with different characteristics

bull Association Rules seem to be the most appropriate to be used in combi-nation with the IQM schema

bull Results of classification data mining algorithms can be easily convertedinto association rules This it is in prospective very important because itallows us to transfer the experience in association rule PPDM family tothe classification PPDM family without much effort

In what follows we present before some algorithms developed in the context ofCodmine project [97] (we remember here that our contribution in the CodmineProject was related to the algorithm evaluation and to the testing frameworkand not with the Algorithm development) We present then some considerations

19

20 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

about these algorithms and finally we present our new approach to associationrule PPDM algorithms

31 ConMine Algorithms

We present here a short description of the privacy preserving data mining algo-rithms which have been developed in the context of Codmine project In theassociation rule hiding context two different heuristic-based approaches havebeen proposed one is based on data deletion according to which the originaldatabase is modified by deleting the data values the other is based on datafuzzification thus hiding the sensitive information by inserting unknown valuesto the database Before giving more details concerning the proposed approacheswe briefly recall the problem formulation of association rule hiding

32 Problem Formulation

Let D be a transactional database and I = i1 in be a set of items whichrepresents the set of products that can be sold by the database owner Eachtransaction can be considered as subset of items in I We assume that the itemsin a transaction or an itemset are sorted in lexicographic order Accordingto a bitmap notation each transaction t in the database D is represented asa triple lt TID values of items size gt where TID is the identifier of thetransaction t values of items is a list of values one value for each item in Iassociated with transaction t An item is supported by a transaction t if itsvalue in the values of items is 1 and it is not supported by t if its value inthe values of items is 0 Size is the number of 1 values which appear in thevalues of items that is the number of items supported by the transaction Anassociation rule is an implication of the form X rArr Y between disjoint itemsets X

and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y, once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to rules that are both frequent and strong. If a frequent and strong rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule is reduced below min_supp and min_conf, correspondingly.
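To make these definitions concrete, the following minimal Python sketch (ours, not part of the Codmine material; the bitmap representation is simplified to sets of supported items) computes the support and confidence of a rule X ⇒ Y:

    # Transactions as sets of supported items (a simplified bitmap view).
    D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C", "D"}, {"B", "C"}]

    def support(D, itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(1 for t in D if itemset <= t) / len(D)

    def confidence(D, X, Y):
        # conf(X => Y) = supp(X ∪ Y) / supp(X)
        return support(D, X | Y) / support(D, X)

    print(support(D, {"A", "B"}))       # 0.5
    print(confidence(D, {"A"}, {"B"}))  # 0.666...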


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold (MST or MCT), which is fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is achieved by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

Conf(X ⇒ Y) = Supp(X ⇒ Y)/Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new item alphabet {0, 1, ?} is associated with each transaction, one symbol for each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level. The minimum support of an itemset is defined as the


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T'lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'lr do
2.    t.num_items = |I| - |lr ∩ t.values_of_items|
3.  Sort(T'lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.    t = T'lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    Nlr = Nlr + 1
7.    conf(r) = Nr / Nlr
8.    Remove t from T'lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease
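As an illustration, the following Python sketch reproduces the logic of Algorithm 1 under a simplified representation (transactions as mutable sets of supported items; all names are ours): it raises the antecedent support through transactions that partially support the antecedent and do not support the consequent, until the confidence falls below min_conf.

    def conf(D, ant, cons):
        n_ant = sum(1 for t in D if ant <= t)
        n_rule = sum(1 for t in D if (ant | cons) <= t)
        return n_rule / n_ant if n_ant else 0.0

    def hide_by_confidence_decrease(D, ant, cons, min_conf):
        # Candidates: partially (not fully) support the antecedent and do not
        # support the consequent, so setting their antecedent items to 1
        # raises Nlr without touching Nr.
        cand = [t for t in D if (t & ant) and not ant <= t and not cons <= t]
        # As in Algorithm 1: transactions with more antecedent items first.
        cand.sort(key=lambda t: len(t & ant), reverse=True)
        while conf(D, ant, cons) >= min_conf and cand:
            cand.pop(0).update(ant)  # set_all_ones on the antecedent items
        return D

    D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A"}, {"B"}, {"C"}]
    hide_by_confidence_decrease(D, {"A", "B"}, {"C"}, min_conf=0.8)
    print(conf(D, {"A", "B"}, {"C"}))  # now 0.666..., below 0.8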


INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Algorithm 2:

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr - 1
8.    supp(r) = Nr / |D|
9.    conf(r) = Nr / Nlr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3 (same input and output):

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr - 1
8.    Recompute Nlr
9.    supp(r) = Nr / |D|
10.   conf(r) = Nr / Nlr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) * 100) / maxsup(lr) and maxconf(r) = (maxsup(r) * 100) / minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).


INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Algorithm 4:

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh {
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp) {
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
7.      if (t ∈ TX) delete(t, TX)
8.    delete(a, t, D)
    }
}
End

Algorithm 5 (same input and output):

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2.  i = 0
3.  j = 0
4.  <TZ> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp) {
6.    a = <Z>[i]
7.    t = <TZ>[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
9.      if (t ∈ TX) then delete(t, TX)
10.   delete(a, t, D)
11.   i = (i + 1) modulo size(Z)
12.   j = j + 1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and the minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, which reduces the support of the generating large itemset of a sensitive rule, replaces 1 values with "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of the items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST - SM) {
4.    Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction
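The following sketch is our simplification of the Algorithm 6 idea: it replaces 1 values with '?' in the shortest transactions supporting a sensitive itemset until its minimum support falls below MST − SM. The choice of the item to mark is simplified to a fixed item of the itemset, whereas Algorithm 6 picks the one with the largest minimum support.

    def minsup(D, itemset):
        return sum(1 for t in D if all(t.get(i) == 1 for i in itemset)) / len(D)

    def fuzzify_itemset(D, itemset, MST, SM):
        # Transactions fully supporting the itemset, smallest first.
        TZ = sorted((t for t in D if all(t.get(i) == 1 for i in itemset)),
                    key=lambda t: sum(1 for v in t.values() if v == 1))
        target = sorted(itemset)[0]  # simplified item choice (see lead-in)
        while minsup(D, itemset) >= MST - SM and TZ:
            TZ.pop(0)[target] = "?"  # hide one occurrence
        return D

    D = [{"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 0}]
    fuzzify_itemset(D, {"A", "B"}, MST=0.5, SM=0.1)
    print(minsup(D, {"A", "B"}))  # 0.333..., below MST - SM = 0.4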

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem: during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). In a critical database, this behavior could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient. In


Algorithm 7:

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM) {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a "?" mark in the place of j in t
7.    Recompute the minsup(r)
8.    Recompute the minconf(r)
9.    Recompute the minconf of the other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8:

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T'lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T'lr count the number of items of lr in t
3.  sort the transactions in T'lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT - SM) {
4.    Choose the first transaction t ∈ T'lr
5.    Place a "?" mark in t for the items in lr that are not supported by t
6.    Recompute the maxsup(lr)
7.    Recompute the minconf(r)
8.    Recompute the minconf of the other affected rules
9.    Remove t from T'lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, without knowing anything about the use of the database, the relationships between the data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance. In the context of association rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the Asset schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database Asset. We call this type of items Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because they were judged, in the Asset description phase, not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection and we use one of these Zero Impact Items in order to hide the target rule.


3.7 Data Quality Distortion Based Algorithm

In the distortion based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming to work with categorical databases like the classical market basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well-known APRIORI algorithm [5] that, given a database, extracts all the contained rules that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set, a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case, it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm, evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set; if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

¹A market basket database is a database in which each transaction represents the products acquired by a customer.

²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Flow chart of the DQDB Algorithm: (1) a sensitive rule is identified; (2) the set of zero-impact items is identified; (3) if the set is not empty, an item is selected from it, otherwise each item in the selected rule is inspected in the Asset, its DQ impact (accuracy, consistency) is evaluated, the items are ranked by quality impact and the best one is selected; (4) a transaction fully supporting the sensitive rule is selected and the item value is converted from 1 to 0; (5) if the rule is hidden the process ends, otherwise it repeats from step (4).]

Figure 3.6: Flow Chart of the DQDB Algorithm


The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness, we try to give an idea of its complexity:

• The search for all zero-impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero-impact items discovered

1   Begin
2     Zitem = ∅
3     Rsup = Rh
4     Isup = list of the items contained in Rsup
5     While (Isup ≠ ∅) {
6       res = 0
7       Select item from Isup
8       res = Search(Asset, item)
9       if (res == 0) then Zitem = Zitem + item
10      Isup = Isup - item
11    }
12    return(Zitem)
13  End

Figure 3.7: Zero Impact Algorithm
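A minimal Python rendering of this check, assuming (our representation) that the Asset is available as the set of item names appearing in at least one DMG:

    def zero_impact_items(asset_items, rule_items):
        # An item has zero impact if it appears in no DMG of the Asset,
        # i.e. it takes part in no relevant aggregate information.
        return {i for i in rule_items if i not in asset_items}

    asset_items = {"diagnosis", "therapy"}
    rule_items = {"diagnosis", "age", "postcode"}
    print(zero_impact_items(asset_items, rule_items))  # {'age', 'postcode'}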

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Confp = Confa
5     Nstep_if_ant = 0
6     Nstep_if_post = 0
7     While (Confa > Th) do {
8       Nstep_if_ant++
9       Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
10    }
11    While (Confp > Th) do {
12      Nstep_if_post++
13      Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
14    }
15    For each item in Rh do {
16      if (item ∈ ant(Rh)) then N = Nstep_if_ant
17      else N = Nstep_if_post
18      For each AIS ∈ Asset do {
19        node = recover_item(AIS, item)
20        Accur_Cost = node.accuracy_Weight * N
21        Constr_Cost = Constr_surf(N, node)
22        item.impact = item.impact + (Accur_Cost * AIS.Accur_weight)
                                    + (Constr_Cost * AIS.Constr_weight)
23      }
24    }
25    sort_by_impact(Items)
26    Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)
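The numeric core of this ranking can be illustrated by the following sketch (our names; the weights would come from the Asset schema). It counts how many transaction modifications are needed to push the confidence below the threshold, distinguishing the antecedent case, where each step removes support from both the rule and its antecedent, from the consequent case, where only the rule support decreases:

    def steps_if_consequent(sup_r, sup_ant, th):
        # Zeroing a consequent item in a transaction fully supporting the
        # rule lowers the rule support only: conf = (sup_r - N) / sup_ant.
        n = 0
        while (sup_r - n) / sup_ant > th:
            n += 1
        return n

    def steps_if_antecedent(sup_r, sup_ant, th):
        # Zeroing an antecedent item lowers both the rule support and the
        # antecedent support: conf = (sup_r - N) / (sup_ant - N).
        # Assumes sup_r <= sup_ant, so the ratio decreases at each step.
        n = 0
        while (sup_r - n) / (sup_ant - n) > th:
            n += 1
        return n

    print(steps_if_consequent(40, 50, 0.6))  # 10
    print(steps_if_antecedent(40, 50, 0.6))  # 25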


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset, we can use the MergeSort algorithm, and then the complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1   Begin
2     Zitem = DQDB_Zero_Impact(Asset, Rh)
3     if (Zitem ≠ ∅) then item = random_sel(Zitem)
4     else {
5       Impact_set = Item_Impact_Rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6       item = best(Impact_set)
7     }
8     Tr = {t ∈ D | t fully supports Rh}
9     While (Rh.conf > Threshold) do
10      sanitize(Tr, item)
11  End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way, we introduce an uncertainty related to the question "How should this value be interpreted?" The final effect is then the same as in the distortion based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the data quality. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the evaluation of the number of steps (which is of course different) and to the data quality evaluation. In fact, in this case it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (completeness, consistency) and not the pair (accuracy, consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Flow chart of the BBDB Algorithm: (1) a sensitive rule is identified; (2) the set of zero-impact items is identified; (3) if the set is not empty, an item is selected from it, otherwise each item in the selected rule is inspected in the Asset, its DQ impact (completeness, consistency) is evaluated, the items are ranked by quality impact and the best one is selected; (4) a transaction fully supporting the sensitive rule is selected and the item value is converted to "?"; (5) if the rule is hidden the process ends, otherwise it repeats from step (4).]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5     Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6     For each item in Rh do {
7       if (item ∈ ant(Rh)) then N = Nstep_ant
8       else N = Nstep_post
9       For each AIS ∈ Asset do {
10        node = recover_item(AIS, item)
11        Complet_Cost = node.Complet_Weight * N
12        Constr_Cost = Constr_surf(N, node)
13        item.impact = item.impact + (Complet_Cost * AIS.Complet_weight)
                                    + (Constr_Cost * AIS.Constr_weight)
14      }
15    }
16    sort_by_impact(Items)
17    Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, we propose here an improvement of our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms and is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. Given the sensitive set, we identify the global set of Zero Impact Items.

3. We order the Zero Impact set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized and removed from the sensitive set when they result hidden.

6. If the Zero Impact set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase;

• Multi Zero Impact Hiding;

• Multi Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1   Begin
2     Rsup = Rs
3     while (Rsup ≠ ∅) do {
4       select r from Rsup
5       for (i = 0; i ≤ transaction.maxsize; i++) do
6         if (item[i] ∈ r) then {
7           item[i].count++
8           item[i].list = item[i].list ∪ {r}
9         }
10      Rsup = Rsup - r
11    }
12    sort(item)
13    Return(item)
    End

Figure 3.12: The Item Occurrences Classification Algorithm
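An equivalent of this classification in a few lines of Python (our representation: rules as frozensets of items; all names are ours):

    from collections import defaultdict

    def classify_items(sensitive_rules):
        count = defaultdict(int)       # item -> number of rules containing it
        supported = defaultdict(list)  # item -> the rules themselves
        for r in sensitive_rules:
            for item in r:
                count[item] += 1
                supported[item].append(r)
        ranked = sorted(count, key=count.get, reverse=True)  # most rules first
        return ranked, supported

    rules = [frozenset({"A", "B"}), frozenset({"B", "C"}), frozenset({"B", "D"})]
    ranked, supported = classify_items(rules)
    print(ranked[0], len(supported[ranked[0]]))  # B 3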

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1   Begin
2     Rsup = Rs
3     Isup = list of all the items
4     Zitem = ∅
5     while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6       select r from Rsup
7       for each (item ∈ r) do
8         if (item ∈ Isup) then {
9           if (item ∉ Asset) then Zitem = Zitem + item
10          Isup = Isup - item
11        }
12      Rsup = Rsup - r
13    }
14    return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| * |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches, in a burst, for the Zero Impact Items of all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule in the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case, the first result rarely happens. For this reason, we present in what follows the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1   Begin
2     Izm = {item ∈ Itemlist | item ∈ Zitem}
3     for (i = 0; i < |Izm|; i++) do {
4       Rss = {r ∈ Rs | Izm[i] ∈ r}
5       Tss = {t ∈ D | t fully supports a rule ∈ Rss}
6       while (Tss ≠ ∅) and (Rss ≠ ∅) do {
7         select a transaction t from Tss
8         Tss = Tss - t
9         Sanitize(t, Izm[i])
10        Recompute_Sup_Conf(Rss)
11        forall r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
12          Rss = Rss - r
13          Rs = Rs - r
14        }
15      }
16    }
    End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
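A condensed Python rendering of the control flow of Figure 3.14 (our representation: transactions as mutable item sets, rules as (antecedent, consequent) pairs, distortion as the sanitization primitive; the support check is reduced to a confidence test for brevity):

    def conf(D, ant, cons):
        n_ant = sum(1 for t in D if ant <= t)
        n_rule = sum(1 for t in D if (ant | cons) <= t)
        return n_rule / n_ant if n_ant else 0.0

    def multi_rule_zero_impact_hiding(D, rules, zero_items, min_conf):
        rs = list(rules)
        for z in zero_items:
            rss = [r for r in rs if z in r[0] | r[1]]
            tss = [t for t in D if any((a | c) <= t for a, c in rss)]
            while tss and rss:
                t = tss.pop(0)
                t.discard(z)  # distortion step: set the zero-impact item to 0
                hidden = [r for r in rss if conf(D, *r) < min_conf]
                rss = [r for r in rss if r not in hidden]
                rs = [r for r in rs if r not in hidden]
        return rs  # rules still to hide (empty in the ideal case)

    D = [{"A", "B", "Z"}, {"A", "B", "Z"}, {"A"}]
    left = multi_rule_zero_impact_hiding(D, [({"A"}, {"Z"})], ["Z"], min_conf=0.5)
    print(left, D)  # [] and the database with "Z" removed from one transaction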

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1   Begin
2     Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3     for each item ∈ Iml do {
4       for each rule ∈ item.list do {
5         N = compute_step(item, thresholds)
6         For each AIS ∈ Asset do {
7           node = recover_item(AIS, item)
8           item.list.rule.cost = item.list.rule.cost + ComputeCost(node, N)
9         }
10      }
11      item.maxcost = maxcost(item)
12    }
13    Sort Iml by max cost and number of rules supported
14    While (Rs ≠ ∅) do {
15      Rss = {r ∈ Rs | Iml[1] ∈ r}
16      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
17      while (Tss ≠ ∅) and (Rss ≠ ∅) do {
18        select a transaction t from Tss
19        Tss = Tss - t
20        Sanitize(t, Iml[1])
21        Recompute_Sup_Conf(Rss)
22        forall r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
23          Rss = Rss - r
24          Rs = Rs - r
25        }
26      }
27      Iml = Iml - Iml[1]
28    }
    End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1   Begin
2     Item = Item_Occ_Class(Rs, Itemlist)
3     Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4     if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
5     if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
    End

Figure 3.16: The General Algorithm
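Putting the pieces together, the general strategy reduces to a short driver. The sketch below composes the classify_items and multi_rule_zero_impact_hiding functions given earlier in this chapter; low_impact_hiding stands for the procedure of Figure 3.15 and is not reproduced here. All names are ours:

    def general_sanitization(D, rs, asset_items, min_conf):
        # rs: list of (antecedent, consequent) pairs to hide.
        # 1. Rank items by the number of sensitive rules they support (Fig. 3.12).
        ranked, _ = classify_items([a | c for a, c in rs])
        # 2. Global Zero Impact set, kept in rank order (Fig. 3.13).
        zitem = [i for i in ranked if i not in asset_items]
        # 3. Hide what can be hidden at zero DQ cost (Fig. 3.14).
        rs = multi_rule_zero_impact_hiding(D, rs, zitem, min_conf)
        # 4. Remaining rules go through the low-impact ranking (Fig. 3.15,
        #    not reproduced here).
        if rs:
            low_impact_hiding(D, rs, asset_items, min_conf)
        return D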

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have then proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.


Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955


[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988


[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997


[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October), 1948, pp. 379-423 and 623-656.

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155, Springer Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN, Joint Research Centre, Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008. 56 pp. EUR, Scientific and Technical Research series, ISSN 1018-5593.

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.





Introduction

Different approaches have been introduced by researchers in the field of privacy preserving data mining. These approaches generally have the advantage of requiring a minimal amount of input (usually the database, the information to protect and a few other parameters), so that little effort is required of the user in order to apply them. However, some tests performed by Bertino, Nai and Parasiliti [10, 97] show that problems actually occur when applying PPDM algorithms in contexts in which the meaning of the sanitized data has a critical relevance. More specifically, these tests (see Appendix A for more details) show that the sanitization often introduces false information or completely changes the meaning of the information. These phenomena are due to the fact that current approaches to PPDM algorithms do not take into account two relevant aspects:

• Relevance of data: not all the information stored in the database has the same level of relevance, and not all the information can be dealt with in the same way. An example may clarify this concept. Consider a categoric database¹ and assume we wish to hide the rule AB → CD. A privacy-preserving association mining algorithm would retrieve all transactions supporting this rule and then change some items (e.g., item C) in these transactions in order to change the confidence of the rule. Clearly, if the item C has low relevance, the sanitization will not affect the quality of the data much. If, however, item C represents some critical information, the sanitization can result in severe damage to the data.

• Structure of the database: information stored in a database is strongly influenced by the relationships between the different data items. These relationships are not always explicit. Consider as an example a Health Care database storing patients' records. The medicines taken by patients are part of this database. Consider two different medicines, say A and B, and assume that the following two constraints are defined: "The maximum amount of medicine A per day must be 10ml. If the patient also takes medicine B, the max amount of A must be 5ml." If a PPDM algorithm modifies the data concerning the amount of medicine A without taking into account these constraints, the effects on the health of the patient could be potentially catastrophic.

¹We define a categoric database as a database in which attributes have a finite (but eventually large) number of distinct values, with no ordering among the values. An example is the Market Basket database.

Analyzing these observations, we argue that the general problem of current PPDM algorithms is related to Data Quality. In [70], a well known work in the data quality context, Orr notes that data quality and data privacy are often related in some way. More precisely, a high quality of data often corresponds to a low level of privacy. It is not simple to explain this correlation. In this introduction we try to give an intuitive explanation of this fact, in order to convey our intuition.

Any "sanitization" of a target database has an effect on the data quality of the database. It is natural to deduce that, in order to preserve the quality of the database, it is necessary to take into consideration some function Π(DB) representing the relevance and the use of the data contained in the DB. Taking another small step in the discussion, it is evident that the function Π(DB) has a different output for different databases². This implies that a PPDM algorithm trying to preserve the data quality needs to know specific information about the particular database to be protected before executing the sanitization. We believe, then, that data quality is particularly important for the case of critical databases, that is, databases containing data that can be critical for the life of society (e.g., medical databases, electrical control system databases, military databases, etc.). For this reason, data quality must be considered as the central point of every type of data sanitization. In what follows we present a more detailed definition of DQ, we introduce a schema able to represent the Π function, and we present two algorithms based on the DQ concept.

²We use here the term "different" in its wide sense, i.e., different meaning of data, different use of data, different database architecture and schema, etc.

Chapter 1

Data Quality Concepts

Traditionally, DQ is a measure of the consistency between the data views presented by an information system and the same data in the real world [70]. This definition is strongly related to the classical definition of an information system as a "model of a finite subset of the real world" [56]. More in detail, Levitin and Redman [59] claim that DQ is the instrument by which it is possible to evaluate whether data models are well defined and data values accurate. The main problem with DQ is that its evaluation is relative [94], in that it usually depends on the context in which data are used. DQ can be assumed as a complex characteristic of the data itself. In the scientific literature, DQ is considered a multi-dimensional concept that in some environments involves both objective and subjective parameters [8, 104, 105]. Table 1.1 lists the most commonly used DQ parameters [104].

Accuracy     | Format           | Comparability     | Precision         | Clarity
Reliability  | Interpretability | Conciseness       | Flexibility       | Usefulness
Timeliness   | Content          | Freedom from bias | Understandability | Quantitativeness
Relevance    | Efficiency       | Informativeness   | Currency          | Sufficiency
Completeness | Importance       | Level of Detail   | Consistency       | Usableness

Table 1.1: Data Quality parameters

In the context of PPDM, DQ has a similar meaning, with however an important difference. The difference is that in the case of PPDM the real world is the original database, and we want to measure how close the sanitized database is to the original one with respect to some relevant properties. In other words, we are interested in assessing whether, given a target database, the sanitization phase will compromise the quality of the mining results that can be obtained from the sanitized database. In order to understand better how to measure DQ, we introduce a formalization of the sanitization phase. In a PPDM process, a given database DB is modified in order to obtain a new database DB′.



Definition 1

Let DB be the database to be sanitized. Let $\Gamma_{DB}$ be the set of all the aggregate information contained in DB, and let $\Xi_{DB} \subseteq \Gamma_{DB}$ be the set of the sensitive information to hide. A transformation $\xi : D \rightarrow D$, where $D$ is the set of possible instances of a DB schema, is a perfect sanitization if $\Gamma_{\xi(DB)} = \Gamma_{DB} - \Xi_{DB}$.

In other words, the ideal sanitization is the one that completely removes the sensitive high level information, while preserving at the same time the other information. This would be possible in the case in which constraints and relationships between the data, and between the pieces of information contained in the database, do not exist; or, roughly speaking, assuming the information to be a simple aggregation of data, if the intersection between the different pieces of information is always empty. However, this hypothetical scenario, due to the complexity of modern databases, is not possible. In fact, as explained in the introduction of this section, we must take into account not only the high level information contained in the database, but also the relationships between different pieces of information or different data, that is, the Π function we mentioned in the previous section. In order to identify the possible parameters that can be used to assess DQ in the context of PPDM algorithms, we performed a set of tests. More in detail, we performed the following operations:

1. We built a set of different types of databases, with different data structures and containing information completely different from the meaning and usefulness point of view.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damage (or loss of quality) to the informative asset of the database:

• Ambiguous transformation: this is the case in which the transformation introduces some uncertainty in the non-sensitive information, which can then be misinterpreted. It is, for example, the case of aggregation algorithms and perturbation algorithms. It can even be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation: the sanitized database results incomplete. More specifically, some values of the attributes contained in the database are marked as "blank". For this reason, information may be incomplete and cannot be used. This is typically the effect of the blocking PPDM algorithm class.

• Meaningless transformation: in this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit constraints violation: every database is characterized by some implicit constraints derived from the external world, which are not directly represented in the database in terms of structure and relations (the example of medicines A and B presented in the previous section is one such case). Transformations that do not consider these constraints risk compromising the database, introducing inconsistencies.

The results of these experiments highlight the fact that, in the context of privacy preserving data mining, only a small portion of the parameters shown in Table 1.1 are of interest. More in detail, the relevant parameters are those allowing us to capture a possible variation of meaning, or those indirectly related to the meaning of the information stored in a target database. Therefore, we identify as the most appropriate DQ parameters for PPDM algorithms the following dimensions:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. In an informal way, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. Therefore, a tuple t is complete if, after the sanitization, every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of these parameters for use in the remainder of the discussion.

Let OD be the original database and SD be the sanitized database resulting from the application of the PPDM algorithm. Without losing generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let $od_i$ ($sd_i$) be the i-th tuple in OD (SD); then $od_{ik}$ ($sd_{ik}$) denotes the k-th attribute of $od_i$ ($sd_i$). Moreover, let n be the total number of attributes of interest; we assume that the attributes in positions 1..m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let $sd_j$ be a tuple of SD. We say that $sd_j$ is Accurate if $\nexists\, od_i \in OD$ such that $((od_{ik} = sd_{jk})\ \forall k = 1..m) \wedge (\exists f \in \{m+1,\dots,n\} : (od_{if} \neq sd_{jf}) \wedge (sd_{jf} \neq \mathrm{NULL}))$.

Definition 3

A tuple $sd_j$ is Complete if $(\exists\, od_i \in OD$ such that $(od_{ik} = sd_{jk})\ \forall k = 1..m) \wedge (\nexists f \in \{m+1,\dots,n\} : sd_{jf} = \mathrm{NULL})$.

By "NULL" value we mean here a value without meaning in a well defined context.

Let C be the set of the constraints defined on database OD; in what follows we denote with $c_{ij}$ the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easily possible to extend the measure to complex constraints.

Definition 4

A tuple $sd_k$ is Consistent if $\nexists\, c_{ij} \in C$ such that $c_{ij}(sd_{ki}) = \mathit{false},\ i = 1..n$.

Starting from these definitions, it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non-accurate items (i.e., the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non-complete items (i.e., the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name              | Short Explanation                           | Expression
Accuracy Lack     | The proportion of non-accurate items in SD  | $\lambda_{SD} = |SD_{nacc}|/|SD|$
Completeness Lack | The proportion of non-complete items in SD  | $\vartheta_{SD} = |SD_{nc}|/|SD|$
Consistency Lack  | The number of constraint violations in SD   | $\varpi_{SD} = N_C$

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, $SD_{nacc}$ denotes the set of non-accurate items, $SD_{nc}$ denotes the set of non-complete items and $N_C$ denotes the number of constraint violations.

Chapter 2

The Data Quality Scheme

In the previous section we presented a way to measure DQ in the sanitized database. These parameters, however, are not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are aware neither of the relevance of the information stored in the database nor of the relationships among these pieces of information.

It is then necessary to provide a formal description that allows us to single out the aggregate information of interest for a target database and the relevance of the DQ properties for each piece of aggregate information (for some aggregate information, not all the DQ properties have the same relevance). Moreover, for each piece of aggregate information it is necessary to provide a structure describing the relevance (in terms of the required consistency, completeness and accuracy levels) at the attribute level, as well as the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give formal definitions of the Data Model Graph (DMG), used to represent the attributes involved in an aggregate information and their constraints, and of the Aggregate Information Schema (AIS). Moreover, we give a definition of the Aggregate Information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple $ATC = \langle name, AW, AV, CW, CV, CSV, Slink \rangle$, where:

• Name is the attribute id;
• AW is the accuracy weight for the target attribute;
• AV is the accuracy value;
• CW is the completeness weight for the target attribute;
• CV is the completeness value;
• CSV is the consistency value;
• Slink is the list of simple constraints.

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have different relevance in an aggregate information. For this reason, a set of weights is associated with every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered as relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. These relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, for the constraints too we need to specify their relevance (i.e., whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows, the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple $SCC = \langle name, Constraint, CW, Clink, CSV \rangle$, where:

• Name is the constraint id;
• Constraint describes the constraint using some logic expression;
• CW is the weight of the constraint; it represents the relevance of this constraint in the AIS;
• CSV is the number of violations of the constraint;
• Clink is the list of complex constraints defined on SCC.

Definition 7

A Complex Constraint Class is defined as the tuple $CCC = \langle name, Operator, CW, CSV, SCClink \rangle$, where:

• Name is the complex constraint id;
• Operator is the "merging" operator by which the simple constraints are used to build the complex one;
• CW is the weight of the complex constraint;
• CSV is the number of violations;
• SCClink is the list of all the SCCs that are related to the CCC.

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• a set of nodes NA, where each node is an Attribute Class;
• a set of nodes SCC, where each node describes a Simple Constraint Class;
• a set of nodes CCC, where each node describes a Complex Constraint Class;
• a set of direct edges $L_{N_j,N_k}$, with $L_{N_j,N_k} \in (NA \times SCC) \cup (SCC \times CCC) \cup (SCC \times NA) \cup (CCC \times NA)$.

As in the case of attributes and constraints, at this level too some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple $\langle \gamma, \xi, \lambda, \vartheta, \varpi, W_{AIS} \rangle$, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS, and $W_{AIS}$ represents the relevance of the AIS in the database.

In a database there is usually more than one aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define the ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data items with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.
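As a concrete reading of Definitions 5-10, the following Python sketch encodes the IQM structures as data classes. The field names follow the text; the types, defaults and everything else are illustrative assumptions, not part of the report.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttributeClass:            # Definition 5
    name: str
    aw: float                    # accuracy weight
    av: float = 0.0              # accuracy value
    cw: float = 0.0              # completeness weight
    cv: float = 0.0              # completeness value
    csv: int = 0                 # consistency value
    slink: List["SimpleConstraintClass"] = field(default_factory=list)

@dataclass
class SimpleConstraintClass:     # Definition 6
    name: str
    constraint: Callable[..., bool]   # the logic expression
    cw: float                         # constraint weight
    csv: int = 0                      # number of violations
    clink: List["ComplexConstraintClass"] = field(default_factory=list)

@dataclass
class ComplexConstraintClass:    # Definition 7
    name: str
    operator: str                # "merging" operator over simple constraints
    cw: float
    csv: int = 0
    scc_link: List[SimpleConstraintClass] = field(default_factory=list)

@dataclass
class DMG:                       # Definition 8: the edges live in the *link lists
    attributes: List[AttributeClass]
    simple_constraints: List[SimpleConstraintClass]
    complex_constraints: List[ComplexConstraintClass]

@dataclass
class AIS:                       # Definition 9
    name: str
    dmg: DMG
    accuracy: float = 0.0
    completeness: float = 0.0
    consistency: float = 0.0
    weight: float = 1.0          # relevance of the AIS in the database

ASSET = List[AIS]                # Definition 10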

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible application context, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, in order to remotely monitor and manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist many problems related to system security and to privacy preservation.

[Figure 2.1: Example of DMG. The graph links Data Attribute nodes and Constraint nodes (with their Constraint Weights and merging Operators), carrying the Accuracy and Completeness weights.]

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. Such meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to infer some behaviors of the people living in the house (e.g., from 9 a.m. to 1 p.m. the house is empty, from 1 p.m. to 2.30 p.m. someone is at home, from 8 p.m. the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, in order to calculate, for example, the amount of power needed to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of privacy preserving techniques without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e., wrong power consumption estimations and, consequently, blackouts).

[Figure 2.2: Example ASSET. To every AIS contained in the ASSET, the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG, the three local DQ parameters are associated.]

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe the whole database here. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the table containing the localization information of the electric meters and the table containing the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter;
• Local Code: it is the code of the city/region in which the meter is located;
• City: it contains the name of the city in which the home hosting the meter is located;
• Network: it is the code of the electric network;
• Subnetwork: it is the code of the electric subnetwork.

Per-Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter;
• Average Energy Consumption: it contains the average energy used by the target home during a day;
• Date: it contains the date of registration;
• Energy Consumption per Hour: it is a vector of attributes containing the average energy consumption per hour;
• Max Energy Consumption per Hour: it is a vector of attributes containing the maximum energy consumption per hour;
• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason, it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated with the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
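Reusing the illustrative classes sketched in the previous chapter, the local-code range constraint could be instantiated as follows; the weights and the dictionary-based tuple representation are hypothetical.

# Hypothetical instantiation of the AIS Loc local-code attribute and its
# range constraint; all numbers are indicative only.
local_code = AttributeClass(name="Local Code", aw=0.8, cw=0.8)
range_constraint = SimpleConstraintClass(
    name="local_code_in_range",
    constraint=lambda t: 1 <= t["local_code"] <= 1000,  # values 1..1000
    cw=0.95,  # a violation would reveal the sanitization, so weight near 1
)
local_code.slink.append(range_constraint)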


[Figure 2.3: Example of two DMG schemas (AIS Loc and AIS Cons) associated with a National Electric Power Grid database. AIS Loc links Local Code (value in 1-1000), City, Network Number and Subnetwork Code (value in 0-50) through constraints such as "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons links the Energy Counter ID, the Date, the per-hour consumption attributes (En 1 am ... En 12 pm and Max 1 am ... Max 12 pm), the Average Energy Consumption and the Max Energy Consumption through "average energy used equal to" and "max equal to" constraints.]

Moreover, due to the relevance of the information about the regional localization of the electric meter, the weights associated with the completeness and the accuracy of the local code also have a high value.

A similar constraint exists on the subnetwork code. With regard to the completeness and the accuracy of this item, a weight that is too high may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assume a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e., "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case, the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason, their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, there exists a relation between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.

By the use of these two schemas we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to exploit these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist that are designed to limit the malicious use of clustering algorithms, of classification algorithms, and of association rule algorithms.

In this report we focus on association rule algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association rule data mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association rule data mining can be applied in a very intuitive way to categoric databases. A categoric database can easily be synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of different types of databases with different characteristics.

• Association rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can easily be converted into association rules. This is, in perspective, very important, because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows, we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or in an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value in values_of_items is 0. Size is the number of 1 values appearing in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule is reduced below min_supp or min_conf, correspondingly.


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.
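As a quick illustration of these notions, the following sketch computes support and confidence over a database represented, for brevity, as a list of item sets rather than bitmap triples; the representation and names are our simplification.

# Sketch: support and confidence of X => Y over a transactional database.
def support(D, itemset):
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(D, X, Y):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(D, X | Y) / support(D, X)

D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
print(support(D, {"a", "b"}))       # 0.75
print(confidence(D, {"a"}, {"b"}))  # 1.0: every transaction with a contains b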

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold, MST or MCT, fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is achieved by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

$Conf(X \Rightarrow Y) = \frac{Supp(X \Rightarrow Y)}{Supp(X)}$

of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level. The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T'_lr = {t in D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'_lr do {
2.      t.num_items = |I| - |lr ∩ t.values_of_items|
    }
3.  Sort(T'_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.      t = T'_lr[1]
5.      set_all_ones(t.values_of_items, lr)
6.      N_lr = N_lr + 1
7.      conf(r) = N_r / N_lr
8.      Remove t from T'_lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2:

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t in D | t fully supports r}
    foreach transaction t in Tr do {
2.      t.num_items = count(t)
    }
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.      t = Tr[1]
5.      j = choose_item(rr)
6.      set_to_zero(j, t.values_of_items)
7.      N_r = N_r - 1
8.      supp(r) = N_r / |D|
9.      conf(r) = N_r / N_lr
10.     Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3:

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t in D | t fully supports r}
    foreach transaction t in Tr do {
2.      t.num_items = count(t)
    }
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.      t = Tr[1]
5.      j = choose_item(r)
6.      set_to_zero(j, t.values_of_items)
7.      N_r = N_r - 1
8.      Recompute N_lr
9.      supp(r) = N_r / |D|
10.     conf(r) = N_r / N_lr
11.     Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r:

$minconf(r) = \frac{minsup(r) \times 100}{maxsup(l_r)}$ and $maxconf(r) = \frac{maxsup(r) \times 100}{minsup(l_r)}$,

where $l_r$ denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;
• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);
• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).


Algorithm 4:

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {T_Z | Z in Lh and for all t in D: t in T_Z => t supports Z}
Foreach Z in Lh {
3.  Sort(T_Z) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp) {
4.      t = pop_from(T_Z)
5.      a = maximal support item in Z
6.      aS = {X in Lh | a in X}
        Foreach X in aS {
7.          if (t in T_X) delete(t, T_X)
        }
8.      delete(a, t, D)
    }
}
End

Algorithm 5:

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2.  i = 0
3.  j = 0
4.  <T_Z> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp) {
6.      a = <Z>[i]
7.      t = <T_Z>[j]
8.      aS = {X in Lh | a in X}
        Foreach X in aS {
9.          if (t in T_X) then delete(t, T_X)
        }
10.     delete(a, t, D)
11.     i = (i + 1) modulo size(Z)
12.     j = j + 1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing the support or the confidence of the rule. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rule to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2.  T_Z = {t in D | t supports Z}
3.  Sort the transactions in T_Z in ascending order of transaction size
    Repeat until (minsup(Z) < MST - SM) {
4.      Place a ? mark for the item with the largest minimum support of Z in the next transaction in T_Z
5.      Update the supports of the affected itemsets
6.      Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there a need to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to give an idea of the problem here: during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). In a critical database, this behavior could have a relevant impact. For example, as usual, let us consider the Health Care database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, could choose a wrong therapy for a patient.


Algorithm 7:

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  Tr = {t in D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM) {
4.      Choose the first transaction t in Tr
5.      Choose the item j in rr with the highest minsup
6.      Place a ? mark in the place of j in t
7.      Recompute minsup(r)
8.      Recompute minconf(r)
9.      Recompute the minconf of the other affected rules
10.     Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8:

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T'_lr = {t in D | t partially supports lr and t does not fully support rr}
2.  for each t in T'_lr count the number of items of lr in t
3.  sort the transactions in T'_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT - SM) {
4.      Choose the first transaction t in T'_lr
5.      Place a ? mark in t for the items in lr that are not supported by t
6.      Recompute maxsup(lr)
7.      Recompute minconf(r)
8.      Recompute the minconf of the other affected rules
9.      Remove t from T'_lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database to be modified. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of association rule PPDM algorithms,


and more specifically in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that the ASSET schema related to a target database is available, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed, simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because they were judged, in the ASSET description phase, not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of zero-impact items related to the target rules. If they exist, we do not perform the ASSET inspection, and we use one of these zero-impact items in order to hide the target rule.


3.7 Data Quality Distortion Based Algorithm

In distortion-based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well known APRIORI algorithm [5], which, given a database, extracts all the rules with a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules, from which a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the zero-impact items (if they exist) associated with the target rule.

If the zero-impact itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the zero-impact set; if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of impact items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

¹A Market Basket database is a database in which each transaction represents the products acquired by the customers.

²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Figure 3.6: Flow chart of the DQDB algorithm: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, selection of an item from the zero-impact set; otherwise, for each item in the selected rule, inspection of the ASSET to evaluate the DQ impact (accuracy, consistency), ranking of the items by quality impact and selection of the best item; then, selection of a transaction fully supporting the sensitive rule and conversion of its value from 1 to 0, repeated until the rule is hidden.]

The final algorithm is reported in Figure 3.9. For the sake of completeness, we try to give an idea of its complexity:

• The search for all zero-impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(lg(M)), where N is the size of the rule and M is the number of different items contained in the ASSET.


INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the (possibly empty) set Zitem of zero-impact items discovered

1.  Begin
2.  Zitem = ∅
3.  Rsup = Rh
4.  Isup = list of the items contained in Rsup
5.  While (Isup ≠ ∅) {
6.      res = 0
7.      Select an item from Isup
8.      res = Search(Asset, item)
9.      if (res == 0) then Zitem = Zitem + item
10.     Isup = Isup - item
11. }
12. return(Zitem)
13. End

Figure 3.7: Zero Impact Algorithm

INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1.  Begin
2.  Qitem = ∅
3.  Confa = Sup(Rh) / Sup(ant(Rh))
4.  Confp = Confa
5.  Nstep_if_ant = 0
6.  Nstep_if_post = 0
7.  While (Confa > Th) do {
8.      Nstep_if_ant++
9.      Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
10. }
11. While (Confp > Th) do {
12.     Nstep_if_post++
13.     Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
14. }
15. For each item in Rh do {
16.     if (item ∈ ant(Rh)) then N = Nstep_if_ant
17.     else N = Nstep_if_post
18.     For each AIS ∈ Asset do {
19.         node = recover_item(AIS, item)
20.         Accur_Cost = node.accuracy_weight * N
21.         Constr_Cost = Constr_surf(N, node)
22.         item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
23.     }
24. }
25. sort_by_impact(Items)
26. Return(Items)
27. End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) denotes the antecedent component of a rule)


• The cost of the impact item set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity of O(NC), where NC is the total number of constraints contained in the ASSET. This operation is repeated for each item in the rule Rh, so it takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, whose complexity is O(|itemset| · lg(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

1.  Begin
2.  Zitem = DQDB_Zero_Impact(Asset, Rh)
3.  if (Zitem ≠ ∅) then item = random_sel(Zitem)
4.  else {
5.      Impact_set = Item_Impact_Rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6.      item = best(Impact_set)
7.  }
8.  Tr = {t ∈ D | t fully supports Rh}
9.  While (conf(Rh) > Threshold) do
10.     sanitize(Tr, item)
11. End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How to interpret this value?" The final effect is then the same as that of the Distortion-Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps Evaluation (which is of course different) and to the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.
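
The cost accumulated by this new ranking can be read, in code, as the small function below; constr_surf abstracts the report's Constr_surf(N, node) as a per-step constraint surface, an assumption made only for this sketch.

def blocking_impact(n_steps, node_complet_weight, constr_surf,
                    ais_complet_weight, ais_constr_weight):
    """Estimated DQ damage of blanking an item n_steps times (Completeness/Consistency pair)."""
    complet_cost = node_complet_weight * n_steps   # each blanked value is a missing value
    constr_cost = constr_surf * n_steps            # constraints touched by the blanked values
    return complet_cost * ais_complet_weight + constr_cost * ais_constr_weight

# An item weighted 0.8 for completeness, touching 2 constraints per step, blanked 5 times:
print(blocking_impact(5, 0.8, 2, 0.9, 0.7))   # ~ 10.6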

[Figure 3.10: Flow chart of the BBDQ Algorithm. The chart reads: identify a sensitive rule; identify the set of zero-impact items; if the set is not empty, select an item from the zero-impact set; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Completeness-Consistency), rank the items by quality impact and select the best one; then select a transaction fully supporting the sensitive rule and convert the value of the chosen item to "?"; if the rule is not yet hidden, select another transaction, otherwise END.]

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
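
A minimal sketch of the blocking step itself, assuming transactions are encoded as dicts from item to value so that a blocked entry stays visible (an illustrative encoding):

def block(transaction, victim):
    """Blocking: the value is not deleted but replaced by the unknown symbol."""
    if victim in transaction:
        transaction[victim] = "?"          # uncertainty: how to interpret this value?

def supports(transaction, items):
    """A transaction supports an itemset only if every item is present and readable."""
    return all(transaction.get(i) not in (None, "?") for i in items)

t = {"a": 1, "b": 1}
block(t, "b")
print(supports(t, {"a", "b"}))   # False: the blocked transaction no longer supports the rule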

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDQ).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

Begin
  Qitem = empty
  Conf_a = Sup(Rh) / Sup(ant(Rh))
  Nstep_ant = count_step_ant(Conf_a, Thresholds(Min, Max))
  Nstep_post = count_step_post(Conf_a, Thresholds(Min, Max))
  For each item in Rh do
    if (item in ant(Rh)) then N = Nstep_ant
    else N = Nstep_post
    For each AIS in Asset do
      node = recover_item(AIS, item)
      Complet_Cost = node.complet_weight * N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking-Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement of our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform in a burst the sanitization of all the rules. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized and removed from the sensitive set when they turn out to be hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

Begin
  Rsup = Rs
  while (Rsup != empty) do
    select a rule r from Rsup
    for (i = 0; i <= transaction.maxsize; i++) do
      if (item[i] in r) then
        item[i].count++
        item[i].list = item[i].list + r
    Rsup = Rsup - r
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences Classification Algorithm
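
A compact Python rendering of this classification, assuming each sensitive rule is given simply as the set of items it contains (an illustrative encoding):

from collections import defaultdict

def classify_items(sensitive_rules):
    """Count, for every item, how many sensitive rules it supports."""
    count = defaultdict(int)
    supported = defaultdict(list)
    for idx, rule in enumerate(sensitive_rules):
        for item in rule:
            count[item] += 1                 # occurrences of the item across the rules
            supported[item].append(idx)      # remember which rules the item touches
    ranking = sorted(count, key=count.get, reverse=True)   # most shared items first
    return ranking, supported

rules = [{"A", "B", "C"}, {"B", "C"}, {"C", "D"}]
ranking, supported = classify_items(rules)
print(ranking[0])   # -> 'C', the item supporting all three rules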

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the zero-impact items

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = empty
  while (Rsup != empty) and (Isup != empty) do
    select a rule r from Rsup
    for each (item in r) do
      if (item in Isup) then
        if (item not in Asset) then Zitem = Zitem + item
        Isup = Isup - item
    Rsup = Rsup - r
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm
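
Under the same toy encoding, the burst search of Figure 3.13 can be sketched as follows; the flat asset_items view of the Asset is again an assumption.

def multi_rule_zero_impact(sensitive_rules, asset_items):
    """Collect, in one pass, every distinct rule item that the Asset does not reference."""
    zitem, seen = [], set()
    for rule in sensitive_rules:
        for item in rule:
            if item not in seen:
                seen.add(item)                  # each distinct item is inspected once
                if item not in asset_items:     # not in any AIS: zero DQ impact
                    zitem.append(item)
    return zitem

print(multi_rule_zero_impact([{"A", "B"}, {"B", "C"}], {"A"}))   # -> ['B', 'C']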


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check which items of the item list support it. If an item supports the rule, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| * |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the zero-impact items for all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• The items supporting a rule of the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules turns out to be hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of zero-impact items; the database D
OUTPUT: a partially sanitized database PS

Begin
  Izm = {item in Itemlist | item in Zitem}
  for (i = 0; i < |Izm|; i++) do
    Rss = {r in Rs | Izm[i] in r}
    Tss = {t in D | t fully supports a rule in Rss}
    while (Tss != empty) and (Rss != empty) do
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r in Rss | (r.sup < minsup) or (r.conf < minconf) do
        Rss = Rss - r
        Rs = Rs - r
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
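
The following sketch mirrors Figure 3.14 with the toy encoding used in the previous examples (transactions as sets, a rule as an (antecedent, consequent) pair of item sets); minsup and minconf are the thresholds under which a rule counts as hidden.

def support_conf(db, ant, cons):
    sup_ant = sum(1 for t in db if ant <= t)
    sup_rule = sum(1 for t in db if (ant | cons) <= t)
    return sup_rule, (sup_rule / sup_ant if sup_ant else 0.0)

def multi_rule_zero_impact_hiding(db, rules, zitems, minsup, minconf):
    for z in zitems:                                  # most shared zero-impact items first
        rss = [r for r in rules if z in (r[0] | r[1])]
        tss = [t for t in db if any((a | c) <= t for a, c in rss)]
        while tss and rss:
            tss.pop().discard(z)                      # sanitize one supporting transaction
            for r in list(rss):                       # re-check the affected rules
                sup, conf = support_conf(db, *r)
                if sup < minsup or conf < minconf:    # the rule is now hidden
                    rss.remove(r)
                    rules.remove(r)
    return db, rules                                  # leftover rules are not yet hidden

Depending on whether the returned rule set is empty, the call yields the fully or the partially sanitized outcome discussed above.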

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules turns out to be hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item in Itemlist | item supports a rule r in Rs}
  for each item in Iml do
    for each rule in item.list do
      N = compute_step(item, thresholds)
      For each AIS in Asset do
        node = recover_item(AIS, item)
        item.cost[rule] = item.cost[rule] + ComputeCost(node, N)
    item.maxcost = maxcost(item)
  Sort Iml by max cost and number of supported rules
  While (Rs != empty) do
    Rss = {r in Rs | Iml[1] in r}
    Tss = {t in D | t fully supports a rule in Rss}
    while (Tss != empty) and (Rss != empty) do
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r in Rss | (r.sup < minsup) or (r.conf < minconf) do
        Rss = Rss - r
        Rs = Rs - r
    Iml = Iml - Iml[1]
End

Figure 3.15: Low Data Quality Impact Algorithm
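
The ordering step ("Sort Iml by max cost and number of supported rules") can be made precise as below; cost(item, rule) stands for any per-pair DQ estimate, for instance the impact functions sketched earlier in this chapter.

def order_by_max_cost(items, supported, cost):
    """Ascending maximum DQ cost; ties broken by the number of supported rules."""
    def key(item):
        max_cost = max(cost(item, r) for r in supported[item])
        return (max_cost, -len(supported[item]))   # cheap items first, many rules first
    return sorted(items, key=key)

ranked = order_by_max_cost(["A", "B"], {"A": [0], "B": [0, 1]},
                           lambda i, r: 1.0 if i == "A" else 0.3)
print(ranked)   # -> ['B', 'A']: B is cheaper to modify and supports more rules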


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem != empty) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs != empty) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
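
Reusing the sketches introduced earlier in this chapter (classify_items, multi_rule_zero_impact_hiding), the general strategy can be outlined in a few lines; the per-item impact map is again an assumed, precomputed DQ estimate, and the low-impact phase is rendered here simply as repeated hiding with the cheapest remaining item.

def general_sanitize(db, rules, asset_items, impact, minsup, minconf):
    ranking, _ = classify_items([a | c for a, c in rules])     # pre-hiding phase
    zitems = [i for i in ranking if i not in asset_items]      # ordered zero-impact items
    db, rules = multi_rule_zero_impact_hiding(db, rules, zitems, minsup, minconf)
    while rules:                                               # low-impact phase (Figure 3.15)
        candidates = {i for a, c in rules for i in a | c}
        victim = min(candidates, key=lambda i: impact.get(i, 0.0))
        db, rules = multi_rule_zero_impact_hiding(db, rules, [victim], minsup, minconf)
    return db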

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem; in fact, DQ is strongly related to the use of the data, and then indirectly to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated DQ-based PPDM algorithm.


Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases: a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principles of Database Systems pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference on Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou and H L Pazer Modelling Data and Process Quality in Multi Input Multi Output Information Systems Management Science Vol 31 Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset counting and implication rules for market basket data In Proc of the ACM SIGMOD International Conference on Management of Data year 1997 ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass): Theory and Results Advances in Knowledge Discovery and Data Mining AAAI Press/MIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining: An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM Trans Database Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control J Am Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support In Proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Computer Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databases Computer 16 (7) pp 69-82 year 1983 (July) IEEE Press

[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Reading Mass 1982

[26] V Dhar Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems Information Systems Volume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclosure Control Methods for Microdata Confidentiality Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies pp 113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Holland year 2002

[28] P Domingos and M Pazzani Beyond independence: Conditions for the optimality of the simple Bayesian classifier Proceedings of the Thirteenth International Conference on Machine Learning pp 105-112 San Francisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly 1999

[30] R O Duda and P E Hart Pattern Classification and Scene Analysis Wiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vs data utility: The R-U confidentiality map Tech Rep No 121 National Institute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in vertically partitioned databases In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizes shapes and densities in noisy high dimensional data In Proceedings of the SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data Mining SIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACM Press

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architecture Advances in Neural Information Processing Systems 2 pp 524-532 Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectures on Physics v I Reading Massachusetts Addison-Wesley Publishing Company year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines Proc Tenth ACM Symposium on Theory of Computing (1978) pp 114-118 ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discovery in Databases: An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturing testing IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEE Press

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE: An efficient clustering algorithm for large databases In Proceedings of the ACM SIGMOD Conference pp 73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK: A robust clustering algorithm for categorical attributes In Proceedings of the 15th ICDE pp 512-521 Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining: Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Jim Gray Series Editor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidate generation In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data Dallas Texas USA May 2000 ACM Press

[48] M A Hanson and R L Brekke Workload management expert system: combining neural networks and rule-based programming in an operational application In Proceedings Instrument Society of America pp 1721-1726 year 1988

[49] J Hartigan and M Wong Algorithm AS136: A k-means clustering algorithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large multimedia databases with noise In Proceedings of the 4th ACM SIGKDD pp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery pp 24-31 year 2002 IEEE Educational Activities Department

[54] G Karypis E Han and V Kumar CHAMELEON: A hierarchical clustering algorithm using dynamic modeling COMPUTER 32 pp 68-75 year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data: An Introduction to Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The EM algorithm for graphical association models with missing data Computational Statistics and Data Analysis 19 (2) pp 191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion Detection In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource: properties implications and prescriptions Sloan Management Review Cambridge Vol 40 Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal of Cryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journal of the American Society for Information Science 48 (3) pp 254-269 year 1997

[62] M Masera I Nai Fovino and R Sgnaolin A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure 29th ESReDA Seminar "Systems Analysis for a More Secure World" year 2005

[63] G McLachlan and K Basford Mixture Models: Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruning In Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology PhD thesis Library and Information Studies Rutgers New Brunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system for information downgrading In Proceedings of the 5th Joint Conference on Information Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in Privacy Preserving Data Mining ACM SIGKDD 3rd Workshop on Data Mining Standards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent Itemset Mining Proceedings of the IEEE International Conference on Privacy Security and Data Mining pp 43-54 year 2002 Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering by Data Transformation In Proceedings of the 18th Brazilian Symposium on Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41 Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology for compromise of confidential information in statistical databases ACM Trans Database Syst 12 4 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithm for Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strong rules Knowledge Discovery in Databases pp 229-238 AAAI/MIT Press year 1991

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C4.5: Programs for Machine Learning Morgan Kaufmann year 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp 81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Data Mining and Knowledge Discovery vol 4 n 4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Categorical and Numeric Attributes Proc of International Conference on Data Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in Association Rule Mining In Proceedings of the 28th International Conference on Very Large Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning in probabilistic networks with hidden variables In International Joint Conference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kaufmann

[83] D E Rumelhart G E Hinton and R J Williams Learning internal representations by error propagation Parallel Distributed Processing: Explorations in the Microstructure of Cognition Vol 1 Foundations pp 318-362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to preserve confidentiality of business statistics In Proceedings of the 2nd International Workshop on Statistical Database Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm for mining association rules in large databases In Proceedings of the Conference on Very Large Databases Zurich Switzerland September 1995 Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases Comput J 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 (July and October) 1948 pp 379-423 and pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communication University of Illinois Press Urbana Ill 1949

[89] A Shoshani Statistical databases: characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK: An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach to Rule Induction from databases IEEE Transactions on Knowledge and Data Engineering vol 3 n 4 August 1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Proceedings of the 21st International Conference on Very Large Data Bases pp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi and D P Ballou Examining Data Quality Comm of the ACM Vol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to Data Disclosure Problems Research in Official Statistics vol 4 pp 7-22 year 2001

[96] M Trottini Decision models for data disclosure limitation Carnegie Mellon University Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University CODMINE IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Mining in Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin and Y Theodoridis State-of-the-art in Privacy Preserving Data Mining SIGMOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E Dasseni Association Rule Hiding IEEE Transactions on Knowledge and Data Engineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML: The Snob program In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence pp 37-44 UNE World Scientific Publishing Co Armidale Australia 1994

[102] G J Walters Philosophical Dimensions of Privacy: An Anthology Cambridge University Press year 1984

[103] G J Walters Human Rights in an Information Age: A Philosophical Analysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in Ontological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95 Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy: what Data Quality Means to Data Consumers Journal of Management Information Systems Vol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure control Lecture Notes in Statistics Vol 155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signature Recognition 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations, while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 7: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

Introduction

IntroductionDifferent approaches have been introduced by researchers in the field of pri-

vacy preserving data mining These approaches have in general the advantageto require a minimum amount of input (usually the database the information toprotect and few other parameters) and then a low effort is required to the userin order to apply them However some tests performed by Bertino Nai andParasiliti [1097] show that actually some problems occur when applying PPDMalgorithms in contexts in which the meaning of the sanitized data has a criticalrelevance More specifically these tests (see Appendix A for more details) showthat the sanitization often introduces false information or completely changesthe meaning of the information These phenomenas are due to the fact thatcurrent approaches to PPDM algorithms do not take into account two relevantaspects

bull Relevance of data not all the information stored in the database hasthe same level of relevance and not all the information can be dealt inthe same way An example may clarify this concept Consider a categoricdatabase1 and assume we wish to hide the rule AB rarr CD A privacy-preserving association mining algorithm would retrieve all transactionssupporting this rule and then change some items (eg item C) in thesetransactions in order to change the confidence of the rule Clearly if theitem C has low relevance the sanitization will not affect much the qualityof the data If however item C represents some critical information thesanitization can result in severe damages to the data

bull Structure of the database information stored in a database is stronglyinfluenced by the relationships between the different data items Theserelationships are not always explicit Consider as an example an HealthCare database storing patientrsquos records The medicines taken by patientsare part of this database Consider two different medicines say A and B

and assume that the following two constraints are defined ldquoThe maximumamount of medicine A per day must be 10ml If the patient also takes

1We define a categoric database a database in which attributes have a finite (but eventuallylarge) number of distinct values with no ordering among the values An example is the MarketBasket database

5

6 CONTENTS

medicine B the max amount of A must be 5mlrdquo If a PPDM algorithmmodifies the data concerning to the amount of medicine A without takinginto account these constraints the effects on the health of the patientcould be potentially catastrophic

Analyzing these observations we argue that the general problem of currentPPDM algorithms is related to Data Quality In [70] a well known work in thedata quality context Orr notes that data quality and data privacy are in someway often related More precisely an high quality of data often corresponds alow level of privacy It is not simple to explain this correlation We try in thisintroduction to give an intuitive explanation of this fact in order to give an ideaof our intuition

Any ldquoSanitizationrdquo on a target database has an effect on the data qualityof the database It is obvious to deduce that in order to preserve the qualityof the database it is necessary to take into consideration some function Π(DB)representing the relevance and the use of the data contained in the DB Nowby taking another little step in the discussion it is evident that the functionΠ(DB) has a different output for different databases2 This implies that aPPDM algorithm trying to preserve the data quality needs to know specificinformation about the particular database to be protected before executing thesanitizationWe believe then that the data quality is particularly important for the case ofcritical database that is database containing data that can be critical for thelife of the society (eg medical database Electrical control system databasemilitary Database etc) For this reason data quality must be considered as thecentral point of every type of data sanitizationIn what follows we present a more detailed definition of DQ We introduce aschema able to represent the Π function and we present two algorithms basedon DQ concept

2We use here the term ldquodifferentrdquo in its wide sense ie different meaning of date differentuse of data different database architecture and schema etc

Chapter 1

Data Quality Concepts

Traditionally DQ is a measure of the consistency between the data views pre-sented by an information system and the same data in the real-world [70] Thisdefinition is strongly related with the classical definition of information systemas a ldquomodel of a finite subset of the real worldrdquo [56] More in detail Levitinand Redman [59] claim that DQ is the instrument by which it is possible toevaluate if data models are well defined and data values accurate The mainproblem with DQ is that its evaluation is relative [94] in that it usually de-pends from the context in which data are used DQ can be assumed as acomplex characteristic of the data itself In the scientific literature DQ is con-sidered a multi-dimensional concept that in some environments involves bothobjective and subjective parameters [8104105] Table 1 lists of the most usedDQ parameters [104]

Accuracy Format Comparability Precision ClarityReliability Interpretability Conciseness Flexybility UsefulnessTimeliness Content Freedom from bias Understandability QuantitativenessRelevance Efficiency Informativeness Currency SufficiencyCompleteness Importance Level of Detail consistence Usableness

Table 11 Data Quality parameters

In the context of PPDM DQ has a similar meaning with however an impor-tant difference The difference is that in the case of PPDM the real world is theoriginal database and we want to measure how closely the sanitized databaseis to the original one with respect to some relevant properties In other wordswe are interested in assessing whether given a target database the sanitizationphase will compromise the quality of the mining results that can be obtainedfrom the sanitized database In order to understand better how to measure DQwe introduce a formalization of the sanitization phase In a PPDM process agiven database DB is modified in order to obtain a new database DB prime

7

8 CHAPTER 1 DATA QUALITY CONCEPTS

Definition 1

Let DB be the database to be sanitized Let ΓDB be the set of all the aggregateinformation contained in DB Let ΞDB sube ΓDB be the set of the sensitive in-formation to hide A transformation ξ D rarr D where D is the set of possibleinstances of a DB schema is a perfect sanitization if Γξ(DB) = ΓDB minus ΞDB

In other words the ideal sanitization is the one that completely removes thesensitive high level information while preserving at the same time the otherinformation This would be possible in the case in which constraints and rela-tionship between the data and between the information contained in the data-base do not exist or roughly speaking assuming the information to be a simpleaggregation of data if the intersection between the different information is al-ways equal to empty However this hypothetical scenario due to the complexityof modern databases in not possible In fact as explained in the introductionof this section we must take into account not only the high level informationcontained in the database but we must also consider the relationships betweendifferent information or different data that is the Π function we mentioned inthe previous section In order to identify the possible parameters that can beused to assess the DQ in the context of PPDM Algorithms we performed a setof tests More in detail we perform the following operations

1 We built a set of different types of database with different data structuresand containing information completely different from the meaning and theusefulness point of view

2 We identified a set of information judged relevant for every type of data-base

3 We applied a set of PPDM algorithms to every database in order to protectthe relevant information

Observing the results we identified the following four categories of damages(or loss of quality) to the informative asset of the database

bull Ambiguous transformation This is the case in which the transforma-tion introduces some uncertainty in the non sensitive information thatcan be then misinterpreted It is for example the case of aggregation algo-rithms and Perturbation algorithms It can be viewed even as a precisionlack when for example some numerical data are standardized in order tohide some information

bull Incomplete transformation The sanitized database results incom-plete More specifically some values of the attributes contained in thedatabase are marked as ldquoBlankrdquo For this reason information may resultincomplete and cannot be used This is typically the effect of the BlockingPPDM algorithm class

bull Meaningless transformation In this case the sanitized database con-tains information without meaning That happen in many cases when

9

perturbation or heuristic-based algorithms are applied without taking intoaccount the meaning of the information stored into the database

bull Implicit Constraints Violation Every database is characterized bysome implicit constraints derived from the external world that are notdirectly represented in the database in terms of structure and relation (theexample of medicines A and B presented in the previous section could bean example) Transformations that do not consider these constraints riskto compromise the database introducing inconsistencies in the database

The results of these experiments magnify the fact that in the context ofprivacy preserving data mining only a little portion of the parameters showedin table 11 are of interest More in details are relevant the parameters allowingto capture a possible variation of meaning or that are indirectly related withthe meaning of the information stored in a target database Therefore weidentify as most appropriate DQ parameters for PPDM Algorithms the followingdimensions

bull Accuracy it measures the proximity of a sanitized value arsquo to the originalvalue a In an informal way we say that a tuple t is accurate if thesanitization has not modified the attributes contained in the tuple t

bull Completeness it evaluates the percentage of data from the original data-base that are missing from the sanitized database Therefore a tuple t iscomplete if after the sanitization every attribute is not empty

bull Consistency it is related to the semantic constraints holding on the dataand it measures how many of these constraints are still satisfied after thesanitization

We now present the formal definitions of those parameters for use in theremainder of the discussion

Let OD be the original database and SD be the sanitized database resultingfrom the application of the PPDM algorithm Without loosing generality andin order to make simpler the following definitions we assume that OD (and con-sequently SD) be composed by a single relation We also adopt the positionalnotation to denote attributes in relations Thus let odi (sdi) be the i-th tuplein OD (SD) then odik (sdik) denotes the kth attribute of odi (sdi) Moreoverlet n be the total number of the attributes of interest we assume that attributesin positions 1 m (m le n) are the primary key attributes of the relation

Definition 2

Let sdj be a tuple of SD We say that sdj is Accurate if notexistodi isin OD suchthat ((odik = sdjk)forallk = 1m and exist(odif 6= sdjf ) (sdjf 6= NULL) f = m +1 n))Definition 3

A sdj is Complete if (existodi isin OD such that (odik = sdjk)forallk = 1m) and

10 CHAPTER 1 DATA QUALITY CONCEPTS

(notexist(sdjf = NULL) f = m + 1 n)

Where we intend for ldquoNULLrdquo value a value without meaning in a well definedcontext

Let C be the set of the constraints defined on database OD in what followswe denote with cij the jth constraint on attribute i We assume here constraintson a single attribute but it is easily possible to extend the measure to complexconstraints

Definition 4

A tuple sdk is Consistent if notexistcij isin C such that cij(sdki) = false i = 1nStarting from these definitions it is possible to introduce three metrics (reportedin Table 12) that allow us to measure the lack of accuracy completeness andconsistency More specifically we define the lack of accuracy as the proportionof non accurate items (ie the amount of items modified during the sanitiza-tion) with respect to the number of items contained in the sanitized databaseSimilarly the lack of completeness is defined as the proportion of non completeitems (ie the items substituted with a NULL value during the sanitization)with respect to the total number of items contained in the sanitized databaseThe consistency lack is simply defined as the number of constraints violationin SD due to the sanitization phase

Name Short Explanation Expression

Accuracy Lack The proportion of non accurate items in SD λSD = |SDnacc||SD|

Completeness Lack The proportion of non complete items in SD ϑSD = |SDnc||SD|

Consistency Lack The number of constraint violations in SD $SD = Nc

Table 12 The three parameters of interest in PPDM DQ evaluation In theexpressions SDnacc denoted the set of not accurate items SDnc denotes the setof not complete items and NC denotes the number of constraint violations

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sani-tized database These parameters are however not sufficient to help us in theconstruction of a new PPDM algorithm based on the preservation of DQ Asexplained in the introduction of this report the main problem of actual PPDMalgorithms is that they are not aware of the relevance of the information storedin the database nor of the relationships among these information

It is necessary to provide then a formal description that allow us to magnifythe aggregate information of interest for a target database and the relevance ofDQ properties for each aggregate information (for some aggregate informationnot all the DQ properties have the same relevance) Moreover for each aggre-gate information it is necessary to provide a structure in order to describe therelevance (in term of consistency completeness and accuracy level required) atthe attribute level and the constraints the attributes must satisfy The Infor-mation Quality Model (IQM) proposed here addresses this requirement

In the following we give a formal definition of Data Model Graph (DMG)(used to represent the attributes involved in an aggregate information and theirconstraints) and Aggregation Information Schema (AIS) Moreover we give adefinition for an Aggregate information Schema Set (ASSET) Before giving thedefinition of DMG AIS and ASSET we introduce some preliminary concepts

Definition 5

An Attribute Class is defined as the tuple ATC =lt nameAWAVCWCVCSV Slink gt

where

bull Name is the attribute id

bull AW is the accuracy weight for the target attribute

bull AV is the accuracy value

bull CW is the completeness weigh for the target attribute

bull CV is the completeness value

11

12 CHAPTER 2 THE DATA QUALITY SCHEME

bull CSV is the consistency value

bull Slink is list of simple constraints

An attribute class represents an attribute of a target database schema in-volved in the construction of a certain aggregate information for which we wantto evaluate the data quality The attributes however have a different relevancein an aggregate information For this reason a set of weights is associated toevery attribute class specifying for example if for a certain attribute a lack ofaccuracy must be considered as relevant damage or not

The attributes are the bricks of our structure However a simple list of at-tributes in not sufficient In fact in order to evaluate the consistency propertywe need to take in consideration also the relationships between the attributesThe relationships can be represented by logical constraints and the validationof these constraints give us a measure of the consistency of an aggregate infor-mation As in the case of the attributes also for the constraints we need tospecify their relevance (ie a violation of a constraint cause a negligible or asevere damage) Moreover there exists constraints involving at the same timemore then two attributes In what follows the definitions of simple constraintand complex constraint are showed

Definition 6

A Simple Constraint Class is defined as the tuple SCC =lt nameConstr CWClink CSV gt

where

bull Name is the constraint id

bull Constraint describes the constraint using some logic expression

bull CW is the weigh of the constraint It represents the relevance of thisconstraint in the AIS

bull CSV is the number of violations to the constraint

bull Clink it is the list of complex constraints defined on SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC =lt nameOperator

CWCSV SCC link gt where

bull Name is the Complex Constraint id

bull Operator is the ldquoMergingrdquo operator by which the simple constraints areused to build the complex one

bull CW is the weigh of the complex constraint

bull CSV is the number of violations

13

bull SCC link is the list of all the SCC that are related to the CCC

We have now the bricks and the glue of our structure A DMG is a collectionof attributes and constraints allowing one to describe an aggregate informationMore formally let D be a database

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features

bull A set of nodes NA where each node is an Attribute Class

bull A set of nodes SCC where each node describes a Simple Constraint Class

bull A set of nodes CCC where each node describes a Complex Constraint Class

bull A set of direct edges LNjNk LNjNk isin ((NAXSCC) cup (SCCXCCC) cup(SCCXNA) cup (CCCXNA))

As is the case of attributes and constraints even at this level there exists ag-gregate information more relevant than other We define then an AIS as follows

Definition 9

An AIS φ is defined as a tuple lt γ ξ λ ϑ$WAIS gt where γ is a name ξ

is a DMG λ is the accuracy of AIS ϑ is the completeness of AIS $ is theconsistency of AIS and WAIS represent the relevance of AIS in the databaseIn a database there are usually more than an aggregate information for whichone is interested to measure the quality

Definition 9

With respect to D we define ASSET (Aggregate information Schema Set) asthe collection of all the relevant AIS of the database

The DMG completely describes the relations between the different data itemsof a given AIS and the relevance of each of these data respect to the data qualityparameter It is the ldquoroad maprdquo that is used to evaluate the quality of a sanitizedAIS

As it is probably obvious the design of a good IQM schema and more ingeneral of a good ASSET it is fundamental to characterize in a correct way thetarget database

201 An IQM Example

In order to magnify which is the real meaning of a data quality model we givein this section a brief description of a possible context of application and wethen describe an example of IQM Schema

The Example Context

Modern power grid starting today to use Internet and more generally publicnetwork in order to monitor and to manage remotely electric apparels (local

14 CHAPTER 2 THE DATA QUALITY SCHEME

Constraint13

Data Attribute13

Accuracy13

Completeness13

Operator13

Constraint13Weight13

DMG13

Figure 21 Example of DMG

energy stations power grid segments etc) [62] In such a context there exist alot of problems related to system security and to the privacy preservation

More in details we take as example a very actual privacy problem causedby the new generation of home electricity meters In fact such a type of newmeters are able to collect a big number of information related to the homeenergy consumption Moreover they are able to send the collected informationto a central remote database The information stored in such database can beeasily classified as sensitive

For example the electricity meters can register the energy consumption perhour and per day in a target house From the analysis of this information itis easy to retrieve some behavior of the people living in this house (ie fromthe 9 am to 1 pm the house is empty from the 1 pm to 230 pm some-one is at home from the 8 pm the whole family is at home etc) Thistype of information is obviously sensitive and must be protected On the otherhand information like the maximum energy consumption per family the av-erage consumption per day the subnetwork of pertinence and so on have anhigh relevance for the analysts who want to correctly analyze the power gridto calculate for example the amount of power needed in order to guarantee anacceptable service over a certain energy subnetwork

This is a good example of a situation in which the application of a privacypreserving techniques without knowledge about the meaning and the relevance

15

13

ASSET13

AIS13

AIS13AIS13

AIS13

Accuracy13Completeness13

Consistency13

Figure 22 Example ASSET to every AIS contained in the Asset the dimensionsof Accuracy Completeness and Consistency are associated Moreover to everyAIS the proper DMG is associated To every attribute and to every constraintcontained in the DMG the three local DQ parameters are associated

of the data contained in the database could cause a relevant damage (ie wrongpower consumption estimation and consequently blackout)

An IQM Example

The electric meters database contains several information related to energy con-sumption average electric frequency fault events etc It is not our purpose todescribe here the whole database In order to give an example of IQM schemawe describe two tables of this database and two DMG of a possible IQM schema

In our example we consider the table related to the localization information of the electric meters and the table related to the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is located.

• City: it contains the name of the city in which the home hosting the meter is located.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy Consumption per Hour: it is a vector of attributes containing the average energy consumption per hour.

• Max Energy Consumption per Hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive details related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of data can be extremely dangerous. In Figure 2.3 the two DMG schemas associated to the data contained in the previously described tables are shown. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
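To fix ideas, such a weighted attribute node could be encoded as in the following Python sketch. This is a hypothetical representation of ours (class names and weight values included), not a schema prescribed by the report; it only shows how a range constraint with a violation weight close to 1 can be attached to the local code attribute.

    from dataclasses import dataclass, field

    @dataclass
    class Constraint:
        check: object           # predicate over the attribute value
        weight: float           # relevance of a violation of the constraint

    @dataclass
    class AttributeNode:
        name: str
        accuracy_weight: float
        completeness_weight: float
        constraints: list = field(default_factory=list)

    # Local code: values must lie in 1..1000; an out-of-range value would
    # reveal to a malicious user that the tuple has been sanitized.
    local_code = AttributeNode(
        name="Local Code",
        accuracy_weight=0.9,       # hypothetical high weight
        completeness_weight=0.9,   # hypothetical high weight
        constraints=[Constraint(check=lambda v: 1 <= v <= 1000, weight=0.95)],
    )

    print(local_code.constraints[0].check(1250))  # False: constraint violated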


[Figure 2.3: Example of the two DMG schemas (AIS Loc and AIS Cons) associated to a National Electric Power Grid database. AIS Loc links Local Code (constraint: value in 1-1000), City, Network Number and Subnetwork Code (constraint: value in 0-50), with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons links Energy Counter Id, Date, Average Energy Consumption, Max Energy Consumption and the hourly attributes (En 1 am ... En 12 pm, Max 1 am ... Max 12 pm), with the constraints "average energy used equal to" and "max equal to".]

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists even over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assume a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
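As an illustration of how such constraints could be evaluated, the following Python sketch (our own; attribute names, tolerances and values are hypothetical) translates the consistency rules just described into simple predicates:

    def consistent_average(hourly_avg, declared_daily_avg, tol=1e-6):
        # Complex constraint: hourly values may be permuted or modified as
        # long as the declared average daily consumption stays consistent.
        return abs(sum(hourly_avg) / len(hourly_avg) - declared_daily_avg) <= tol

    def consistent_max(hourly_max, declared_daily_max):
        # Complex constraint: the declared daily maximum must agree with
        # the hourly maxima.
        return max(hourly_max) == declared_daily_max

    def max_not_below_average(hourly_avg, hourly_max):
        # Set of simple constraints: each hourly maximum cannot be smaller
        # than the corresponding hourly average.
        return all(m >= a for a, m in zip(hourly_avg, hourly_max))

    hourly = [0.2, 0.1, 0.4, 0.3]
    print(consistent_average(hourly, 0.25))                     # True
    print(max_not_below_average(hourly, [0.3, 0.2, 0.5, 0.4]))  # True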

By the use of these two schemas we have described some characteristics of the information stored in the database, related to its meaning and its usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a big range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development).



We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 Codmine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or in an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value in values_of_items is 0. Size is the number of 1 values which appear in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I.

Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose support is bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence value is bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.
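To make these definitions concrete, the following minimal Python sketch (our own illustration, with a hypothetical toy database) computes the support and confidence of a rule over bitmap-encoded transactions:

    def supports(transaction, itemset):
        # A transaction supports an itemset if every item value is 1.
        return all(transaction[i] == 1 for i in itemset)

    def support(db, itemset):
        # Fraction of transactions containing all the items of the itemset.
        return sum(supports(t, itemset) for t in db) / len(db)

    def confidence(db, antecedent, consequent):
        # Supp(X => Y) / Supp(X), following the definition above.
        return support(db, antecedent | consequent) / support(db, antecedent)

    # Toy database over I = {0, 1, 2}; rule {0} => {1}.
    db = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 1, 1]]
    print(support(db, {0, 1}))       # 0.5
    print(confidence(db, {0}, {1}))  # 0.5 / 0.75 = 0.666...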


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D' in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is achieved by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.
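The following Python sketch illustrates the support-decrease idea in the spirit of Algorithms 2 and 3 (it is our own simplified rendering, not the Codmine code: the item choice is arbitrary rather than impact-driven, and the helper names are ours):

    def _supp(db, items):
        # Fraction of transactions supporting all the given items.
        return sum(all(t[i] == 1 for i in items) for t in db) / len(db)

    def hide_by_support_decrease(db, antecedent, consequent, min_supp, min_conf):
        rule = antecedent | consequent
        # Transactions fully supporting the rule, shortest first
        # (a minimal-impact heuristic, as in Algorithm 2).
        supporters = sorted((t for t in db if all(t[i] == 1 for i in rule)),
                            key=lambda t: sum(t))
        for t in supporters:
            if (_supp(db, rule) < min_supp or
                    _supp(db, rule) / _supp(db, antecedent) < min_conf):
                break
            j = next(iter(consequent))   # arbitrary consequent item
            t[j] = 0                     # delete the item from the transaction
        return db

    db = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [0, 1, 1]]
    hide_by_support_decrease(db, {0}, {1}, min_supp=0.3, min_conf=0.6)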

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'_lr do
2.    t.num_items = |I| - |lr ∩ t.values_of_items|
3.  Sort(T'_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf)
4.    t = T'_lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    Nlr = Nlr + 1
7.    conf(r) = Nr / Nlr
8.    Remove t from T'_lr
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2

INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr - 1
8.    supp(r) = Nr / |D|
9.    conf(r) = Nr / Nlr
10.   Remove t from Tr
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr - 1
8.    Recompute Nlr
9.    supp(r) = Nr / |D|
10.   conf(r) = Nr / Nlr
11.   Remove t from Tr
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r, minconf(r) = minsup(r) * 100 / maxsup(lr) and maxconf(r) = maxsup(r) * 100 / minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
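The interval semantics above can be rendered directly in Python. The sketch below is our own reading of the definitions ('?' is the uncertainty symbol; data and names are hypothetical):

    def minsup(db, itemset):
        # Transactions that certainly support the itemset (all values are 1).
        return sum(all(t[i] == 1 for i in itemset) for t in db) / len(db)

    def maxsup(db, itemset):
        # Transactions that support or could support it (values 1 or '?').
        return sum(all(t[i] in (1, '?') for i in itemset) for t in db) / len(db)

    def minconf(db, antecedent, consequent):
        return minsup(db, antecedent | consequent) * 100 / maxsup(db, antecedent)

    def maxconf(db, antecedent, consequent):
        return maxsup(db, antecedent | consequent) * 100 / minsup(db, antecedent)

    db = [[1, 1], [1, '?'], ['?', 1], [0, 1]]
    print(minsup(db, {0, 1}), maxsup(db, {0, 1}))  # 0.25 0.75
    print(minconf(db, {0}, {1}))                   # 0.25 * 100 / 0.75 = 33.33...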


Algorithm 4

INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1. Sort(Lh) in descending order of size and support
2. Th = {TZ | Z ∈ Lh and ∀t ∈ D, t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh
{
3. Sort(TZ) in ascending order of transaction size
   Repeat until (supp(Z) < min_supp)
4.   t = pop_from(TZ)
5.   a = maximal support item in Z
6.   aS = {X ∈ Lh | a ∈ X}
     Foreach X in aS
7.     if (t ∈ TX) delete(t, TX)
8.   delete(a, t, D)
}
End

Algorithm 5

INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1. Sort(Lh) in descending order of support and size
Foreach Z in Lh do
{
2. i = 0
3. j = 0
4. <TZ> = sequence of transactions supporting Z
5. <Z> = sequence of items in Z
   Repeat until (supp(Z) < min_supp)
6.   a = <Z>[i]
7.   t = <TZ>[j]
8.   aS = {X ∈ Lh | a ∈ X}
     Foreach X in aS
9.     if (t ∈ TX) then delete(t, TX)
10.  delete(a, t, D)
11.  i = (i+1) modulo size(Z)
12.  j = j+1
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or in the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and the minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent by unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1. Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh
{
2. TZ = {t ∈ D | t supports Z}
3. Sort the transactions in TZ in ascending order of transaction size
   Repeat until (minsup(Z) < MST - SM)
4.   Place a ? mark for the item with the largest minimum support of Z in the next transaction in TZ
5.   Update the supports of the affected itemsets
6.   Update the database D
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem, we report that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient.


Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1. Tr = {t ∈ D | t fully supports r}
2. for each t in Tr count the number of items in t
3. sort the transactions in Tr in ascending order of the number of items supported
   Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM)
4.   Choose the first transaction t ∈ Tr
5.   Choose the item j in rr with the highest minsup
6.   Place a ? mark for the place of j in t
7.   Recompute minsup(r)
8.   Recompute minconf(r)
9.   Recompute the minconf of the other affected rules
10.  Remove t from Tr
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1. T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2. for each t in T'_lr count the number of items of lr in t
3. sort the transactions in T'_lr in descending order of the number of items of lr supported
   Repeat until (minconf(r) < MCT - SM)
4.   Choose the first transaction t ∈ T'_lr
5.   Place a ? mark in t for the items in lr that are not supported by t
6.   Recompute maxsup(lr)
7.   Recompute minconf(r)
8.   Recompute the minconf of the other affected rules
9.   Remove t from T'_lr
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, knowing nothing about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior lies in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We therefore introduce an approach to select the proper set of items, minimizing the impact on the database.
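The two side effects just described can be stated precisely as set differences between the rules mined before and after sanitization. The following sketch is our own formalization (the rule identifiers are hypothetical):

    def sanitization_side_effects(rules_original, rules_sanitized, sensitive):
        # Artifactual rules: mined from the sanitized database but absent before.
        artifactual = rules_sanitized - rules_original
        # Ghost rules: non-sensitive rules lost during the sanitization.
        ghost = (rules_original - sensitive) - rules_sanitized
        return artifactual, ghost

    orig = {"A=>B", "B=>C", "C=>D"}
    sanit = {"A=>B", "D=>E"}
    print(sanitization_side_effects(orig, sanit, sensitive={"C=>D"}))
    # ({'D=>E'}, {'B=>C'})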

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance.

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

In the context of Association Rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point in which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed, simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database ASSET. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.


3.7 Data Quality Distortion Based Algorithm

In the distortion based algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming to work with categorical databases like the classical Market Basket database (1), the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well known APRIORI algorithm [5], which, given a database, extracts all the contained rules having a confidence over a threshold level (2). Details on this algorithm are described in Appendix B. The output of this phase is a set of rules, from which a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact itemset is empty, it means that no item exists in the rule that can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

(1) A Market Basket database is a database in which each transaction represents the products acquired by a customer.

(2) This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Figure 3.6: Flow chart of the DQDB algorithm. A sensitive rule is identified; the set of zero-impact items is computed; if the set is not empty, an item is selected from it, otherwise each item in the rule is inspected on the ASSET and its DQ impact (accuracy, consistency) is evaluated, the items are ranked by quality impact and the best one is selected; transactions fully supporting the sensitive rule are then selected and the item value is converted from 1 to 0 until the rule is hidden.]


The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea about its complexity:

• The search of all Zero Impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(lg(M)), where N is the size of the rule and M is the number of different items contained in the ASSET.


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of Zero Impact items discovered

1  Begin
2    Zitem = ∅
3    Rsup = Rh
4    Isup = list of the items contained in Rsup
5    While (Isup ≠ ∅)
6    {
7      res = 0
8      Select item from Isup
9      res = Search(Asset, item)
10     if (res == 0) then Zitem = Zitem + item
11     Isup = Isup - item
12   }
13   return(Zitem)
14 End

Figure 3.7: Zero Impact Algorithm
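A direct Python rendering of this search could look as follows (a sketch of ours; it assumes the ASSET exposes the set of items appearing in its DMGs, so that membership can be tested directly):

    def zero_impact_items(asset_items, rule_items):
        # An item is Zero Impact if it appears in the rule but in no DMG
        # of the ASSET.
        return [item for item in rule_items if item not in asset_items]

    # Hypothetical usage: with a set, membership tests are O(1); with a
    # sorted list and binary search one obtains the O(lg(M)) lookup
    # assumed in the text.
    asset_items = {"local_code", "city", "avg_consumption"}
    print(zero_impact_items(asset_items, ["city", "promo_flag"]))  # ['promo_flag']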

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Confp = Confa
5    Nstep_if_ant = 0
6    Nstep_if_post = 0
7    While (Confa > Th) do
8    {
9      Nstep_if_ant++
10     Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
11   }
12   While (Confp > Th) do
13   {
14     Nstep_if_post++
15     Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
16   }
17   For each item in Rh do
18   {
19     if (item ∈ ant(Rh)) then N = Nstep_if_ant
20     else N = Nstep_if_post
21     For each AIS ∈ Asset do
22     {
23       node = recover_item(AIS, item)
24       Accur_Cost = node.accuracy_weight * N
25       Constr_Cost = Constr_surf(N, node)
26       item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
27     }
28   }
29   sort_by_impact(Items)
30   Return(Items)
End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of each sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the ASSET. This operation is repeated for each item in the rule Rh, and then it takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, whose complexity is |itemset| · O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.
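The two while loops of Figure 3.8 simply count sanitization steps. The following sketch (ours, assuming integer supports) reproduces that computation and can be checked by hand:

    def steps_to_hide(sup_rule, sup_ant, threshold):
        # Steps if an antecedent item is modified: each step removes one
        # transaction from both Sup(Rh) and Sup(ant(Rh)), as in Figure 3.8.
        n_ant, s_r, s_a = 0, sup_rule, sup_ant
        while s_r / s_a > threshold:
            n_ant += 1
            s_r, s_a = sup_rule - n_ant, sup_ant - n_ant
        # Steps if a consequent item is modified: only Sup(Rh) decreases.
        n_post = 0
        while (sup_rule - n_post) / sup_ant > threshold:
            n_post += 1
        return n_ant, n_post

    print(steps_to_hide(sup_rule=40, sup_ant=50, threshold=0.6))  # (25, 10)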

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1  Begin
2    Zitem = DQDB_Zero_Impact(Asset, Rh)
3    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4    else
5    {
6      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7      item = best(Impact_set)
8    }
9    Tr = {t ∈ D | t fully supports Rh}
10   While (Rh.Conf > Threshold) do
11     sanitize(Tr, item)
12 End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way, we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on data quality. For this reason, the Zero Impact algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the number of steps evaluation (which is of course different) and to the data quality evaluation.


In fact, in this case it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

[Figure 3.10: Flow chart of the BBDB algorithm. It mirrors the DQDB flow of Figure 3.6, except that the DQ impact is evaluated on the pair (Completeness, Consistency) and the value of the selected item is converted to "?" instead of 0.]

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
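A blocking step can be sketched as follows (our illustration, reusing the interval-support definitions of Section 3.4, with '?' as the uncertainty symbol and a hypothetical toy database):

    def min_confidence(db, antecedent, consequent):
        certain = lambda t, s: all(t[i] == 1 for i in s)
        possible = lambda t, s: all(t[i] in (1, '?') for i in s)
        min_sup = sum(certain(t, antecedent | consequent) for t in db)
        max_sup_ant = sum(possible(t, antecedent) for t in db)
        return min_sup / max_sup_ant

    db = [[1, 1], [1, 1], [1, 0], [0, 1]]
    while min_confidence(db, {0}, {1}) > 0.4:
        # Block the consequent item in a transaction fully supporting the
        # rule: the value becomes '?' instead of 0, as in the BBDB algorithm.
        t = next(t for t in db if t[0] == 1 and t[1] == 1)
        t[1] = '?'
    print(db)  # [[1, '?'], [1, 1], [1, 0], [0, 1]]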

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the thresholds Th
OUTPUT: the set Qitem ordered by quality impact

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6    For each item in Rh do
7    {
8      if (item ∈ ant(Rh)) then N = Nstep_ant
9      else N = Nstep_post
10     For each AIS ∈ Asset do
11     {
12       node = recover_item(AIS, item)
13       Complet_Cost = node.Complet_Weight * N
14       Constr_Cost = Constr_surf(N, node)
15       item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16     }
17   }
18   sort_by_impact(Items)
19   Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement of our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the global set of Zero Impact Items.

3. We order the Zero Impact set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized and removed from the sensitive set when they result hidden.

6. If the Zero Impact set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction_max_size; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ r
11       }
12     Rsup = Rsup - r
13   }
14   sort(item)
15   Return(item)
End

Figure 3.12: The item occurrences classification algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11         if (item ∉ Asset) then Zitem = Zitem + item
12       Isup = Isup - item
13     }
14     Rsup = Rsup - r
15   }
16   return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact algorithm


This phase executes all the operations not specifically related to the hiding itself. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are identified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| · |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact items of all the sensitive rules.
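In Python, the same classification is a few lines with a Counter. The sketch below is ours (rules are assumed to be iterables of items):

    from collections import Counter

    def classify_items(sensitive_rules):
        # For every item, count how many sensitive rules it supports and
        # remember which rules those are (used later to optimize the hiding).
        occurrences, supported = Counter(), {}
        for rid, rule in enumerate(sensitive_rules):
            for item in rule:
                occurrences[item] += 1
                supported.setdefault(item, []).append(rid)
        # Most shared items first, as required by the burst sanitization.
        return occurrences.most_common(), supported

    rules = [{"a", "b"}, {"b", "c"}, {"b", "d"}]
    print(classify_items(rules)[0])  # [('b', 3), ...] - 'b' supports all three rules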

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• the items supporting a rule in the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• a sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• a partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case, the first result rarely happens. For this reason we present here the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact items, the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i ≤ |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss - t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       forall r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do
14       {
15         Rss = Rss - r
16         Rs = Rs - r
17       }
     }
   }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding algorithm
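A compact Python rendering of this loop is sketched below (our own, under the same assumptions as the previous sketches: bitmap transactions, rules as sets of item indices, distortion-style sanitization, and a caller-supplied is_hidden predicate embedding the thresholds):

    def hide_with_zero_impact(db, rules, zero_items, is_hidden):
        # rules: the rules (sets of item indices) still to hide, i.e. Rs;
        # zero_items: Zero Impact items ordered by number of supported rules;
        # is_hidden(db, rule): True once support or confidence fell below
        # the thresholds.
        for item in zero_items:
            for rule in [r for r in rules if item in r]:            # the Rss set
                for t in [t for t in db if all(t[i] == 1 for i in rule)]:  # Tss
                    if is_hidden(db, rule):
                        break
                    t[item] = 0   # distortion-style step on a Zero Impact item
                if is_hidden(db, rule):
                    rules.remove(rule)
        return db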

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item contains the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the ASSET description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.

INPUT: the ASSET schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.list.rule.cost = item.list.rule.cost + Compute_Cost(node, N)
12       }
13     }
14     item.max_cost = max_cost(item)
15   }
16   Sort Iml by max cost and number of rules supported
17   While (Rs ≠ ∅) do
18   {
19     Rss = {r ∈ Rs | Iml[1] ∈ r}
20     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21     while (Tss ≠ ∅) and (Rss ≠ ∅) do
22     {
23       select a transaction t from Tss
24       Tss = Tss - t
25       Sanitize(t, Iml[1])
26       Recompute_Sup_Conf(Rss)
27       forall r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do
28       {
29         Rss = Rss - r
30         Rs = Rs - r
31       }
32     }
33     Iml = Iml - Iml[1]
34   }
35 End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the ASSET, the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
End

Figure 3.16: The general algorithm
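Putting the pieces together, the general strategy reads in sketch form as follows (hypothetical glue code of ours around the previous sketches, whose functions it assumes to be in scope):

    def dq_sanitize(db, sensitive_rules, asset_items, is_hidden, low_impact_hide):
        # 1. Rank the items by the number of sensitive rules they support.
        ranked, _ = classify_items(sensitive_rules)    # Counter sketch above
        # 2. Zero Impact items: ranked items absent from every DMG of the ASSET.
        zero = [item for item, _ in ranked if item not in asset_items]
        # 3. Hide in a burst every rule reachable through Zero Impact items.
        hide_with_zero_impact(db, sensitive_rules, zero, is_hidden)
        # 4. Fall back to the Low Impact algorithm for the remaining rules.
        if sensitive_rules:
            low_impact_hide(db, sensitive_rules)
        return db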

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data, and then indirectly to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization that is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, Jan. 2001, pp. 110-124. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19 (2), pp. 191-201, 1995. Elsevier Science Publishers B. V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48 (3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. In Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423, 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. In Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, vol. 3, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.


medicine B, the max amount of A must be 5 ml". If a PPDM algorithm modifies the data concerning the amount of medicine A without taking into account these constraints, the effects on the health of the patient could be potentially catastrophic.

Analyzing these observations, we argue that the general problem of current PPDM algorithms is related to Data Quality. In [70], a well known work in the data quality context, Orr notes that data quality and data privacy are in some way often related. More precisely, a high quality of data often corresponds to a low level of privacy. It is not simple to explain this correlation; in this introduction we try to give an intuitive explanation of it.

Any "sanitization" of a target database has an effect on the data quality of the database. It follows that, in order to preserve the quality of the database, it is necessary to take into consideration some function Π(DB) representing the relevance and the use of the data contained in the DB. Taking another little step in the discussion, it is evident that the function Π(DB) has a different output for different databases². This implies that a PPDM algorithm trying to preserve the data quality needs to know specific information about the particular database to be protected before executing the sanitization.

We believe that data quality is particularly important in the case of critical databases, that is, databases containing data that can be critical for the life of society (e.g. medical databases, electrical control system databases, military databases, etc.). For this reason data quality must be considered as the central point of every type of data sanitization. In what follows we present a more detailed definition of DQ, we introduce a schema able to represent the Π function, and we present two algorithms based on the DQ concept.

²We use here the term "different" in its wide sense, i.e. different meaning of data, different use of data, different database architecture and schema, etc.

Chapter 1

Data Quality Concepts

Traditionally, DQ is a measure of the consistency between the data views presented by an information system and the same data in the real world [70]. This definition is strongly related to the classical definition of an information system as a "model of a finite subset of the real world" [56]. More in detail, Levitin and Redman [59] claim that DQ is the instrument by which it is possible to evaluate whether data models are well defined and data values accurate. The main problem with DQ is that its evaluation is relative [94], in that it usually depends on the context in which data are used. DQ can be assumed as a complex characteristic of the data itself. In the scientific literature DQ is considered a multi-dimensional concept that in some environments involves both objective and subjective parameters [8, 104, 105]. Table 1.1 lists the most used DQ parameters [104].

Accuracy      Format            Comparability      Precision          Clarity
Reliability   Interpretability  Conciseness        Flexibility        Usefulness
Timeliness    Content           Freedom from bias  Understandability  Quantitativeness
Relevance     Efficiency        Informativeness    Currency           Sufficiency
Completeness  Importance        Level of Detail    Consistence        Usableness

Table 1.1: Data Quality parameters

In the context of PPDM, DQ has a similar meaning, with however an important difference. The difference is that in the case of PPDM the real world is the original database, and we want to measure how close the sanitized database is to the original one, with respect to some relevant properties. In other words, we are interested in assessing whether, given a target database, the sanitization phase will compromise the quality of the mining results that can be obtained from the sanitized database. In order to understand better how to measure DQ, we introduce a formalization of the sanitization phase. In a PPDM process, a given database DB is modified in order to obtain a new database DB′.


Definition 1

Let DB be the database to be sanitized. Let ΓDB be the set of all the aggregate information contained in DB. Let ΞDB ⊆ ΓDB be the set of the sensitive information to hide. A transformation ξ : D → D, where D is the set of possible instances of a DB schema, is a perfect sanitization if Γξ(DB) = ΓDB − ΞDB.

In other words, the ideal sanitization is the one that completely removes the sensitive high level information, while preserving at the same time the other information. This would be possible in the case in which constraints and relationships between the data, and between the information contained in the database, do not exist; or, roughly speaking, assuming the information to be a simple aggregation of data, if the intersection between the different pieces of information is always empty. However, due to the complexity of modern databases, this hypothetical scenario is not possible. In fact, as explained in the introduction of this section, we must take into account not only the high level information contained in the database, but also the relationships between different information or different data, that is, the Π function we mentioned in the previous section. In order to identify the possible parameters that can be used to assess DQ in the context of PPDM algorithms, we performed a set of tests. More in detail, we performed the following operations:

1. We built a set of different types of databases, with different data structures and containing information completely different from the meaning and usefulness point of view.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damages (or loss of quality) to the informative asset of the database:

• Ambiguous transformation. This is the case in which the transformation introduces some uncertainty in the non sensitive information, which can then be misinterpreted. It is, for example, the case of aggregation algorithms and perturbation algorithms. It can also be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation. The sanitized database results incomplete. More specifically, some values of the attributes contained in the database are marked as "Blank". For this reason information may result incomplete and cannot be used. This is typically the effect of the blocking PPDM algorithm class.

• Meaningless transformation. In this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit Constraints Violation. Every database is characterized by some implicit constraints derived from the external world that are not directly represented in the database in terms of structure and relations (the example of medicines A and B presented in the previous section is one of them). Transformations that do not consider these constraints risk compromising the database, introducing inconsistencies.

The results of these experiments show that in the context of privacy preserving data mining only a small portion of the parameters listed in Table 1.1 is of interest. More in detail, the relevant parameters are those allowing to capture a possible variation of meaning, or that are indirectly related to the meaning of the information stored in a target database. Therefore we identify as the most appropriate DQ parameters for PPDM algorithms the following dimensions:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. In an informal way, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. Therefore a tuple t is complete if after the sanitization every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of these parameters, for use in the remainder of the discussion.

Let OD be the original database and SD be the sanitized database resulting from the application of the PPDM algorithm. Without losing generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let odi (sdi) be the i-th tuple in OD (SD); then odik (sdik) denotes the k-th attribute of odi (sdi). Moreover, let n be the total number of the attributes of interest; we assume that the attributes in positions 1, ..., m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let sdj be a tuple of SD. We say that sdj is Accurate if ¬∃ odi ∈ OD such that ((odik = sdjk) ∀k = 1..m) ∧ (∃ (odif ≠ sdjf), (sdjf ≠ NULL), f = m+1..n).

Definition 3

A tuple sdj is Complete if (∃ odi ∈ OD such that (odik = sdjk) ∀k = 1..m) ∧ (¬∃ (sdjf = NULL), f = m+1..n),

where by "NULL" value we mean a value without meaning in a well defined context.

Let C be the set of the constraints defined on database OD; in what follows we denote with cij the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easily possible to extend the measures to complex constraints.

Definition 4

A tuple sdk is Consistent if ¬∃ cij ∈ C such that cij(sdki) = false, i = 1..n.

Starting from these definitions, it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non accurate items (i.e. the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non complete items (i.e. the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name               Short Explanation                             Expression
Accuracy Lack      The proportion of non accurate items in SD    λSD = |SDnacc| / |SD|
Completeness Lack  The proportion of non complete items in SD    ϑSD = |SDnc| / |SD|
Consistency Lack   The number of constraint violations in SD     $SD = Nc

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SDnacc denotes the set of not accurate items, SDnc denotes the set of not complete items and Nc denotes the number of constraint violations.
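As an illustration, the three metrics can be computed directly from Definitions 2-4. The following Python sketch assumes that relations are lists of Python lists whose first m positions are the primary key and where None stands for NULL; all names are illustrative and not part of this report:

def is_accurate(s, index, m):
    o = index.get(tuple(s[:m]))
    if o is None:
        return True  # no original tuple with the same key: vacuously accurate (Def. 2)
    # accurate if no non-key attribute was modified to a non-NULL value
    return not any(of != sf and sf is not None
                   for of, sf in zip(o[m:], s[m:]))

def accuracy_lack(od, sd, m):
    index = {tuple(t[:m]): t for t in od}  # key -> original tuple
    return sum(1 for s in sd if not is_accurate(s, index, m)) / len(sd)

def completeness_lack(od, sd, m):
    index = {tuple(t[:m]): t for t in od}
    def complete(s):  # Definition 3: matching key and no NULL non-key attribute
        return tuple(s[:m]) in index and all(v is not None for v in s[m:])
    return sum(1 for s in sd if not complete(s)) / len(sd)

def consistency_lack(sd, constraints):
    # constraints: list of boolean predicates over a tuple (Definition 4)
    return sum(1 for s in sd for c in constraints if not c(s))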

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sanitized database. These parameters are however not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are not aware of the relevance of the information stored in the database, nor of the relationships among this information.

It is then necessary to provide a formal description that allows us to highlight the aggregate information of interest for a target database, and the relevance of the DQ properties for each aggregate information (for some aggregate information not all the DQ properties have the same relevance). Moreover, for each aggregate information, it is necessary to provide a structure describing the relevance (in terms of the consistency, completeness and accuracy level required) at the attribute level, and the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG) (used to represent the attributes involved in an aggregate information and their constraints) and of the Aggregation Information Schema (AIS). Moreover, we give a definition for an Aggregate information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = ⟨name, AW, AV, CW, CV, CSV, Slink⟩

where

• Name is the attribute id.

• AW is the accuracy weight for the target attribute.

• AV is the accuracy value.

• CW is the completeness weight for the target attribute.

• CV is the completeness value.

• CSV is the consistency value.

• Slink is the list of simple constraints.

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have a different relevance in an aggregate information. For this reason a set of weights is associated to every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered as relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e. whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = ⟨name, Constraint, CW, Clink, CSV⟩

where

• Name is the constraint id.

• Constraint describes the constraint using some logic expression.

• CW is the weight of the constraint. It represents the relevance of this constraint in the AIS.

• CSV is the number of violations of the constraint.

• Clink is the list of complex constraints defined on SCC.

Definition 7

A Complex Constraint Class is defined as the tuple CCC = ⟨name, Operator, CW, CSV, SCClink⟩ where:

• Name is the complex constraint id.

• Operator is the "merging" operator by which the simple constraints are used to build the complex one.

• CW is the weight of the complex constraint.

• CSV is the number of violations.

• SCClink is the list of all the SCCs that are related to the CCC.

We have now the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• A set of nodes NA, where each node is an Attribute Class.

• A set of nodes SCC, where each node describes a Simple Constraint Class.

• A set of nodes CCC, where each node describes a Complex Constraint Class.

• A set of direct edges LNj,Nk, with LNj,Nk ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).

As in the case of attributes and constraints, even at this level there exists aggregate information more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, $, WAIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, $ is the consistency of the AIS and WAIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information whose quality one is interested in measuring.

Definition 10

With respect to D, we define ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.
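As a purely illustrative encoding, the IQM structures could be represented as plain data classes, with fields mirroring Definitions 5-10; the report does not prescribe any concrete representation, so the following Python sketch is only one possibility:

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttributeClass:                 # Definition 5
    name: str
    aw: float                         # accuracy weight
    av: float                         # accuracy value
    cw: float                         # completeness weight
    cv: float                         # completeness value
    csv: int                          # consistency value
    slink: List["SimpleConstraint"] = field(default_factory=list)

@dataclass
class SimpleConstraint:               # Definition 6
    name: str
    constraint: Callable[..., bool]   # logic expression over attribute values
    cw: float                         # relevance of a violation
    csv: int = 0                      # number of violations
    clink: List["ComplexConstraint"] = field(default_factory=list)

@dataclass
class ComplexConstraint:              # Definition 7
    name: str
    operator: str                     # "merging" operator over simple constraints
    cw: float
    csv: int = 0
    scc_link: List[SimpleConstraint] = field(default_factory=list)

@dataclass
class DMG:                            # Definition 8: nodes (edges left implicit here)
    attributes: List[AttributeClass]
    simple_constraints: List[SimpleConstraint]
    complex_constraints: List[ComplexConstraint]

@dataclass
class AIS:                            # Definition 9
    name: str
    dmg: DMG
    accuracy: float
    completeness: float
    consistency: float
    weight: float                     # relevance of the AIS in the database

ASSET = List[AIS]                     # Definition 10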

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting today to use the Internet, and more generally public networks, in order to monitor and to remotely manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist a lot of problems related to system security and to privacy preservation.

[Figure 2.1: Example of DMG. The graph links a Data Attribute node to Constraint and Operator nodes, annotated with Accuracy, Completeness and Constraint Weight.]

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. Such new meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behaviors of the people living in the house (i.e. from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, for example to calculate the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e. wrong power consumption estimation and, consequently, a blackout).

[Figure 2.2: Example ASSET. To every AIS contained in the ASSET the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.]

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter and to the per day consumption registered by a target electric meter.

More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is.

• City: it contains the name of the city in which the home hosting the meter is.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour.

• Max energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive pieces of information related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. In figure 2.3 two DMG schemas associated to the data contained in the previously described tables are shown. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple is sanitized). The weight of this constraint is then high (near to 1).

[Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. The AIS Loc DMG links Local Code (value in 1-1000), City, Network Number and Subnetwork code (value in 0-50), with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". The AIS Cons DMG links the Energy Counter Id, the Date, the per-hour energy values (En 1 am, ..., En 12 pm) with their maxima (Max 1 am, ..., Max 12 pm), the Average energy consumption (constraint: "average energy used equal to") and the Max energy consumption (constraint: "max equal to").]

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists even over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to think of a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").

AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation with the maximum consumption per hour. Finally, there exists a relation between the maximum energy consumption per hour and the corresponding average consumption per hour: the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to make the database consistent is to maintain at least the individuality of the electric meters.
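To make these constraints concrete, one plausible reading of them is sketched below as predicates over one row of the Per Day Data Consumption table. The field names (avg_day, max_day, avg_hour, max_hour) and the exact form of the "equal to" relations are illustrative assumptions, not the report's implementation:

# Illustrative constraint predicates for AIS Cons (assumed field names).
# row: dict with 'avg_day', 'max_day', 'avg_hour' (24 values), 'max_hour' (24 values).

def avg_coherent(row, tol=1e-6):
    # complex constraint: the per-hour averages must remain consistent
    # with the declared average energy consumption per day
    return abs(sum(row["avg_hour"]) / 24 - row["avg_day"]) <= tol

def max_coherent(row):
    # complex constraint: the declared daily maximum must cover the per-hour maxima
    return row["max_day"] >= max(row["max_hour"])

def hourly_max_vs_avg(row):
    # simple constraints: each max per hour cannot be smaller than
    # the corresponding average per hour
    return all(mx >= av for mx, av in zip(row["max_hour"], row["avg_hour"]))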

By the use of these two schemas we described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms realized in order to use these additional schemas to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist, which we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a big range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is prospectively very important, because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or in an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.
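As an illustration, support and confidence over the bitmap representation just described can be computed as in the following minimal Python sketch, where itemsets are given as sets of item positions (the encoding is an assumption for the example):

# db: list of transactions, each a list of 0/1 values, one per item in I.

def support(db, itemset):
    # fraction of transactions supporting every item in `itemset`
    n = sum(1 for t in db if all(t[i] == 1 for i in itemset))
    return n / len(db)

def confidence(db, antecedent, consequent):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(db, antecedent | consequent) / support(db, antecedent)

db = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 0]]
print(support(db, {0, 2}))        # 0.5
print(confidence(db, {0}, {2}))   # 0.666...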

The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y)/Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.
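As a compact, runnable rendition of the support-decrease idea behind Algorithms 2 and 3 (a sketch, not the Codmine code; the fixed item choice here stands in for the random or minimum-impact choice of the originals):

def hide_by_support(db, antecedent, consequent, min_sup, min_conf):
    # zero out one consequent item in rule-supporting transactions until
    # support or confidence drops below the thresholds
    items = antecedent | consequent
    supp = lambda s: sum(all(t[i] == 1 for i in s) for t in db) / len(db)
    while supp(items) >= min_sup and supp(items) / supp(antecedent) >= min_conf:
        # transactions fully supporting the rule, smallest first
        supporting = sorted((t for t in db if all(t[i] == 1 for i in items)),
                            key=sum)
        if not supporting:
            break
        j = next(iter(consequent))   # the item choice is exactly where the
                                     # DQ-aware variants of Section 3.6 differ
        supporting[0][j] = 0         # transaction lists are modified in place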

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level. The minimum support of an itemset is defined as the

INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
1.  T′lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T′lr do
2.    t.num_items = |I| − |lr ∩ t.values_of_items|
3.  Sort(T′lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf)
4.    t = T′lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    Nlr = Nlr + 1
7.    conf(r) = Nr/Nlr
8.    Remove t from T′lr
9.  Remove r from Rh
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease

Algorithm 2:

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_sup) or (conf(r) < min_conf)
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr − 1
8.    supp(r) = Nr/|D|
9.    conf(r) = Nr/Nlr
10.   Remove t from Tr
11. Remove r from Rh
End

Algorithm 3:

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_sup) or (conf(r) < min_conf)
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr − 1
8.    Recompute Nlr
9.    supp(r) = Nr/|D|
10.   conf(r) = Nr/Nlr
11.   Remove t from Tr
12. Remove r from Rh
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) ∗ 100)/maxsup(lr) and maxconf(r) = (maxsup(r) ∗ 100)/minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
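A small Python sketch of these interval measures over fuzzified transactions, using "?" as the unknown marker (the list encoding is assumed, following the description above):

def min_max_support(db, itemset):
    # certain: every item of the itemset is 1; possible: every item is 1 or "?"
    certain = sum(1 for t in db if all(t[i] == 1 for i in itemset))
    possible = sum(1 for t in db if all(t[i] in (1, "?") for i in itemset))
    return certain / len(db), possible / len(db)

def min_max_confidence(db, antecedent, consequent):
    # minconf = minsup(r)/maxsup(lr), maxconf = maxsup(r)/minsup(lr)
    min_s, max_s = min_max_support(db, antecedent | consequent)
    min_a, max_a = min_max_support(db, antecedent)
    return min_s / max_a, max_s / min_a

db = [[1, "?", 1], [1, 1, "?"], [0, 1, 0], [1, 0, 1]]
print(min_max_support(db, {0, 2}))            # (0.5, 0.75)
print(min_max_confidence(db, {0}, {2}))       # (0.666..., 1.0)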

Algorithm 4:

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(r) < min_sup)
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
7.      if (t ∈ TX) delete(t, TX)
8.    delete(a, t, D)
End

Algorithm 5:

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do
2.  i = 0
3.  j = 0
4.  ⟨TZ⟩ = sequence of transactions supporting Z
5.  ⟨Z⟩ = sequence of items in Z
    Repeat until (supp(r) < min_sup)
6.    a = ⟨Z⟩[i]
7.    t = ⟨TZ⟩[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
9.      if (t ∈ TX) then delete(t, TX)
10.   delete(a, t, D)
11.   i = (i+1) modulo size(Z)
12.   j = j+1
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and the minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent by unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.

INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(r) < MST − SM)
4.    Place a ? mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to explain what the problem is, we say that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non sensitive rules contained in the original database were erased by the sanitization (ghost rules). This behavior, in a critical database, could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient.

Algorithm 7:

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM)
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a ? mark for the place of j in t
7.    Recompute the minsup(r)
8.    Recompute the minconf(r)
9.    Recompute the minconf of other affected rules
10.   Remove t from Tr
11. Remove r from Rh
End

Algorithm 8:

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
1.  T′lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T′lr count the number of items of lr in t
3.  sort the transactions in T′lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM)
4.    Choose the first transaction t ∈ T′lr
5.    Place a ? mark in t for the items in lr that are not supported by t
6.    Recompute the maxsup(lr)
7.    Recompute the minconf(r)
8.    Recompute the minconf of other affected rules
9.    Remove t from T′lr
10. Remove r from Rh
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.
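These two side effects can be quantified by simple set differences between the rules mined before and after the sanitization. The sketch below assumes rules are encoded as (antecedent, consequent) pairs and that a miner such as Apriori produced the two sets; the data are illustrative:

def side_effects(rules_before, rules_after, sensitive):
    ghost = (rules_before - sensitive) - rules_after   # legitimate rules erased
    artifactual = rules_after - rules_before           # rules created from nothing
    return ghost, artifactual

before = {("AB", "C"), ("A", "D"), ("B", "E")}
after = {("A", "D"), ("C", "F")}                       # ("AB", "C") was sensitive
ghost, artifactual = side_effects(before, after, {("AB", "C")})
print(ghost)        # {('B', 'E')}  -- erased although not sensitive
print(artifactual)  # {('C', 'F')}  -- did not exist in the original database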

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms, and more in detail in the family of Perturbation Algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. A target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving an attribute is associated to a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed, simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications in the case of very complex information could be onerous in terms of performance. We note that in a target association rule a particular type of items may exist that is not contained in any DMG of the database ASSET. We call this type of items Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because they were judged, in the ASSET description phase, not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.
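A compact sketch of the combined selection strategy (Zero Impact Items first, otherwise ASSET-driven ranking); asset_attributes and dq_damage are hypothetical helpers standing for the ASSET inspection described above:

def choose_item(rule_items, asset_attributes, dq_damage):
    # Zero Impact Items: items of the rule not referenced by any DMG in the ASSET
    zero_impact = [i for i in rule_items if i not in asset_attributes]
    if zero_impact:
        return zero_impact[0]          # modifying them costs nothing in DQ terms
    # otherwise rank the items by the simulated DQ damage and pick the lowest
    return min(rule_items, key=dq_damage)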

3.7 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by distorting some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule falls under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.
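A minimal sketch of this distortion loop, assuming transactions are modeled as Python sets of items (all names are ours, for illustration only):

    def confidence(db, ant, cons):
        # db: list of transactions, each a set of items
        sup_ant = sum(1 for t in db if ant <= t)
        sup_rule = sum(1 for t in db if (ant | cons) <= t)
        return sup_rule / sup_ant if sup_ant else 0.0

    def distortion_sanitize(db, ant, cons, victim, threshold):
        # Flip the chosen item from 1 to 0 in transactions fully
        # supporting the rule until the confidence drops below threshold.
        for t in db:
            if confidence(db, ant, cons) < threshold:
                break
            if (ant | cons) <= t:
                t.discard(victim)

Recomputing the confidence at every step keeps the sketch simple; a real implementation would update the support counters incrementally.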

As shown, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well known APRIORI algorithm [5] which, given a database, extracts all the contained rules that have a confidence over a threshold level². Details on this algorithm are given in Appendix B. The output of this phase is a set of rules; from this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm); then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is a small difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set; if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

1. A Market Basket database is a database in which each transaction represents the products acquired by a customer.

2. This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Flow chart, Figure 3.6: identify a sensitive rule; compute the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (accuracy, consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item value from 1 to 0 until the rule is hidden.]

Figure 3.6: Flow chart of the DQDB Algorithm


3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. For the sake of completeness, we now give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N ∗ O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero impact items discovered

1  Begin
2    Zitem = ∅
3    Rsup = Rh
4    Isup = list of the items contained in Rsup
5    While (Isup ≠ ∅) do
6    {
7      res = 0
8      select item from Isup
9      res = Search(Asset, item)
10     if (res == 0) then Zitem = Zitem + item
11     Isup = Isup − item
12   }
13   return(Zitem)
14 End

Figure 3.7: The Zero Impact Algorithm

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1  Begin
2    Qitem = ∅
3    Conf_a = Sup(Rh) / Sup(ant(Rh))
4    Conf_p = Conf_a
5    Nstep_if_ant = 0
6    Nstep_if_post = 0
7    While (Conf_a > Th) do
8    {
9      Nstep_if_ant++
10     Conf_a = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
11   }
12   While (Conf_p > Th) do
13   {
14     Nstep_if_post++
15     Conf_p = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
16   }
17   For each item in Rh do
18   {
19     if (item ∈ ant(Rh)) then N = Nstep_if_ant
20     else N = Nstep_if_post
21     For each AIS ∈ Asset do
22     {
23       node = recover_item(AIS, item)
24       Accur_Cost = node.accuracy_Weight ∗ N
25       Constr_Cost = Constr_surf(N, node)
26       item.impact = item.impact + (Accur_Cost ∗ AIS.Accur_weight) + (Constr_Cost ∗ AIS.Constr_weight)
27     }
28   }
29   sort_by_impact(Items)
30   Return(Items)
End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (recall that ant(Rh) denotes the antecedent component of a rule)
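The step computation in lines 7-16 of Figure 3.8 amounts to the following small function (our own rendering of the same formulas; it assumes sup(Rh) < sup(ant(Rh)), so the loop terminates):

    def steps_to_hide(sup_r, sup_ant, threshold, in_antecedent):
        # Each step removes the item from one transaction fully
        # supporting the rule: if the item is in the antecedent both
        # sup(Rh) and sup(ant(Rh)) decrease, otherwise only sup(Rh).
        steps = 0
        conf = sup_r / sup_ant
        while conf > threshold:
            steps += 1
            if in_antecedent:
                conf = (sup_r - steps) / (sup_ant - steps)
            else:
                conf = (sup_r - steps) / sup_ant
        return steps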

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity of O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| ∗ O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, so the complexity is |itemset| ∗ O(lg(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

1  Begin
2    Zitem = DQDB_Zero_Impact(Asset, Rh)
3    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4    else
5    {
6      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7      item = best(Impact_set)
8    }
9    Tr = {t ∈ D | t fully supports Rh}
10   While (Rh.Conf > Threshold) do
11     sanitize(Tr, item)
12 End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as in the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Item Impact Rank Algorithm. The major modifications are related to the number-of-steps evaluation (which is of course different) and the Data Quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Item Impact Rank Algorithm.

[Flow chart, Figure 3.10: identify a sensitive rule; compute the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (completeness, consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item value to "?" until the rule is hidden.]

Figure 3.10: Flow chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
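A minimal sketch of the blocking variant, with transactions modeled as dicts mapping each item to 1, 0 or the unknown marker "?" (our own illustration):

    def certainly_supports(t, items):
        # A transaction certainly supports an itemset only if every
        # item is present with value 1; a '?' breaks certainty.
        return all(t.get(i) == 1 for i in items)

    def blocking_sanitize(db, ant, cons, victim, threshold):
        rule = ant | cons
        while True:
            sup_ant = sum(1 for t in db if certainly_supports(t, ant))
            sup_rule = sum(1 for t in db if certainly_supports(t, rule))
            conf = sup_rule / sup_ant if sup_ant else 0.0
            if conf < threshold:
                break
            # Block the item value instead of setting it to 0.
            t = next(t for t in db if certainly_supports(t, rule))
            t[victim] = '?'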

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized; D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1  Begin
2    Qitem = ∅
3    Conf_a = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Conf_a, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Conf_a, Thresholds(Min, Max))
6    For each item in Rh do
7    {
8      if (item ∈ ant(Rh)) then N = Nstep_ant
9      else N = Nstep_post
10     For each AIS ∈ Asset do
11     {
12       node = recover_item(AIS, item)
13       Complet_Cost = node.Complet_Weight ∗ N
14       Constr_Cost = Constr_surf(N, node)
15       item.impact = item.impact + (Complet_Cost ∗ AIS.Complet_weight) + (Constr_Cost ∗ AIS.Constr_weight)
16     }
17   }
18   sort_by_impact(Items)
19   Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).



2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set once they are hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase
• Multi Zero Impact Hiding
• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rule set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction.maxsize; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ r
11       }
12     Rsup = Rsup − r
13   }
14   sort(item)
15   Return(item)
End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rule set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11       {
12         if (item ∉ Asset) then Zitem = Zitem + item
13         Isup = Isup − item
14       }
15     }
16     Rsup = Rsup − r
17   }
18   return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules; moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| ∗ |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rule set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.
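In Python terms, the two pre-hiding operations can be condensed as follows (a sketch with our own names; rules are modeled as sets of items, and asset.contains is the same hypothetical test used earlier):

    from collections import Counter

    def classify_items(sensitive_rules):
        # Mirrors Figure 3.12: for every item, count how many sensitive
        # rules it supports and remember which rules those are.
        occurrences = Counter()
        supported = {}
        for idx, rule in enumerate(sensitive_rules):
            for item in rule:
                occurrences[item] += 1
                supported.setdefault(item, []).append(idx)
        # Items sorted by decreasing number of supported rules.
        return occurrences.most_common(), supported

    def multi_rule_zero_impact(sensitive_rules, asset):
        # Mirrors Figure 3.13: items supporting some rule but appearing
        # in no DMG of the Asset are Zero Impact for the whole set.
        items = set().union(*sensitive_rules) if sensitive_rules else set()
        return [i for i in items if not asset.contains(i)]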

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• the items supporting a rule of the sensitive rule set that have the important property of having no impact on the DQ of the database.

Using this relevant information we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is as follows (a compact code sketch follows the list):

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.
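The loop of steps 1-5 can be sketched as follows (our own condensation; sanitize stands for either the distortion or the blocking operation, and hidden checks the support and confidence thresholds):

    def multi_rule_zero_impact_hiding(db, rules, zero_impact, sanitize, hidden):
        # rules: sensitive rules still to hide, each a set of items
        for item in zero_impact:
            rss = [r for r in rules if item in r]
            tss = [t for t in db if any(r <= t for r in rss)]
            while tss and rss:
                sanitize(tss.pop(), item)
                for r in list(rss):
                    if hidden(db, r):
                        rss.remove(r)
                        rules.remove(r)
        return rules   # rules not supported by any Zero Impact Item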

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present in what follows the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rule set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
14       {
15         Rss = Rss − r
16         Rs = Rs − r
17       }
18     }
19   }
20 End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rule set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rule set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box, and the proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database; the sensitive rule set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.rule.cost = item.rule.cost + Compute_Cost(node, N)
12       }
13     }
14     item.maxcost = maxcost(item)
15   }
16   Sort Iml by max cost and number of rules supported
17   While (Rs ≠ ∅) do
18   {
19     Rss = {r ∈ Rs | Iml[1] ∈ r}
20     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21     while (Tss ≠ ∅) and (Rss ≠ ∅) do
22     {
23       select a transaction t from Tss
24       Tss = Tss − t
25       Sanitize(t, Iml[1])
26       Recompute_Sup_Conf(Rss)
27       for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
28       {
29         Rss = Rss − r
30         Rs = Rs − r
31       }
32     }
33     Iml = Iml − Iml[1]
34   }
35 End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rule set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
6  End

Figure 3.16: The General Algorithm
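The general algorithm of Figure 3.16 then reduces to a few lines of glue code (a sketch under the assumption that the helpers mirror Figures 3.12-3.15 and the sketches given earlier; multi_rule_low_impact_hiding is our own name for the procedure of Figure 3.15):

    def sanitize_database(asset, rules, db, thresholds, sanitize, hidden):
        # Pre-hiding: rank the items and find the global Zero Impact set.
        ranked_items, supported = classify_items(rules)
        zero_impact = multi_rule_zero_impact(rules, asset)
        # Hide as many rules as possible at zero DQ cost.
        if zero_impact:
            rules = multi_rule_zero_impact_hiding(db, rules, zero_impact,
                                                  sanitize, hidden)
        # Hide the remaining rules at minimum estimated DQ cost.
        if rules:
            multi_rule_low_impact_hiding(db, rules, asset, thresholds)
        return db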

Chapter 4

Conclusions

The problem of the impact of the sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated DQ-based PPDM algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza, A Framework for Evaluating Privacy Preserving Data Mining Algorithms, Data Mining and Knowledge Discovery Journal, year 2005, Kluwer

[11] E. Bertino and I. Nai Fovino, Information Driven Evaluation of Data Hiding Algorithms, 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005, Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] S. P. Ghosh, An application of statistical databases in manufacturing testing, In Proceedings of IEEE COMPDEC Conference, pp. 96-103, year 1984, IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988


[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T. Hsu, C. Liau and D. Wang, A Logical Model for Privacy Protection, Lecture Notes in Computer Science, Volume 2200, Jan 2001, pp. 110-124, Springer-Verlag

[52] IBM Synthetic Data Generator, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997


[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and pp. 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

Chapter 1

Data Quality Concepts

Traditionally, DQ is a measure of the consistency between the data views presented by an information system and the same data in the real world [70]. This definition is strongly related to the classical definition of an information system as a "model of a finite subset of the real world" [56]. More in detail, Levitin and Redman [59] claim that DQ is the instrument by which it is possible to evaluate whether data models are well defined and data values accurate. The main problem with DQ is that its evaluation is relative [94], in that it usually depends on the context in which data are used. DQ can be assumed as a complex characteristic of the data itself. In the scientific literature, DQ is considered a multi-dimensional concept that in some environments involves both objective and subjective parameters [8, 104, 105]. Table 1.1 lists the most used DQ parameters [104].

Accuracy     | Format           | Comparability     | Precision         | Clarity
Reliability  | Interpretability | Conciseness       | Flexibility       | Usefulness
Timeliness   | Content          | Freedom from bias | Understandability | Quantitativeness
Relevance    | Efficiency       | Informativeness   | Currency          | Sufficiency
Completeness | Importance       | Level of Detail   | Consistency       | Usableness

Table 1.1: Data Quality parameters

In the context of PPDM, DQ has a similar meaning, with however an important difference: in the case of PPDM the real world is the original database, and we want to measure how close the sanitized database is to the original one with respect to some relevant properties. In other words, we are interested in assessing whether, given a target database, the sanitization phase will compromise the quality of the mining results that can be obtained from the sanitized database. In order to understand better how to measure DQ, we introduce a formalization of the sanitization phase. In a PPDM process, a given database DB is modified in order to obtain a new database DB′.



Definition 1

Let DB be the database to be sanitized. Let ΓDB be the set of all the aggregate information contained in DB. Let ΞDB ⊆ ΓDB be the set of the sensitive information to hide. A transformation ξ : D → D, where D is the set of possible instances of a DB schema, is a perfect sanitization if Γξ(DB) = ΓDB − ΞDB.

In other words, the ideal sanitization is the one that completely removes the sensitive high level information while preserving at the same time the other information. This would be possible in the case in which constraints and relationships between the data, and between the information contained in the database, do not exist; or, roughly speaking, assuming the information to be a simple aggregation of data, if the intersection between the different information were always empty. However, due to the complexity of modern databases, this hypothetical scenario is not possible. In fact, as explained in the introduction of this section, we must take into account not only the high level information contained in the database, but also the relationships between different information or different data, that is, the Π function we mentioned in the previous section. In order to identify the possible parameters that can be used to assess DQ in the context of PPDM algorithms, we performed a set of tests. More in detail, we performed the following operations:

1. We built a set of databases of different types, with different data structures and containing information completely different from the point of view of meaning and usefulness.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damage (or loss of quality) to the informative asset of the database:

• Ambiguous transformation. This is the case in which the transformation introduces some uncertainty in the non-sensitive information, which can then be misinterpreted. This is, for example, the case of aggregation algorithms and perturbation algorithms. It can even be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation. The sanitized database results incomplete; more specifically, some values of the attributes contained in the database are marked as "Blank". For this reason information may result incomplete and cannot be used. This is typically the effect of the blocking PPDM algorithm class.

• Meaningless transformation. In this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit constraints violation. Every database is characterized by some implicit constraints derived from the external world that are not directly represented in the database in terms of structure and relations (the example of medicines A and B presented in the previous section is an instance of this). Transformations that do not consider these constraints risk compromising the database, introducing inconsistencies.

The results of these experiments magnify the fact that, in the context of privacy preserving data mining, only a small portion of the parameters shown in Table 1.1 are of interest. More in detail, the relevant parameters are those allowing to capture a possible variation of meaning, or those that are indirectly related to the meaning of the information stored in a target database. Therefore, we identify as the most appropriate DQ parameters for PPDM algorithms the following dimensions:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. In an informal way, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. Therefore, a tuple t is complete if, after the sanitization, every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of those parameters for use in theremainder of the discussion

Let OD be the original database and SD the sanitized database resulting from the application of the PPDM algorithm. Without losing generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let odi (sdi) be the i-th tuple in OD (SD); then odik (sdik) denotes the k-th attribute of odi (sdi). Moreover, let n be the total number of the attributes of interest; we assume that attributes in positions 1..m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let sdj be a tuple of SD. We say that sdj is Accurate if ∄ odi ∈ OD such that ((odik = sdjk) ∀ k = 1..m) ∧ (∃ (odif ≠ sdjf), (sdjf ≠ NULL), f = m+1..n).

Definition 3

A tuple sdj is Complete if (∃ odi ∈ OD such that (odik = sdjk) ∀ k = 1..m) ∧ (∄ (sdjf = NULL), f = m+1..n).

where by "NULL" value we mean a value without meaning in a well defined context.

Let C be the set of the constraints defined on database OD; in what follows we denote with cij the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easily possible to extend the measure to complex constraints.

Definition 4

A tuple sdk is Consistent if ∄ cij ∈ C such that cij(sdki) = false, i = 1..n. Starting from these definitions, it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non-accurate items (i.e., the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non-complete items (i.e., the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name              | Short Explanation                            | Expression
Accuracy Lack     | The proportion of non-accurate items in SD   | λSD = |SDnacc| / |SD|
Completeness Lack | The proportion of non-complete items in SD   | ϑSD = |SDnc| / |SD|
Consistency Lack  | The number of constraint violations in SD    | ϖSD = Nc

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SDnacc denotes the set of non-accurate items, SDnc denotes the set of non-complete items, and Nc denotes the number of constraint violations.
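As a minimal illustration (ours, not part of the report), the three metrics could be computed as follows, modeling tuples as dicts indexed by primary key and using None as the "NULL" marker; for brevity the sketch counts tuples rather than single items:

    def dq_lack(od, sd, constraints):
        # od, sd: dicts mapping primary key -> tuple (dict attr -> value)
        # constraints: list of predicates over a single sanitized tuple
        n = len(sd)
        non_accurate = sum(
            1 for k, t in sd.items()
            if any(v is not None and v != od[k][a] for a, v in t.items())
        )
        non_complete = sum(
            1 for t in sd.values() if any(v is None for v in t.values())
        )
        violations = sum(
            1 for t in sd.values() for c in constraints if not c(t)
        )
        return non_accurate / n, non_complete / n, violations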

Chapter 2

The Data Quality Scheme

In the previous section we presented a way to measure DQ in the sanitized database. These parameters are however not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are not aware of the relevance of the information stored in the database, nor of the relationships among these pieces of information.

It is necessary, then, to provide a formal description that allows us to magnify the aggregate information of interest for a target database and the relevance of the DQ properties for each aggregate information (for some aggregate information, not all the DQ properties have the same relevance). Moreover, for each aggregate information, it is necessary to provide a structure describing the relevance (in terms of required consistency, completeness and accuracy level) at the attribute level, together with the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG), used to represent the attributes involved in an aggregate information and their constraints, and of the Aggregation Information Schema (AIS). Moreover, we give a definition for an Aggregate information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = < name, AW, AV, CW, CV, CSV, Slink >

where

• Name is the attribute id
• AW is the accuracy weight for the target attribute
• AV is the accuracy value
• CW is the completeness weight for the target attribute
• CV is the completeness value
• CSV is the consistency value
• Slink is the list of simple constraints

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have different relevance in an aggregate information. For this reason, a set of weights is associated with every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered a relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient: in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e., whether a violation of a constraint causes a negligible or a severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows, the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = < name, Constr, CW, Clink, CSV >

where

• Name is the constraint id
• Constraint describes the constraint using some logic expression
• CW is the weight of the constraint; it represents the relevance of this constraint in the AIS
• CSV is the number of violations of the constraint
• Clink is the list of complex constraints defined on SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC = < name, Operator, CW, CSV, SCC_link >, where

• Name is the complex constraint id
• Operator is the "merging" operator by which the simple constraints are used to build the complex one
• CW is the weight of the complex constraint
• CSV is the number of violations
• SCC_link is the list of all the SCCs that are related to the CCC

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• a set of nodes NA, where each node is an Attribute Class;
• a set of nodes SCC, where each node describes a Simple Constraint Class;
• a set of nodes CCC, where each node describes a Complex Constraint Class;
• a set of direct edges LNj,Nk, with LNj,Nk ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).

As in the case of attributes and constraints, even at this level some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple < γ, ξ, λ, ϑ, ϖ, WAIS >, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS, and WAIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information for which one is interested to measure the quality.

Definition 10

With respect to D, we define an ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to correctly characterize the target database.
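To make the schema more concrete, the following is a minimal sketch in Python of how the building blocks defined above could be represented; all class and field names are our own illustrative choices, not part of the formal model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class AttributeClass:
    # An attribute of the target schema, with its DQ weights and values.
    name: str
    aw: float = 0.0    # accuracy weight
    av: float = 0.0    # accuracy value
    cw: float = 0.0    # completeness weight
    cv: float = 0.0    # completeness value
    csv: float = 0.0   # consistency value
    slink: List["SimpleConstraintClass"] = field(default_factory=list)

@dataclass
class SimpleConstraintClass:
    # < Name, Constraint, CW, Clink, CSV >
    name: str
    constraint: str    # a logic expression, e.g. "1 <= local_code <= 1000"
    cw: float          # relevance of a violation of this constraint
    csv: int = 0       # number of violations
    clink: List["ComplexConstraintClass"] = field(default_factory=list)

@dataclass
class ComplexConstraintClass:
    # < Name, Operator, CW, CSV, SCC link >
    name: str
    operator: str      # the "merging" operator over the simple constraints
    cw: float
    csv: int = 0
    scc_link: List[SimpleConstraintClass] = field(default_factory=list)

@dataclass
class DMG:
    # The oriented graph, reduced here to its node sets; the edges are
    # implicit in the slink/clink/scc_link references above.
    attributes: List[AttributeClass] = field(default_factory=list)
    simple_constraints: List[SimpleConstraintClass] = field(default_factory=list)
    complex_constraints: List[ComplexConstraintClass] = field(default_factory=list)

@dataclass
class AIS:
    # < gamma, xi, lambda, theta, consistency, W_AIS >
    name: str
    dmg: DMG
    accuracy: float
    completeness: float
    consistency: float
    weight: float      # relevance of this AIS in the database

Asset = List[AIS]      # the ASSET: the collection of all the relevant AIS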

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible application context and then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, to remotely monitor and manage electric apparatus (local energy stations, power grid segments, etc.) [62].


Figure 2.1: Example of a DMG (Data Attribute and Constraint nodes connected through an Operator node, with Accuracy and Completeness weights on the attributes and a Constraint Weight on the constraints).

In such a context many problems exist related to system security and privacy preservation.

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. Such meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to infer some behaviors of the people living in the house (e.g., from 9 a.m. to 1 p.m. the house is empty, from 1 p.m. to 2.30 p.m. someone is at home, from 8 p.m. the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has high relevance for the analysts who want to correctly analyze the power grid, for example to calculate the amount of power needed to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance


Figure 2.2: Example of an ASSET. To every AIS contained in the ASSET, the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG, the three local DQ parameters are associated.

of the data contained in the database could cause relevant damage (e.g., wrong power consumption estimation and, consequently, blackouts).

An IQM Example

The electric meters database contains several kinds of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter and to the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is.

• City: it contains the name of the city in which the home hosting the meter is located.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per-Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour.

• Max energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows the two DMG schemas associated with the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, some constraints are associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple is sanitized). The weight of this constraint is then high (near to 1).


Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. (AIS Loc: Local Code with the constraint "value in 1-1000", City, Network Number, Subnetwork Code with the constraint "value in 0-50", plus the constraints "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons: Energy Meter ID, Date, Average Energy Consumption with the constraint "average energy used equal to ...", the hourly attributes En 1 am ... En 12 pm, the hourly maxima Max 1 am ... Max 12 pm, and Max Energy Consumption with the constraint "max equal to ...".)

Moreover, due to the relevance of the information about the regional localization of the electric meter, the weights associated with the completeness and the accuracy of the local code also have a high value.

A similar constraint exists over the subnetwork code as well. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to think of a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence while avoiding the exact localization of the electric meter (i.e., "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy-consumption-per-hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
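As an illustration, the constraints just described for the consumption table can be expressed as executable checks. This is a minimal sketch under the assumption that a record is a dict carrying the hourly vectors and the daily aggregates; the field and function names are ours.

def check_consumption_record(rec):
    # rec is assumed to be a dict with keys 'avg_per_hour' and 'max_per_hour'
    # (lists of 24 hourly values) and the daily aggregates 'avg_day', 'max_day'.
    violations = []
    # complex constraint: the daily average must match the mean of the hourly averages
    if abs(sum(rec['avg_per_hour']) / 24.0 - rec['avg_day']) > 1e-6:
        violations.append('avg_day inconsistent with the hourly averages')
    # complex constraint: the daily maximum must match the largest hourly maximum
    if max(rec['max_per_hour']) != rec['max_day']:
        violations.append('max_day inconsistent with the hourly maxima')
    # simple constraints: the max per hour cannot be smaller than the avg per hour
    for hour, (mx, avg) in enumerate(zip(rec['max_per_hour'], rec['avg_per_hour'])):
        if mx < avg:
            violations.append('hour %d: max < avg' % hour)
    return violations

Counting the violations returned by such checks is what feeds the CSV values of the corresponding constraint classes.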

By the use of these two schemas we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist that are designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule Algorithms.

In this report we focus on Association Rule Algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of databases with different characteristics.

• Association Rules seem to be the most appropriate technique to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution in the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations


about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple < TID, values_of_items, size >, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf, correspondingly.
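For concreteness, support and confidence can be computed directly over a transactional database. The following is a minimal Python sketch (the function names are ours); the last lines show a small worked example.

def support(itemset, db):
    # db: list of transactions, each represented as a set of item identifiers
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(antecedent | consequent, db) / support(antecedent, db)

# Small worked example: {a} appears in 3 of 4 transactions, {a, b} in 2 of 4.
db = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'b'}]
print(support({'a', 'b'}, db))       # 0.5
print(confidence({'a'}, {'b'}, db))  # 0.666...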


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold (MST or MCT), which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level. The minimum support of an itemset is defined as the


INPUT: the source database D; the number |D| of transactions in D; a set Rh of rules to hide; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  T'lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'lr do
2.    t.num_items = |I| − |lr ∩ t.values_of_items|
3.  Sort(T'lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf)
    {
4.    t = T'lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    Nlr = Nlr + 1
7.    conf(r) = Nr / Nlr
8.    Remove t from T'lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2

INPUT: the source database D; the size of the database |D|; a set Rh of rules to hide; the min_supp threshold; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
    {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr − 1
8.    supp(r) = Nr / |D|
9.    conf(r) = Nr / Nlr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3

INPUT and OUTPUT as in Algorithm 2.

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
    {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr − 1
8.    Recompute Nlr
9.    supp(r) = Nr / |D|
10.   conf(r) = Nr / Nlr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support value, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) * 100) / maxsup(lr) and maxconf(r) = (maxsup(r) * 100) / minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST.

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A).

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
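These support bounds, and the resulting visibility classification, can be computed directly once transactions may carry the '?' symbol. A minimal sketch, assuming each transaction is represented as a dict mapping items to 1, 0 or '?' (the representation and names are ours):

def support_bounds(itemset, db):
    # Minimum support: transactions that certainly support the itemset.
    # Maximum support: transactions that support or could support it.
    n = len(db)
    certain = sum(1 for t in db if all(t.get(i) == 1 for i in itemset))
    possible = sum(1 for t in db if all(t.get(i) in (1, '?') for i in itemset))
    return certain / n, possible / n

def visibility(itemset, db, mst):
    minsup, maxsup = support_bounds(itemset, db)
    if maxsup < mst:
        return 'hidden'
    if minsup >= mst:
        return 'visible'
    return 'visible with an uncertainty level'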


Algorithm 4

INPUT: the source database D; the size of the database |D|; a set L of large itemsets; the set Lh of large itemsets to hide; the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D, t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh
{
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp)
    {
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
7.      if (t ∈ TX) delete(t, TX)
8.    delete(a, t, D)
    }
}
End

Algorithm 5

INPUT and OUTPUT as in Algorithm 4.

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do
{
2.  i = 0
3.  j = 0
4.  <TZ> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp)
    {
6.    a = <Z>[i]
7.    t = <TZ>[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
9.      if (t ∈ TX) then delete(t, TX)
10.   delete(a, t, D)
11.   i = (i + 1) modulo size(Z)
12.   j = j + 1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D; a set L of large itemsets; the set Lh of large itemsets to hide; MST and SM
OUTPUT: the database D modified by the fuzzification of large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh
{
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST − SM)
    {
4.    Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: why is there any need to improve these algorithms if they all work well? There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem, we note that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). This behavior in a critical database could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor that, on the basis of these rules, can choose a wrong therapy for a patient. In


Algorithm 7

INPUT: the source database D; a set Rh of rules to hide; MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM)
    {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a "?" mark in the place of j in t
7.    Recompute minsup(r)
8.    Recompute minconf(r)
9.    Recompute the minconf of other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D; a set Rh of rules to hide; MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1.  T'lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T'lr count the number of items of lr in t
3.  sort the transactions in T'lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM)
    {
4.    Choose the first transaction t ∈ T'lr
5.    Place a "?" mark in t for the items in lr that are not supported by t
6.    Recompute maxsup(lr)
7.    Recompute minconf(r)
8.    Recompute the minconf of other affected rules
9.    Remove t from T'lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We therefore introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance. In the context of Association Rule PPDM algorithms,


and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database ASSET. We call this type of item Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.
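The whole selection strategy, including the zero-impact shortcut, can be condensed as follows. This is a sketch under the assumption that dq_damage(asset, item) implements the simulation of step 3; all names are ours.

def select_item_to_modify(rule_items, asset_items, asset, dq_damage):
    # rule_items: the items involved in the sensitive rule s (steps 1-2).
    # asset_items: the set of items appearing in some DMG of the ASSET.
    zero_impact = [i for i in rule_items if i not in asset_items]
    if zero_impact:
        return zero_impact[0]  # shortcut: no DQ damage at all
    # Otherwise rank the items by simulated DQ damage (steps 3-4)
    # and take the one with the lowest impact (step 5).
    return min(rule_items, key=lambda i: dq_damage(asset, i))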


3.7 Data Quality Distortion Based Algorithm

In distortion based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming to work with categorical databases like the classical Market Basket database1, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this it is possible to use a well known algorithm, named the APRIORI algorithm [5], that, given a database, extracts all the contained rules that have a confidence over a threshold level2. Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates

1 A Market Basket database is a database in which each transaction represents the products acquired by a customer.

2 This threshold level represents the confidence over which a rule can be identified as relevant and sure.


Figure 3.6: Flow Chart of the DQDB Algorithm. (Identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, an item is selected from the zero-impact set; otherwise, for each item in the selected rule the Asset is inspected and the DQ impact (Accuracy, Consistency) is evaluated, the items are ranked by quality impact and the best item is selected; then a transaction fully supporting the sensitive rule is selected and its value converted from 1 to 0, repeating until the rule is hidden.)

the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea about its complexity:

• The search of all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the ASSET.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero impact items discovered

1   Begin
2     Zitem = ∅
3     Rsup = Rh
4     Isup = list of the items contained in Rsup
5     While (Isup ≠ ∅)
6     {
7       res = 0
8       Select item from Isup
9       res = Search(Asset, item)
10      if (res == 0) then Zitem = Zitem + item
11      Isup = Isup − item
12    }
13    return(Zitem)
14  End

Figure 3.7: Zero Impact Algorithm
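A direct transcription of Figure 3.7 into runnable Python could look as follows; here the ASSET is abstracted as the set of items appearing in its DMGs (an assumption of ours, matching the use of Search(Asset, item) above).

def zero_impact(asset_items, rule_items):
    # Returns the (possibly empty) list of items of the rule that do not
    # appear in any DMG of the ASSET, i.e. whose modification costs no DQ.
    zitem = []
    for item in rule_items:
        if item not in asset_items:  # plays the role of Search(Asset, item) == 0
            zitem.append(item)
    return zitem

If the item set of the ASSET is kept sorted, the membership test can be implemented as a binary search, matching the N * O(lg(M)) bound given above.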

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Confp = Confa
5     Nstep_if_ant = 0
6     Nstep_if_post = 0
7     While (Confa > Th) do
8     {
9       Nstep_if_ant++
10      Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
11    }
12    While (Confp > Th) do
13    {
14      Nstep_if_post++
15      Confp = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
16    }
17    For each item in Rh do
18    {
19      if (item ∈ ant(Rh)) then N = Nstep_if_ant
20      else N = Nstep_if_post
21      For each AIS ∈ Asset do
22      {
23        node = recover_item(AIS, item)
24        Accur_Cost = node.accuracy_Weight * N
25        Constr_Cost = Constr_surf(N, node)
26        item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
27      }
28    }
29    sort_by_impact(Items)
30    Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

1   Begin
2     Zitem = DQDB_Zero_Impact(Asset, Rh)
3     if (Zitem ≠ ∅) then item = random_sel(Zitem)
4     else
5     {
6       Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7       item = best(Impact_set)
8     }
9     Tr = {t ∈ D | t fully supports Rh}
10    While (Rh.Conf > Threshold) do
11      sanitize(Tr, item)
12  End

Figure 3.9: The Distortion Based Sanitization Algorithm
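Putting the pieces together, the distortion based sanitization of Figure 3.9 can be sketched in runnable Python as follows; impact_rank stands in for the Item Impact Rank algorithm of Figure 3.8, and the transaction representation (dicts mapping items to 0/1) is our own assumption.

import random

def conf(ant, cons, db):
    # db: list of transactions, each a dict mapping items to 0/1
    n_ant = sum(1 for t in db if all(t.get(i) == 1 for i in ant))
    n_rule = sum(1 for t in db if all(t.get(i) == 1 for i in ant | cons))
    return n_rule / n_ant if n_ant else 0.0

def dqdb_distortion(db, ant, cons, asset_items, impact_rank, threshold):
    # Choose a zero-impact item if one exists, otherwise the lowest-impact one.
    zitem = [i for i in ant | cons if i not in asset_items]
    item = random.choice(zitem) if zitem else impact_rank(ant, cons)[0]
    supporting = [t for t in db if all(t.get(i) == 1 for i in ant | cons)]
    while supporting and conf(ant, cons, db) > threshold:
        t = supporting.pop()
        t[item] = 0  # distortion: change the chosen item value from 1 to 0
    return db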

3.8 Data Quality Based Blocking Algorithm

In blocking based algorithms the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the data quality. For this reason the Zero Impact algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the number-of-steps evaluation (which is of course different) and to the data quality evaluation. In fact, in this case it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

Figure 3.10: Flow Chart of the BBDB Algorithm. (Same structure as Figure 3.6, but the DQ impact is evaluated on the pair Completeness-Consistency and the value of the selected item is converted to "?" instead of 0.)

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
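In code, the only change with respect to the distortion sketch given earlier is the sanitization primitive; a minimal sketch (names are ours):

def sanitize_blocking(transaction, item):
    # blocking: replace the value with the uncertainty symbol
    transaction[item] = '?'

def sanitize_distortion(transaction, item):
    # distortion: flip the supporting value from 1 to 0
    transaction[item] = 0

After blocking, support and confidence must be recomputed as the (minimum, maximum) intervals of Section 3.4, since the '?' values make them no longer unequivocally determined.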

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDQ).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the thresholds Th
OUTPUT: the set Qitem ordered by Quality Impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5     Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6     For each item in Rh do
7     {
8       if (item ∈ ant(Rh)) then N = Nstep_ant
9       else N = Nstep_post
10      For each AIS ∈ Asset do
11      {
12        node = recover_item(AIS, item)
13        Complet_Cost = node.Complet_Weight * N
14        Constr_Cost = Constr_surf(N, node)
15        item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16      }
17    }
18    sort_by_impact(Items)
19    Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that in some sanitizations items are modified that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows (a code sketch of the ordering steps is given after the list):

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the global set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set once they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.
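Steps 1-4 of this strategy, i.e. counting the occurrences and ordering the candidate items, can be sketched as follows (a simplified greedy ordering; the names and the reduction of the ASSET to a set of items are our own assumptions):

from collections import defaultdict

def order_candidate_items(sensitive_rules, asset_items):
    # sensitive_rules: list of rules, each a set of items.
    # Returns the items ordered so that zero-impact items come first and,
    # within each group, items supporting more rules come first.
    occurrences = defaultdict(int)
    for r in sensitive_rules:
        for item in r:
            occurrences[item] += 1
    def key(item):
        zero = item not in asset_items
        return (not zero, -occurrences[item])
    return sorted(occurrences, key=key)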

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1   Begin
2     Rsup = Rs
3     while (Rsup ≠ ∅) do
4     {
5       select r from Rsup
6       for (i = 0; i ≤ transaction_max_size; i++) do
7         if (item[i] ∈ r) then
8         {
9           item[i].count++
10          item[i].list = item[i].list ∪ {r}
11        }
12      Rsup = Rsup − r
13    }
14    sort(item)
15    Return(item)
    End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

1   Begin
2     Rsup = Rs
3     Isup = list of all the items
4     Zitem = ∅
5     while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6     {
7       select r from Rsup
8       for each (item ∈ r) do
9       {
10        if (item ∈ Isup) then
11        {
12          if (item ∉ Asset) then Zitem = Zitem + item
13          Isup = Isup − item
14        }
15      }
16      Rsup = Rsup − r
17    }
18    return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| * |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule of the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process with those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason we present here the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

1   Begin
2     Izm = {item ∈ Itemlist | item ∈ Zitem}
3     for (i = 0; i < |Izm|; i++) do
4     {
5       Rss = {r ∈ Rs | Izm[i] ∈ r}
6       Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7       while (Tss ≠ ∅) and (Rss ≠ ∅) do
8       {
9         select a transaction t from Tss
10        Tss = Tss − t
11        Sanitize(t, Izm[i])
12        Recompute_Sup_Conf(Rss)
13        ∀ r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do
14        {
15          Rss = Rss − r
16          Rs = Rs − r
17        }
18      }
19    }
    End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

1   Begin
2     Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3     for each item ∈ Iml do
4     {
5       for each rule ∈ item.list do
6       {
7         N = compute_step(item, thresholds)
8         For each AIS ∈ Asset do
9         {
10          node = recover_item(AIS, item)
11          item.rule_cost = item.rule_cost + Compute_Cost(node, N)
12        }
13      }
14      item.max_cost = max_cost(item)
15    }
16    Sort Iml by max cost and number of rules supported
17    While (Rs ≠ ∅) do
18    {
19      Rss = {r ∈ Rs | Iml[1] ∈ r}
20      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21      while (Tss ≠ ∅) and (Rss ≠ ∅) do
22      {
23        select a transaction t from Tss
24        Tss = Tss − t
25        Sanitize(t, Iml[1])
26        Recompute_Sup_Conf(Rss)
27        ∀ r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do
28        {
29          Rss = Rss − r
30          Rs = Rs − r
31        }
32      }
33      Iml = Iml − Iml[1]
34    }
    End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

1   Begin
2     Item = Item_Occ_Class(Rs, Itemlist)
3     Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4     if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5     if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
    End

Figure 3.16: The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we have then proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.


Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.


[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.

46 BIBLIOGRAPHY

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The Cascade-Correlation Learning Architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An Application of Statistical Databases in Manufacturing Testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An Application of Statistical Databases in Manufacturing Testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining Frequent Patterns Without Candidate Generation. In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload Management Expert System - Combining Neural Networks and Rule-Based Programming in an Operational Application. In Proceedings Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A K-Means Clustering Algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An Efficient Approach to Clustering Large Multimedia Databases with Noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM Algorithm for Graphical Association Models with Missing Data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as Resource: Properties, Implications and Prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, Vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-Based Decision Tree Pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A Decision Theoretical Based System for Information Downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The Use of Regression Methodology for Compromise of Confidential Information in Statistical Databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, Analysis and Presentation of Strong Rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of Decision Trees. Machine Learning, Vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, Vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local Learning in Probabilistic Networks with Hidden Variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning Internal Representations by Error Propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated Cell Suppression to Preserve Confidentiality of Business Statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information Loss in Partitioned Statistical Databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October), 1948, pp. 379-423 and pp. 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical Databases: Characteristics, Problems and Some Solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An Optimally Efficient Algorithm for the Single Link Cluster Method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions On Knowledge And Data Engineering, Vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, Vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan, Computer Technology Institute and Sabanci University. CODMINE, IST Project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic Classification by MML: The Snob Program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, Chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New Algorithms for Fast Discovery of Association Rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.




Definition 1

Let DB be the database to be sanitized. Let Γ_DB be the set of all the aggregate information contained in DB, and let Ξ_DB ⊆ Γ_DB be the set of the sensitive information to hide. A transformation ξ : D → D, where D is the set of possible instances of a DB schema, is a perfect sanitization if Γ_ξ(DB) = Γ_DB − Ξ_DB.

In other words, the ideal sanitization is the one that completely removes the sensitive high-level information while preserving, at the same time, the other information. This would be possible if constraints and relationships between the data, and between the pieces of information contained in the database, did not exist; roughly speaking, assuming the information to be a simple aggregation of data, if the intersection between the different pieces of information were always empty. However, due to the complexity of modern databases, this hypothetical scenario is not possible. In fact, as explained in the introduction of this section, we must take into account not only the high-level information contained in the database, but also the relationships between different pieces of information or different data, that is, the Π function we mentioned in the previous section. In order to identify the possible parameters that can be used to assess the DQ in the context of PPDM algorithms, we performed a set of tests. More in detail, we performed the following operations:

1. We built a set of databases of different types, with different data structures and containing information completely different from the point of view of meaning and usefulness.

2. We identified a set of information judged relevant for every type of database.

3. We applied a set of PPDM algorithms to every database in order to protect the relevant information.

Observing the results, we identified the following four categories of damage (or loss of quality) to the informative asset of the database:

• Ambiguous transformation. This is the case in which the transformation introduces some uncertainty in the non-sensitive information, which can then be misinterpreted. It is, for example, the case of aggregation algorithms and perturbation algorithms. It can also be viewed as a lack of precision when, for example, some numerical data are standardized in order to hide some information.

• Incomplete transformation. The sanitized database is incomplete. More specifically, some values of the attributes contained in the database are marked as "Blank". For this reason, information may be incomplete and cannot be used. This is typically the effect of the blocking class of PPDM algorithms.

• Meaningless transformation. In this case the sanitized database contains information without meaning. This happens in many cases when perturbation or heuristic-based algorithms are applied without taking into account the meaning of the information stored in the database.

• Implicit constraints violation. Every database is characterized by some implicit constraints, derived from the external world, that are not directly represented in the database in terms of structure and relations (the example of medicines A and B presented in the previous section is one of them). Transformations that do not consider these constraints risk compromising the database, introducing inconsistencies.

The results of these experiments highlight the fact that, in the context of privacy preserving data mining, only a small portion of the parameters shown in Table 1.1 is of interest. More in detail, the relevant parameters are those capturing a possible variation of meaning, or those indirectly related to the meaning of the information stored in a target database. Therefore, we identify as the most appropriate DQ parameters for PPDM algorithms the following dimensions:

• Accuracy: it measures the proximity of a sanitized value a′ to the original value a. Informally, we say that a tuple t is accurate if the sanitization has not modified the attributes contained in the tuple t.

• Completeness: it evaluates the percentage of data from the original database that are missing from the sanitized database. A tuple t is therefore complete if, after the sanitization, every attribute is not empty.

• Consistency: it is related to the semantic constraints holding on the data, and it measures how many of these constraints are still satisfied after the sanitization.

We now present the formal definitions of these parameters, for use in the remainder of the discussion.

Let OD be the original database and SD the sanitized database resulting from the application of the PPDM algorithm. Without loss of generality, and in order to simplify the following definitions, we assume that OD (and consequently SD) is composed of a single relation. We also adopt the positional notation to denote attributes in relations. Thus, let od_i (sd_i) be the i-th tuple in OD (SD); then od_ik (sd_ik) denotes the k-th attribute of od_i (sd_i). Moreover, let n be the total number of attributes of interest; we assume that the attributes in positions 1, ..., m (m ≤ n) are the primary key attributes of the relation.

Definition 2

Let sd_j be a tuple of SD. We say that sd_j is Accurate if ¬∃ od_i ∈ OD such that ((od_ik = sd_jk) ∀k = 1..m) ∧ (∃(od_if ≠ sd_jf), (sd_jf ≠ NULL), f = m+1..n).

Definition 3

A tuple sd_j is Complete if (∃ od_i ∈ OD such that (od_ik = sd_jk) ∀k = 1..m) ∧ (¬∃(sd_jf = NULL), f = m+1..n),

where we intend by "NULL" value a value without meaning in a well defined context.

Let C be the set of the constraints defined on the database OD; in what follows we denote with c_ij the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easy to extend the measure to complex constraints.

Definition 4

A tuple sd_k is Consistent if ¬∃ c_ij ∈ C such that c_ij(sd_ki) = false, i = 1..n.

Starting from these definitions it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non-accurate items (i.e. the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non-complete items (i.e. the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The consistency lack is simply defined as the number of constraint violations in SD due to the sanitization phase.

Name              | Short Explanation                           | Expression
Accuracy Lack     | The proportion of non-accurate items in SD  | λ_SD = |SD_nacc| / |SD|
Completeness Lack | The proportion of non-complete items in SD  | ϑ_SD = |SD_nc| / |SD|
Consistency Lack  | The number of constraint violations in SD   | ϖ_SD = N_c

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SD_nacc denotes the set of non-accurate items, SD_nc denotes the set of non-complete items, and N_c denotes the number of constraint violations.
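As an illustration, the three lack metrics can be computed directly from their definitions. The following minimal sketch assumes that tuples are stored as Python lists, that the caller supplies an accuracy predicate, and that constraints are predicates over a single tuple; all names are illustrative and not part of the report.

def accuracy_lack(sd, is_accurate):
    # lambda_SD: proportion of non-accurate items (tuples) in SD.
    return sum(1 for t in sd if not is_accurate(t)) / len(sd)

def completeness_lack(sd):
    # theta_SD: proportion of tuples where some attribute was replaced
    # by a NULL (here None) during the sanitization.
    return sum(1 for t in sd if any(v is None for v in t)) / len(sd)

def consistency_lack(sd, constraints):
    # varpi_SD: number of constraint violations in SD; each constraint is
    # a predicate evaluated on a single tuple, as in Definition 4.
    return sum(1 for t in sd for c in constraints if not c(t))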

Chapter 2

The Data Quality Scheme

In the previous section we presented a way to measure DQ in the sanitized database. These parameters are, however, not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are not aware of the relevance of the information stored in the database, nor of the relationships among those pieces of information.

It is then necessary to provide a formal description that allows us to highlight the aggregate information of interest for a target database, and the relevance of the DQ properties for each piece of aggregate information (for some aggregate information, not all the DQ properties have the same relevance). Moreover, for each piece of aggregate information it is necessary to provide a structure describing the relevance (in terms of the consistency, completeness and accuracy level required) at the attribute level, as well as the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG), used to represent the attributes involved in an aggregate information and their constraints, and of the Aggregate Information Schema (AIS). Moreover, we give a definition of the Aggregate Information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = ⟨name, AW, AV, CW, CV, CSV, Slink⟩, where:

• name is the attribute id;

• AW is the accuracy weight for the target attribute;

• AV is the accuracy value;

• CW is the completeness weight for the target attribute;

• CV is the completeness value;

• CSV is the consistency value;

• Slink is the list of simple constraints.

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have a different relevance within an aggregate information. For this reason, a set of weights is associated with every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered as relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property, we also need to take into consideration the relationships between the attributes. These relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, for the constraints too we need to specify their relevance (i.e. whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows, the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = ⟨name, Constraint, CW, Clink, CSV⟩, where:

• name is the constraint id;

• Constraint describes the constraint using some logic expression;

• CW is the weight of the constraint; it represents the relevance of this constraint in the AIS;

• CSV is the number of violations of the constraint;

• Clink is the list of complex constraints defined on the SCC.

Definition 7

A Complex Constraint Class is defined as the tuple CCC = ⟨name, Operator, CW, CSV, SCClink⟩, where:

• name is the complex constraint id;

• Operator is the "merging" operator by which the simple constraints are used to build the complex one;

• CW is the weight of the complex constraint;

• CSV is the number of violations;

• SCClink is the list of all the SCCs that are related to the CCC.

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• a set of nodes NA, where each node is an Attribute Class;

• a set of nodes SCC, where each node describes a Simple Constraint Class;

• a set of nodes CCC, where each node describes a Complex Constraint Class;

• a set of direct edges L_{Nj,Nk}, with L_{Nj,Nk} ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).

As is the case for attributes and constraints, at this level too some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, ϖ, W_AIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS, and W_AIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define the ASSET (Aggregate information Schema Set) as the collection of all the relevant AISs of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.
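To make the structure concrete, the following sketch shows one possible encoding of the IQM building blocks as plain container classes; the field names mirror Definitions 5-10, while the class layout itself is only an assumption of this example and not part of the report.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttributeClass:            # ATC = <name, AW, AV, CW, CV, CSV, Slink>
    name: str
    aw: float                    # accuracy weight
    av: float                    # accuracy value
    cw: float                    # completeness weight
    cv: float                    # completeness value
    csv: int                     # consistency value
    slink: List["SimpleConstraint"] = field(default_factory=list)

@dataclass
class SimpleConstraint:          # SCC = <name, Constraint, CW, Clink, CSV>
    name: str
    constraint: Callable         # the logic expression, as a predicate
    cw: float                    # relevance of the constraint in the AIS
    csv: int = 0                 # number of violations
    clink: List["ComplexConstraint"] = field(default_factory=list)

@dataclass
class ComplexConstraint:         # CCC = <name, Operator, CW, CSV, SCClink>
    name: str
    operator: str                # merging operator over the simple constraints
    cw: float
    csv: int = 0
    scc_link: List[SimpleConstraint] = field(default_factory=list)

@dataclass
class DMG:                       # nodes of the Data Model Graph
    attributes: List[AttributeClass]
    simple_constraints: List[SimpleConstraint]
    complex_constraints: List[ComplexConstraint]

@dataclass
class AIS:                       # <gamma, xi, lambda, theta, varpi, W_AIS>
    name: str
    dmg: DMG
    accuracy: float
    completeness: float
    consistency: float
    weight: float                # relevance of the AIS in the database

# An ASSET is simply the collection of all the relevant AISs of the database.
Asset = List[AIS]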

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, in order to remotely monitor and manage electric apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist many problems related to system security and to privacy preservation.

[Figure 2.1: Example of DMG. The diagram shows a DMG linking data attributes and constraints, with their weights and the associated accuracy and completeness dimensions.]

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. Such meters are able to collect a large amount of information related to home energy consumption, and they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to infer some behavior of the people living in the house (e.g. from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on is highly relevant for the analysts who want to correctly analyze the power grid, for example to calculate the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e. wrong power consumption estimations and, consequently, blackouts).

[Figure 2.2: Example of ASSET. To every AIS contained in the ASSET, the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG, the three local DQ parameters are associated.]

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the table holding the localization information of the electric meter and the table holding the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter;

• Local Code: it is the code of the city/region in which the meter is located;

• City: it contains the name of the city in which the home hosting the meter is located;

• Network: it is the code of the electric network;

• Subnetwork: it is the code of the electric subnetwork.

Per-Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter;

• Average Energy Consumption: it contains the average energy used by the target home during a day;

• Date: it contains the date of registration;

• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour;

• Max energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour;

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to infer several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated with the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As shown in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (any different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is therefore high (near to 1).

[Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. The AIS Loc DMG links the local code (with the constraint "value in 1-1000"), city, network number and subnetwork code (with the constraints "value in 0-50", "they correspond to the same city" and "the subnet is contained in the correct network"). The AIS Cons DMG links the Energy Meter ID, the date, the hourly consumption attributes (En 1 am, ..., En 12 pm), the hourly maxima (Max 1 am, ..., Max 12 pm), the average energy consumption (with the constraint "average energy used equal to") and the maximum energy consumption (with the constraint "max equal to").]

Moreover, due to the relevance of the information about the regional localization of the electric meter, the weights associated with the completeness and the accuracy of the local code also have a high value.

A similar constraint exists on the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assign them a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of the information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
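A minimal sketch of how the constraints just described could be validated on a single consumption record; the record layout (hourly lists plus daily aggregates) and the numeric tolerance are assumptions of this example.

def check_consumption_record(avg_per_hour, max_per_hour, avg_day, max_day, tol=1e-6):
    # Returns the number of constraint violations for one record.
    violations = 0
    # Complex constraint: the hourly averages must stay consistent with the
    # declared average consumption per day.
    if abs(sum(avg_per_hour) / len(avg_per_hour) - avg_day) > tol:
        violations += 1
    # Complex constraint: the declared daily maximum must match the hourly maxima.
    if abs(max(max_per_hour) - max_day) > tol:
        violations += 1
    # Simple constraints: each hourly maximum cannot be smaller than the
    # corresponding hourly average.
    violations += sum(1 for a, m in zip(avg_per_hour, max_per_hour) if m < a)
    return violations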

Through these two schemas we described some characteristics of the information stored in the database, related to its meaning and its usage. In the following chapters we show some PPDM algorithms that exploit these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular data mining technique. Algorithms therefore exist that are designed to limit the malicious use of clustering algorithms, of classification algorithms, and of association rule algorithms.

In this report we focus on association rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association rule data mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association rule data mining can be applied in a very intuitive way to categorical databases. A categorical database can easily be synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of different types of databases with different characteristics.

• Association rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can easily be converted into association rules. This is, in perspective, very important, because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution in the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting data values; the other is based on data fuzzification, which hides the sensitive information by inserting unknown values into the database. Before giving more details on the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value is 0. Size is the number of 1 values appearing in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule is reduced below min_supp or min_conf, correspondingly.


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.
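A small sketch of the two measures over a set-based encoding of the bitmap notation; the encoding (each transaction as the set of its supported items) and the toy database are illustrative.

def support(db, itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, antecedent, consequent):
    # Conf(X => Y) = Supp(X u Y) / Supp(X).
    return support(db, antecedent | consequent) / support(db, antecedent)

db = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}]
print(support(db, {1, 2}))        # 0.75
print(confidence(db, {1}, {2}))   # 1.0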

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is achieved by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, meaning that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
Foreach rule r in Rh do
  1. T′_lr = {t ∈ D | t does not fully support rr and partially supports lr}
     foreach transaction t in T′_lr do
  2.   t.num_items = |I| − |lr ∩ t.values_of_items|
  3. Sort(T′_lr) in descending order of number of items of lr contained
     Repeat until (conf(r) < min_conf)
  4.   t = T′_lr[1]
  5.   set_all_ones(t.values_of_items, lr)
  6.   N_lr = N_lr + 1
  7.   conf(r) = N_r / N_lr
  8.   Remove t from T′_lr
  9. Remove r from Rh
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
Foreach rule r in Rh do
  1. Tr = {t ∈ D | t fully supports r}
     foreach transaction t in Tr do
  2.   t.num_items = count(t)
  3. Sort(Tr) in ascending order of transaction size
     Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
  4.   t = Tr[1]
  5.   j = choose_item(rr)
  6.   set_to_zero(j, t.values_of_items)
  7.   N_r = N_r − 1
  8.   supp(r) = N_r / |D|
  9.   conf(r) = N_r / N_lr
 10.   Remove t from Tr
 11. Remove r from Rh
End

Algorithm 3

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
Foreach rule r in Rh do
  1. Tr = {t ∈ D | t fully supports r}
     foreach transaction t in Tr do
  2.   t.num_items = count(t)
  3. Sort(Tr) in decreasing order of transaction size
     Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
  4.   t = Tr[1]
  5.   j = choose_item(r)
  6.   set_to_zero(j, t.values_of_items)
  7.   N_r = N_r − 1
  8.   Recompute N_lr
  9.   supp(r) = N_r / |D|
 10.   conf(r) = N_r / N_lr
 11.   Remove t from Tr
 12.   Remove r from Rh
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is, obviously, the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) × 100) / maxsup(lr) and maxconf(r) = (maxsup(r) × 100) / minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
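A sketch of the four fuzzified measures, under the assumption that each transaction maps items to 1, 0 or "?"; fractions are used here instead of percentages, and the encoding is illustrative.

def min_support(db, itemset):
    # Transactions that certainly support the itemset (all values are 1).
    return sum(1 for t in db if all(t.get(i) == 1 for i in itemset)) / len(db)

def max_support(db, itemset):
    # Transactions that support or could support it (values 1 or '?').
    return sum(1 for t in db if all(t.get(i) in (1, "?") for i in itemset)) / len(db)

def min_confidence(db, lr, rr):
    # minconf(r) = minsup(lr u rr) / maxsup(lr)
    return min_support(db, lr | rr) / max_support(db, lr)

def max_confidence(db, lr, rr):
    # maxconf(r) = maxsup(lr u rr) / minsup(lr)
    return max_support(db, lr | rr) / min_support(db, lr)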


Algorithm 4

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold.
OUTPUT: the database D modified by the hiding of the large itemsets in Lh.

Begin
 1. Sort(Lh) in descending order of size and support
 2. Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⟹ t supports Z}
Foreach Z in Lh
 3. Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp)
 4.   t = pop_from(TZ)
 5.   a = maximal support item in Z
 6.   aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
 7.     if (t ∈ TX) delete(t, TX)
 8.   delete(a, t, D)
End

Algorithm 5

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold.
OUTPUT: the database D modified by the hiding of the large itemsets in Lh.

Begin
 1. Sort(Lh) in descending order of support and size
Foreach Z in Lh do
 2. i = 0
 3. j = 0
 4. ⟨TZ⟩ = sequence of transactions supporting Z
 5. ⟨Z⟩ = sequence of items in Z
    Repeat until (supp(Z) < min_supp)
 6.   a = ⟨Z⟩[i]
 7.   t = ⟨TZ⟩[j]
 8.   aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
 9.     if (t ∈ TX) then delete(t, TX)
10.   delete(a, t, D)
11.   i = (i+1) modulo size(Z)
12.   j = j+1
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM.
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh.

Begin
 1. Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh
 2. TZ = {t ∈ D | t supports Z}
 3. Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST − SM)
 4.   Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
 5.   Update the supports of the affected itemsets
 6.   Update the database D
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to make the problem clear, we report that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). This behavior in a critical database could have a relevant impact. For example, as usual, let us consider the health database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, could choose a wrong therapy for a patient.


Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
Foreach r in Rh do
 1. Tr = {t ∈ D | t fully supports r}
 2. for each t in Tr count the number of items in t
 3. sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM)
 4.   Choose the first transaction t ∈ Tr
 5.   Choose the item j in rr with the highest minsup
 6.   Place a "?" mark in the place of j in t
 7.   Recompute the minsup(r)
 8.   Recompute the minconf(r)
 9.   Recompute the minconf of other affected rules
10.   Remove t from Tr
11. Remove r from Rh
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
Foreach r in Rh do
 1. T′_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
 2. for each t in T′_lr count the number of items of lr in t
 3. sort the transactions in T′_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM)
 4.   Choose the first transaction t ∈ T′_lr
 5.   Place a "?" mark in t for the items in lr that are not supported by t
 6.   Recompute the maxsup(lr)
 7.   Recompute the minconf(r)
 8.   Recompute the minconf of other affected rules
 9.   Remove t from T′_lr
10. Remove r from Rh
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way with any type of categorical database, without knowing anything about the use of the database, the relationships between the data, etc. More specifically, we note that the key point of this incorrect behavior lies in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We therefore introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database to be modified. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of association rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call items of this type Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of zero-impact items related to the target rules. If they exist, we do not perform the ASSET inspection, and we use one of these zero-impact items in order to hide the target rule.
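A minimal sketch of this improved selection step; dmg_attributes and dq_damage are hypothetical helpers standing for the ASSET inspection and the DQ damage simulation described above.

def zero_impact_items(asset, rule_items, dmg_attributes):
    # Items of the target rule that appear in no DMG of the ASSET.
    used = set(dmg_attributes(asset))
    return [item for item in rule_items if item not in used]

def select_item(asset, rule_items, dmg_attributes, dq_damage):
    # Prefer a zero-impact item; otherwise rank the items by the simulated
    # DQ damage (accuracy/consistency loss) and pick the least harmful one.
    zero = zero_impact_items(asset, rule_items, dmg_attributes)
    if zero:
        return zero[0]
    return min(rule_items, key=lambda item: dq_damage(asset, item))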


3.7 Data Quality Distortion Based Algorithm

In the distortion based algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by distorting some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted). In what we propose, assuming we work with categorical databases like the classical market basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this it is possible to use the well known APRIORI algorithm [5] that, given a database, extracts all the contained rules having a confidence over a threshold level (i.e. the confidence level over which a rule can be identified as relevant). Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.
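As a concrete illustration of this first phase, the sketch below extracts all rules above given support and confidence thresholds with an off-the-shelf Apriori implementation (here the open-source mlxtend library; the toy transactions and thresholds are arbitrary examples, and the selection of the sensitive rule remains a manual, context-dependent step):

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # Toy market-basket matrix: one row per transaction,
    # one boolean column per item.
    df = pd.DataFrame({"bread": [1, 1, 0, 1],
                       "milk":  [1, 1, 1, 0],
                       "beer":  [0, 1, 1, 1]}).astype(bool)

    frequent = apriori(df, min_support=0.5, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
    # The data owner would pick the sensitive rule(s) from this set.
    print(rules[["antecedents", "consequents", "support", "confidence"]])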

If the Zero Impact Itemset is empty, it means that no item contained in the rule exists that can be assumed to be non relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non zero impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The Sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

[Figure 3.6, flow chart of the DQDB algorithm: a sensitive rule is identified; the set of zero-impact items is computed; if the set is not empty an item is selected from it, otherwise each item in the rule is inspected on the Asset, its DQ impact (accuracy, consistency) is evaluated, the items are ranked by quality impact and the best one is selected; a transaction fully supporting the sensitive rule is then selected and its item value converted from 1 to 0, repeating until the rule is hidden.]

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(log(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden.
OUTPUT: the set (possibly empty) Zitem of zero impact items discovered.

Begin
  Zitem = empty
  Rsup = Rh
  Isup = list of the items contained in Rsup
  While (Isup ≠ empty)
  {
    res = 0
    select item from Isup
    res = Search(Asset, item)
    if (res == 0) then Zitem = Zitem + item
    Isup = Isup - item
  }
  return(Zitem)
End

Figure 3.7: Zero Impact Algorithm
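A direct Python transcription of Figure 3.7 might look as follows, assuming the Asset is summarized by the set of attribute names appearing in its DMGs (so that Search(Asset, item) reduces to a membership test):

    def zero_impact_items(asset_attributes, rule_items):
        # Figure 3.7: collect the items of the rule that appear in no DMG
        # of the Asset; modifying them cannot degrade the DQ.
        zitem = set()
        for item in rule_items:
            if item not in asset_attributes:  # Search(Asset, item) == 0
                zitem.add(item)
        return zitem

    # Example: only "beer" lies outside the Asset, hence it is zero impact.
    print(zero_impact_items({"bread", "milk"}, {"bread", "beer"}))  # {'beer'}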

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by quality impact.

Begin
  Qitem = empty
  Confa = Sup(Rh) / Sup(ant(Rh))
  Confp = Confa
  Nstep_if_ant = 0
  Nstep_if_post = 0
  While (Confa > Th) do
  {
    Nstep_if_ant++
    Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
  }
  While (Confp > Th) do
  {
    Nstep_if_post++
    Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
  }
  For each item in Rh do
  {
    if (item ∈ ant(Rh)) then N = Nstep_if_ant
    else N = Nstep_if_post
    For each AIS ∈ Asset do
    {
      node = recover_item(AIS, item)
      Accur_Cost = node.accuracy_Weight · N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Accur_Cost · AIS.Accur_weight) + (Constr_Cost · AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)
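As a worked example of the step computation, suppose Sup(Rh) = 40, Sup(ant(Rh)) = 50 and Th = 0.7, so the initial confidence is 40/50 = 0.8. If the modified item belongs to the antecedent, each sanitized transaction decreases both supports and the confidence after n steps is (40 - n)/(50 - n); if it belongs to the consequent, only Sup(Rh) decreases, giving (40 - n)/50. The following snippet (with these illustrative numbers) reproduces the two loops of Figure 3.8:

    def steps_if_antecedent(sup_r, sup_ant, th):
        # Modifying an antecedent item removes support from both the rule
        # and its antecedent.
        n = 0
        while (sup_r - n) / (sup_ant - n) >= th:
            n += 1
        return n

    def steps_if_consequent(sup_r, sup_ant, th):
        # Modifying a consequent item removes support from the rule only.
        n = 0
        while (sup_r - n) / sup_ant >= th:
            n += 1
        return n

    print(steps_if_antecedent(40, 50, 0.7))  # 17 transactions to modify
    print(steps_if_consequent(40, 50, 0.7))  # 6 transactions to modify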


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity of O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and thus it takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| · O(log(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; the target database D.
OUTPUT: the sanitized database.

Begin
  Zitem = DQDB_Zero_Impact(Asset, Rh)
  if (Zitem ≠ empty) then item = random_sel(Zitem)
  else
  {
    Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
    item = best(Impact_set)
  }
  Tr = {t ∈ D | t fully supports Rh}
  While (Rh.Conf > Threshold) do
    sanitize(Tr, item)
End

Figure 3.9: The Distortion Based Sanitization Algorithm
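A hedged Python sketch of the core loop of Figure 3.9 follows, over a toy representation of the database as a list of item-to-value dictionaries (the rule is given as a pair of item sets; the helper names are illustrative):

    def hide_rule_by_distortion(db, antecedent, consequent, item, threshold):
        # Distortion-based sanitization: set `item` to 0 in transactions
        # fully supporting the rule until conf(Rh) drops below `threshold`.
        def supports(t, items):
            return all(t.get(i, 0) == 1 for i in items)

        supporting = [t for t in db if supports(t, antecedent | consequent)]
        while supporting:
            sup_rule = sum(supports(t, antecedent | consequent) for t in db)
            sup_ant = sum(supports(t, antecedent) for t in db)
            if sup_ant == 0 or sup_rule / sup_ant < threshold:
                break  # rule hidden
            t = supporting.pop()
            t[item] = 0  # the distortion step: 1 -> 0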

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Figure 3.10, flow chart of the BBDB algorithm: identical in structure to Figure 3.6, except that the DQ impact is evaluated on the pair (Completeness, Consistency) and the value of the selected item is converted to the unknown symbol "?" instead of 0.]

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
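In code, the only change with respect to the distortion sketch given earlier is the sanitization step itself, which can therefore be treated as a pluggable function (a sketch, with the string "?" standing for the meaningless symbol):

    def sanitize_distortion(transaction, item):
        # DQDB step: flip the supporting value from 1 to 0.
        transaction[item] = 0

    def sanitize_blocking(transaction, item):
        # BBDB step: replace the value with the unknown symbol, so the
        # transaction can no longer be counted with certainty.
        transaction[item] = "?"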

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


4. We reapply the process for all the rules contained in the set of sensitive rules.

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the thresholds Th.
OUTPUT: the set Qitem ordered by quality impact.

Begin
  Qitem = empty
  Confa = Sup(Rh) / Sup(ant(Rh))
  Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
  Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
  For each item in Rh do
  {
    if (item ∈ ant(Rh)) then N = Nstep_ant
    else N = Nstep_post
    For each AIS ∈ Asset do
    {
      node = recover_item(AIS, item)
      Complet_Cost = node.Complet_Weight · N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost · AIS.Complet_weight) + (Constr_Cost · AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that in some sanitizations items are modified that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform in a burst the sanitization of all the rules. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized; they are removed from the sensitive set once hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items.
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences.

Begin
  Rsup = Rs
  while (Rsup ≠ empty) do
  {
    select r from Rsup
    for (i = 0; i ≤ transaction_max_size; i++) do
      if (item[i] ∈ r) then
      {
        item[i].count++
        item[i].list = item[i].list ∪ r
      }
    Rsup = Rsup - r
  }
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs; the list of the database items.
OUTPUT: the list of the Zero Impact Items.

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = empty
  while (Rsup ≠ empty) and (Isup ≠ empty) do
  {
    select r from Rsup
    for each (item ∈ r) do
    {
      if (item ∈ Isup) then
      {
        if (item ∉ Asset) then Zitem = Zitem + item
        Isup = Isup - item
      }
    }
    Rsup = Rsup - r
  }
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| · |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
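In Python, this classification is conveniently expressed with a counter; the sketch below (with a toy rule set, and rules reduced to plain item sets) mirrors Figure 3.12:

    from collections import Counter

    def classify_items(sensitive_rules):
        # Count, for every item, how many sensitive rules contain it, and
        # remember the identifiers of those rules for the later steps.
        count = Counter()
        supported = {}
        for idx, rule in enumerate(sensitive_rules):
            for item in rule:
                count[item] += 1
                supported.setdefault(item, []).append(idx)
        return count.most_common(), supported  # items sorted by occurrences

    rules = [{"a", "b"}, {"b", "c"}, {"b", "d"}]
    print(classify_items(rules)[0])  # [('b', 3), ...]; 'b' supports all rules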

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D.
OUTPUT: a partially sanitized database PS.

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i ≤ |Izm|; i++) do
  {
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ empty) and (Rss ≠ empty) do
    {
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
      {
        Rss = Rss - r
        Rs = Rs - r
      }
    }
  }
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
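The following sketch captures the burst behavior of Figure 3.14, with rules reduced to plain item sets and the sanitization and hiding test passed in as functions (so that either the distortion or the blocking step defined earlier can be plugged in; the is_hidden predicate, assumed here, would check support and confidence against the thresholds):

    def multi_rule_zero_impact_hiding(db, rules, zitems, sanitize, is_hidden):
        # For each zero-impact item, sanitize transactions supporting its
        # rules until those rules are hidden; hidden rules leave the pool.
        for item in zitems:
            rss = [r for r in rules if item in r]
            tss = [t for t in db
                   if any(all(t.get(i, 0) == 1 for i in r) for r in rss)]
            while tss and rss:
                t = tss.pop()
                sanitize(t, item)
                rss = [r for r in rss if not is_hidden(db, r)]
            rules = [r for r in rules if not is_hidden(db, r)]
        return rules  # sensitive rules still to be hidden (ideally none)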

3.12 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item contains the list of the supported rules.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated to every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion. We consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items.
OUTPUT: the sanitized database SDB.

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
  {
    for each rule ∈ item.list do
    {
      N = compute_step(item, thresholds)
      For each AIS ∈ Asset do
      {
        node = recover_item(AIS, item)
        item.list.rule.cost = item.list.rule.cost + Compute_Cost(node, N)
      }
    }
    item.max_cost = max_cost(item)
  }
  Sort Iml by max cost and number of rules supported
  While (Rs ≠ empty) do
  {
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ empty) and (Rss ≠ empty) do
    {
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
      {
        Rss = Rss - r
        Rs = Rs - r
      }
    }
    Iml = Iml - Iml[1]
  }
End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds.
OUTPUT: a sanitized database PS.

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ empty) then MR_zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ empty) then MR_low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
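Putting the pieces together, a minimal driver mirroring Figure 3.16 could be structured as follows. It composes the sketches given earlier in this chapter and assumes an analogous multi_rule_low_impact_hiding function, together with the is_hidden predicate already mentioned; none of these names come from the report itself.

    def sanitize_database(db, sensitive_rules, asset_attributes, is_hidden):
        # Pre-hiding phase: occurrence ranking and global zero-impact set.
        ranked, _ = classify_items(sensitive_rules)
        zitems = [item for item, _ in ranked if item not in asset_attributes]
        # Multi-rule zero-impact hiding, then low-impact hiding if needed.
        if zitems:
            sensitive_rules = multi_rule_zero_impact_hiding(
                db, sensitive_rules, zitems, sanitize_distortion, is_hidden)
        if sensitive_rules:
            multi_rule_low_impact_hiding(db, sensitive_rules, is_hidden)
        return db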

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we have then proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM Algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases: a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principles of Database Systems pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference on Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou and H L Pazer Modelling Data and Process Quality in Multi Input Multi Output Information Systems Management Science Vol 31 Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955


[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset counting and implication rules for market basket data In Proc of the ACM SIGMOD International Conference on Management of Data year 1997 ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass): Theory and Results Advances in Knowledge Discovery and Data Mining AAAI Press/MIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining: An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM Trans Database Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control J Am Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support In Proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Computer Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databases Computer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Reading Mass 1982

[26] V Dhar Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems Information Systems Volume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclosure Control Methods for Microdata Confidentiality Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies pp 113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Holland year 2002

[28] P Domingos and M Pazzani Beyond independence: Conditions for the optimality of the simple Bayesian classifier Proceedings of the Thirteenth International Conference on Machine Learning pp 105-112 San Francisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly 1999

[30] P O Duda and P E Hart Pattern Classification and Scene Analysis Wiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vs data utility: The R-U confidentiality map Tech Rep No 121 National Institute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in vertically partitioned databases In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizes shapes and densities in noisy high dimensional data In Proceedings of the SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data Mining SIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACM Press


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architecture Advances in Neural Information Processing Systems 2 pp 524-532 Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectures on Physics v I Reading Massachusetts Addison-Wesley Publishing Company year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines Proc Tenth ACM Symposium on Theory of Computing (1978) pp 114-118 ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discovery in Databases: An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturing testing IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEE Press

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE: An efficient clustering algorithm for large databases In Proceedings of the ACM SIGMOD Conference pp 73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK: A robust clustering algorithm for categorical attributes In Proceedings of the 15th ICDE pp 512-521 Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining: Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Jim Gray Series Editor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidate generation In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data Dallas Texas USA May 2000 ACM Press

[48] M A Hanson and R L Brekke Workload management expert system - combining neural networks and rule-based programming in an operational application In Proceedings Instrument Society of America pp 1721-1726 year 1988


[49] J Hartigan and M Wong Algorithm AS136: A k-means clustering algorithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large multimedia databases with noise In Proceedings of the 4th ACM SIGKDD pp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery pp 24-31 year 2002 IEEE Educational Activities Department

[54] G Karypis E Han and V Kumar CHAMELEON: A hierarchical clustering algorithm using dynamic modeling COMPUTER 32 pp 68-75 year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data: An Introduction to Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The EM algorithm for graphical association models with missing data Computational Statistics and Data Analysis 19 (2) pp 191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion Detection In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource: properties implications and prescriptions Sloan Management Review Cambridge Vol 40 Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal of Cryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journal of the American Society for Information Science 48 (3) pp 254-269 year 1997


[62] M Masera I Nai Fovino and R Sgnaolin A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure 29th ESReDA Seminar "Systems Analysis for a More Secure World" year 2005

[63] G McLachlan and K Basford Mixture Models: Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruning In Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology PhD thesis Library and Information Studies Rutgers New Brunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system for information downgrading In Proceedings of the 5th Joint Conference on Information Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in Privacy Preserving Data Mining ACM SIGKDD 3rd Workshop on Data Mining Standards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent Itemset Mining Proceedings of the IEEE international conference on Privacy security and data mining pp 43-54 year 2002 Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering by Data Transformation In Proceedings of the 18th Brazilian Symposium on Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41 Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology for compromise of confidential information in statistical databases ACM Trans Database Syst 12 4 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithm for Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strong rules Knowledge Discovery in Databases pp 229-238 AAAI/MIT Press year 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C4.5: Programs for Machine Learning Morgan Kaufmann year 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp 81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Data Mining and Knowledge Discovery vol 4 n 4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Categorical and Numeric Attributes Proc of International Conference on Data Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in Association Rule Mining In Proceedings of the 28th International Conference on Very Large Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning in probabilistic networks with hidden variables In International Joint Conference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kaufmann

[83] D E Rumelhart G E Hinton and R J Williams Learning internal representations by error propagation Parallel distributed processing: Explorations in the microstructure of cognition Vol 1: Foundations pp 318-362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to preserve confidentiality of business statistics In Proceedings of the 2nd International Workshop on Statistical Database Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm for mining association rules in large databases In Proceedings of the Conference on Very Large Databases Zurich Switzerland September 1995 Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases Comput J 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 (July and October) 1948 pp 379-423 and pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communication University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases: characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK: An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach to Rule Induction from databases IEEE Transactions On Knowledge And Data Engineering vol 3 n 4 August 1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Proceedings of the 21st International Conference on Very Large Data Bases pp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi and D P Ballou Examining Data Quality Comm of the ACM Vol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to Data Disclosure Problems Research in Official Statistics vol 4 pp 7-22 year 2001

[96] M Trottini Decision models for data disclosure limitation Carnegie Mellon University Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University Codmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Mining in Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin and Y Theodoridis State-of-the-art in Privacy Preserving Data Mining SIGMOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E Dasseni Association Rule Hiding IEEE Transactions on Knowledge and Data Engineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML: The Snob program In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence pp 37-44 UNE World Scientific Publishing Co Armidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy: An Anthology Cambridge University Press year 1984

[103] G J Walters Human Rights in an Information Age: A Philosophical Analysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in Ontological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95 Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy: what Data Quality Means to Data Consumers Journal of Management Information Systems Vol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure control Lecture Notes in Statistics Vol 155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signature Recognition 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press


  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 11: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

9

perturbation or heuristic-based algorithms are applied without taking intoaccount the meaning of the information stored into the database

bull Implicit Constraints Violation Every database is characterized bysome implicit constraints derived from the external world that are notdirectly represented in the database in terms of structure and relation (theexample of medicines A and B presented in the previous section could bean example) Transformations that do not consider these constraints riskto compromise the database introducing inconsistencies in the database

The results of these experiments magnify the fact that in the context ofprivacy preserving data mining only a little portion of the parameters showedin table 11 are of interest More in details are relevant the parameters allowingto capture a possible variation of meaning or that are indirectly related withthe meaning of the information stored in a target database Therefore weidentify as most appropriate DQ parameters for PPDM Algorithms the followingdimensions

bull Accuracy it measures the proximity of a sanitized value arsquo to the originalvalue a In an informal way we say that a tuple t is accurate if thesanitization has not modified the attributes contained in the tuple t

bull Completeness it evaluates the percentage of data from the original data-base that are missing from the sanitized database Therefore a tuple t iscomplete if after the sanitization every attribute is not empty

bull Consistency it is related to the semantic constraints holding on the dataand it measures how many of these constraints are still satisfied after thesanitization

We now present the formal definitions of those parameters for use in theremainder of the discussion

Let OD be the original database and SD be the sanitized database resultingfrom the application of the PPDM algorithm Without loosing generality andin order to make simpler the following definitions we assume that OD (and con-sequently SD) be composed by a single relation We also adopt the positionalnotation to denote attributes in relations Thus let odi (sdi) be the i-th tuplein OD (SD) then odik (sdik) denotes the kth attribute of odi (sdi) Moreoverlet n be the total number of the attributes of interest we assume that attributesin positions 1 m (m le n) are the primary key attributes of the relation

Definition 2

Let sdj be a tuple of SD We say that sdj is Accurate if notexistodi isin OD suchthat ((odik = sdjk)forallk = 1m and exist(odif 6= sdjf ) (sdjf 6= NULL) f = m +1 n))Definition 3

A sdj is Complete if (existodi isin OD such that (odik = sdjk)forallk = 1m) and

10 CHAPTER 1 DATA QUALITY CONCEPTS

(notexist(sdjf = NULL) f = m + 1 n)

Where we intend for ldquoNULLrdquo value a value without meaning in a well definedcontext

Let C be the set of the constraints defined on database OD in what followswe denote with cij the jth constraint on attribute i We assume here constraintson a single attribute but it is easily possible to extend the measure to complexconstraints

Definition 4

A tuple sdk is Consistent if notexistcij isin C such that cij(sdki) = false i = 1nStarting from these definitions it is possible to introduce three metrics (reportedin Table 12) that allow us to measure the lack of accuracy completeness andconsistency More specifically we define the lack of accuracy as the proportionof non accurate items (ie the amount of items modified during the sanitiza-tion) with respect to the number of items contained in the sanitized databaseSimilarly the lack of completeness is defined as the proportion of non completeitems (ie the items substituted with a NULL value during the sanitization)with respect to the total number of items contained in the sanitized databaseThe consistency lack is simply defined as the number of constraints violationin SD due to the sanitization phase

Name Short Explanation Expression

Accuracy Lack The proportion of non accurate items in SD λSD = |SDnacc||SD|

Completeness Lack The proportion of non complete items in SD ϑSD = |SDnc||SD|

Consistency Lack The number of constraint violations in SD $SD = Nc

Table 12 The three parameters of interest in PPDM DQ evaluation In theexpressions SDnacc denoted the set of not accurate items SDnc denotes the setof not complete items and NC denotes the number of constraint violations

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sani-tized database These parameters are however not sufficient to help us in theconstruction of a new PPDM algorithm based on the preservation of DQ Asexplained in the introduction of this report the main problem of actual PPDMalgorithms is that they are not aware of the relevance of the information storedin the database nor of the relationships among these information

It is necessary to provide then a formal description that allow us to magnifythe aggregate information of interest for a target database and the relevance ofDQ properties for each aggregate information (for some aggregate informationnot all the DQ properties have the same relevance) Moreover for each aggre-gate information it is necessary to provide a structure in order to describe therelevance (in term of consistency completeness and accuracy level required) atthe attribute level and the constraints the attributes must satisfy The Infor-mation Quality Model (IQM) proposed here addresses this requirement

In the following we give a formal definition of Data Model Graph (DMG)(used to represent the attributes involved in an aggregate information and theirconstraints) and Aggregation Information Schema (AIS) Moreover we give adefinition for an Aggregate information Schema Set (ASSET) Before giving thedefinition of DMG AIS and ASSET we introduce some preliminary concepts

Definition 5

An Attribute Class is defined as the tuple ATC =lt nameAWAVCWCVCSV Slink gt

where

bull Name is the attribute id

bull AW is the accuracy weight for the target attribute

bull AV is the accuracy value

bull CW is the completeness weigh for the target attribute

bull CV is the completeness value

11

12 CHAPTER 2 THE DATA QUALITY SCHEME

bull CSV is the consistency value

bull Slink is list of simple constraints

An attribute class represents an attribute of a target database schema in-volved in the construction of a certain aggregate information for which we wantto evaluate the data quality The attributes however have a different relevancein an aggregate information For this reason a set of weights is associated toevery attribute class specifying for example if for a certain attribute a lack ofaccuracy must be considered as relevant damage or not

The attributes are the bricks of our structure However a simple list of at-tributes in not sufficient In fact in order to evaluate the consistency propertywe need to take in consideration also the relationships between the attributesThe relationships can be represented by logical constraints and the validationof these constraints give us a measure of the consistency of an aggregate infor-mation As in the case of the attributes also for the constraints we need tospecify their relevance (ie a violation of a constraint cause a negligible or asevere damage) Moreover there exists constraints involving at the same timemore then two attributes In what follows the definitions of simple constraintand complex constraint are showed

Definition 6

A Simple Constraint Class is defined as the tuple SCC =lt nameConstr CWClink CSV gt

where

bull Name is the constraint id

bull Constraint describes the constraint using some logic expression

bull CW is the weigh of the constraint It represents the relevance of thisconstraint in the AIS

bull CSV is the number of violations to the constraint

bull Clink it is the list of complex constraints defined on SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC =lt nameOperator

CWCSV SCC link gt where

bull Name is the Complex Constraint id

bull Operator is the ldquoMergingrdquo operator by which the simple constraints areused to build the complex one

bull CW is the weigh of the complex constraint

bull CSV is the number of violations

13

bull SCC link is the list of all the SCC that are related to the CCC

We have now the bricks and the glue of our structure A DMG is a collectionof attributes and constraints allowing one to describe an aggregate informationMore formally let D be a database

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features

bull A set of nodes NA where each node is an Attribute Class

bull A set of nodes SCC where each node describes a Simple Constraint Class

bull A set of nodes CCC where each node describes a Complex Constraint Class

bull A set of direct edges LNjNk LNjNk isin ((NAXSCC) cup (SCCXCCC) cup(SCCXNA) cup (CCCXNA))

As is the case of attributes and constraints even at this level there exists ag-gregate information more relevant than other We define then an AIS as follows

Definition 9

An AIS φ is defined as a tuple lt γ ξ λ ϑ$WAIS gt where γ is a name ξ

is a DMG λ is the accuracy of AIS ϑ is the completeness of AIS $ is theconsistency of AIS and WAIS represent the relevance of AIS in the databaseIn a database there are usually more than an aggregate information for whichone is interested to measure the quality

Definition 9

With respect to D we define ASSET (Aggregate information Schema Set) asthe collection of all the relevant AIS of the database

The DMG completely describes the relations between the different data itemsof a given AIS and the relevance of each of these data respect to the data qualityparameter It is the ldquoroad maprdquo that is used to evaluate the quality of a sanitizedAIS

As it is probably obvious the design of a good IQM schema and more ingeneral of a good ASSET it is fundamental to characterize in a correct way thetarget database

201 An IQM Example

In order to magnify which is the real meaning of a data quality model we givein this section a brief description of a possible context of application and wethen describe an example of IQM Schema

The Example Context

Modern power grid starting today to use Internet and more generally publicnetwork in order to monitor and to manage remotely electric apparels (local

14 CHAPTER 2 THE DATA QUALITY SCHEME

Constraint13

Data Attribute13

Accuracy13

Completeness13

Operator13

Constraint13Weight13

DMG13

Figure 21 Example of DMG

energy stations power grid segments etc) [62] In such a context there exist alot of problems related to system security and to the privacy preservation

More in details we take as example a very actual privacy problem causedby the new generation of home electricity meters In fact such a type of newmeters are able to collect a big number of information related to the homeenergy consumption Moreover they are able to send the collected informationto a central remote database The information stored in such database can beeasily classified as sensitive

For example the electricity meters can register the energy consumption perhour and per day in a target house From the analysis of this information itis easy to retrieve some behavior of the people living in this house (ie fromthe 9 am to 1 pm the house is empty from the 1 pm to 230 pm some-one is at home from the 8 pm the whole family is at home etc) Thistype of information is obviously sensitive and must be protected On the otherhand information like the maximum energy consumption per family the av-erage consumption per day the subnetwork of pertinence and so on have anhigh relevance for the analysts who want to correctly analyze the power gridto calculate for example the amount of power needed in order to guarantee anacceptable service over a certain energy subnetwork

This is a good example of a situation in which the application of a privacypreserving techniques without knowledge about the meaning and the relevance

15

13

ASSET13

AIS13

AIS13AIS13

AIS13

Accuracy13Completeness13

Consistency13

Figure 22 Example ASSET to every AIS contained in the Asset the dimensionsof Accuracy Completeness and Consistency are associated Moreover to everyAIS the proper DMG is associated To every attribute and to every constraintcontained in the DMG the three local DQ parameters are associated

of the data contained in the database could cause a relevant damage (ie wrongpower consumption estimation and consequently blackout)

An IQM Example

The electric meters database contains several information related to energy con-sumption average electric frequency fault events etc It is not our purpose todescribe here the whole database In order to give an example of IQM schemawe describe two tables of this database and two DMG of a possible IQM schema

In our example we consider the tables related to the localization informationof the electric meter and the information related to the per day consumptionregistered by a target electric meter

16 CHAPTER 2 THE DATA QUALITY SCHEME

More in details the two table we consider have the following attributesLocalization Table

bull Energy Meter ID it contains the ID of the energy meter

bull Local Code it is the code of the cityregion in which the meter is

bull City it contains the name of the city in which the home hosting the meteris

bull Network it is the code of the electric network

bull Subnetwork it is the code of the electric subnetwork

Per Day Data consumption table

bull Energy Meter ID it contains the ID of the energy meter

bull Average Energy Consumption it contains the average energy used bythe target home during a day

bull Date It contains the date of registration

bull Energy consumption per hour it is a vector of attributes containing theaverage energy consumption per hour

bull Max Energy consumption per hour it is a vector of attributes contain-ing the maximum energy consumption per hour

bull Max Energy Consumption it contains the maximum energy consump-tion per day measured for a certain house

As we have already underlined from this information it is possible to guessseveral sensitive information related to the behavior of the people living in atarget house controlled by an electric meter For this reason it is necessary toperform a sanitization process in order to guarantee a minimum level of privacy

However a sanitization without knowledge about the relevance and themeaning of data can be extremely dangerous In figure 23 two DMG schemasassociated to the data contained in the previously described tables are showedIn what follows we give a brief description about the two schema

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table Asit is possible to see in figure 23 there exist some constraints associated with thedifferent items contained in this table For example we require the local code bein a range of values between 1 and 1000 (every different code is meaningless andcannot be interpreted moreover a value out of this range reveals to malicioususer that the target tuple is sanitized) The weight of this constraint is thenhigh (near to 1)

Figure 2.3: Example of two DMG schemas (AIS Loc and AIS Cons) associated to a National Electric Power Grid database. In AIS Loc, the Local Code carries the range constraint "value 1-1000", the Subnetwork Code the constraint "value 0-50", and further constraints state that the localization items "correspond to the same city" and that "the subnet is contained in the correct network" (Network Number). In AIS Cons, the Energy Meter ID, the Date and the per-hour values (En 1 am ... En 12 pm, Max 1 am ... Max 12 pm) are linked to the Average Energy Consumption and to the Max Energy Consumption by the constraints "average energy used equal to" and "max equal to".

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assume a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy Consumption per Hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
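A minimal sketch of these coherence checks (our illustration; the list representation and the numeric tolerance are assumptions):

def coherent(avg_day, avg_per_hour, max_per_hour, max_day, eps=1e-6):
    # complex constraint: hourly averages must stay consistent with the daily average
    avg_ok = abs(sum(avg_per_hour) / len(avg_per_hour) - avg_day) <= eps
    # complex constraint (assumed analogous): the daily maximum matches the hourly maxima
    max_ok = abs(max(max_per_hour) - max_day) <= eps
    # simple constraints: each hourly maximum cannot be smaller than the hourly average
    dom_ok = all(m >= a for m, a in zip(max_per_hour, avg_per_hour))
    return avg_ok and max_ok and dom_ok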

By the use of these two schemas we described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to counter the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• The results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important, because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, one for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value is 0. Size is the number of 1 values which appear in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to rules that are both frequent and strong. If a frequent and strong rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp or min_conf, correspondingly.
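To make these definitions concrete, the following sketch (our own illustration, with transactions encoded as item-to-value dictionaries) computes the support and the confidence of a rule:

def supp(itemset, db):
    # fraction of transactions supporting every item of the itemset
    return sum(all(t.get(i) == 1 for i in itemset) for t in db) / len(db)

def conf(ant, cons, db):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return supp(ant | cons, db) / supp(ant, db)

db = [{"a": 1, "b": 1, "c": 0},
      {"a": 1, "b": 1, "c": 1},
      {"a": 1, "b": 0, "c": 1}]
print(supp({"a", "b"}, db))    # 0.666...
print(conf({"a"}, {"b"}, db))  # 0.666...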


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds (MST or MCT), which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.
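A tiny worked example, with invented numbers, may clarify the two options: suppose Supp(X ⇒ Y) = 4/10 and Supp(X) = 5/10, so Conf(X ⇒ Y) = 0.8. Deleting a consequent item in two of the transactions supporting the rule (option 1) lowers Supp(X ⇒ Y) to 2/10 while leaving Supp(X) untouched, giving a confidence of 0.2/0.5 = 0.4; raising Supp(X) to 8/10 by completing the antecedent in three partially supporting transactions (option 2) leaves Supp(X ⇒ Y) at 4/10 and gives 0.4/0.8 = 0.5.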

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'_lr do
2.      t.num_items = |I| − |lr ∩ t.values_of_items|
3.  Sort(T'_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf)
    {
4.      t = T'_lr[1]
5.      set_all_ones(t.values_of_items, lr)
6.      N_lr = N_lr + 1
7.      conf(r) = N_r / N_lr
8.      Remove t from T'_lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2

INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.      t.num_items = count(t)
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
    {
4.      t = Tr[1]
5.      j = choose_item(rr)
6.      set_to_zero(j, t.values_of_items)
7.      N_r = N_r − 1
8.      supp(r) = N_r / |D|
9.      conf(r) = N_r / N_lr
10.     Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.      t.num_items = count(t)
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
    {
4.      t = Tr[1]
5.      j = choose_item(r)
6.      set_to_zero(j, t.values_of_items)
7.      N_r = N_r − 1
8.      Recompute N_lr
9.      supp(r) = N_r / |D|
10.     conf(r) = N_r / N_lr
11.     Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support value, and similarly for the maximum confidence. Given a rule r:

minconf(r) = minsup(r) × 100 / maxsup(lr)    and    maxconf(r) = maxsup(r) × 100 / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
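A sketch of these interval computations over fuzzified transactions (our illustration, with "?" as the unknown symbol):

def minsup(itemset, db):
    # transactions that certainly support the itemset
    return sum(all(t.get(i) == 1 for i in itemset) for t in db) / len(db)

def maxsup(itemset, db):
    # transactions that support or could support the itemset
    return sum(all(t.get(i) in (1, "?") for i in itemset) for t in db) / len(db)

def visibility(itemset, db, mst):
    if maxsup(itemset, db) < mst:
        return "hidden"
    if minsup(itemset, db) < mst:
        return "visible with an uncertainty level"
    return "visible"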


Algorithm 4

INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {T_Z | Z ∈ Lh and ∀t ∈ D: t ∈ T_Z ⇒ t supports Z}
Foreach Z in Lh
{
3.  Sort(T_Z) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp)
    {
4.      t = pop_from(T_Z)
5.      a = maximal support item in Z
6.      aS = {X ∈ Lh | a ∈ X}
        Foreach X in aS
7.          if (t ∈ T_X) delete(t, T_X)
8.      delete(a, t, D)
    }
}
End

Algorithm 5

INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do
{
2.  i = 0
3.  j = 0
4.  <T_Z> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp)
    {
6.      a = <Z>[i]
7.      t = <T_Z>[j]
8.      aS = {X ∈ Lh | a ∈ X}
        Foreach X in aS
9.          if (t ∈ T_X) then delete(t, T_X)
10.     delete(a, t, D)
11.     i = (i + 1) modulo size(Z)
12.     j = j + 1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or decreasing its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or in the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and the minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh
{
2.  T_Z = {t ∈ D | t supports Z}
3.  Sort the transactions in T_Z in ascending order of transaction size
    Repeat until (minsup(Z) < MST − SM)
    {
4.      Place a ? mark for the item with the largest minimum support of Z in the next transaction in T_Z
5.      Update the supports of the affected itemsets
6.      Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to clarify the problem here: during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). This behavior, in a critical database, could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor that, on the basis of these rules, can choose a wrong therapy for a patient.


Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM)
    {
4.      Choose the first transaction t ∈ Tr
5.      Choose the item j in rr with the highest minsup
6.      Place a ? mark in place of j in t
7.      Recompute minsup(r)
8.      Recompute minconf(r)
9.      Recompute the minconf of the other affected rules
10.     Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1.  T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T'_lr count the number of items of lr in t
3.  sort the transactions in T'_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM)
    {
4.      Choose the first transaction t ∈ T'_lr
5.      Place a ? mark in t for the items in lr that are not supported by t
6.      Recompute maxsup(lr)
7.      Recompute minconf(r)
8.      Recompute the minconf of the other affected rules
9.      Remove t from T'_lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the Asset schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database Asset. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because they were judged, in the Asset description phase, not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection and we use one of these Zero Impact Items in order to hide the target rule. A sketch of this selection step is given below.
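The following sketch illustrates the selection step; representing the Asset as a per-item damage estimate is our simplification, and all the names are assumptions:

def select_item(rule_items, asset_damage):
    # Zero Impact Items: items not referenced by any DMG of the Asset
    zero_impact = [i for i in rule_items if i not in asset_damage]
    if zero_impact:
        return zero_impact[0]        # no Asset inspection needed
    # otherwise simulate the damage and pick the least harmful item
    return min(rule_items, key=lambda i: asset_damage[i])

damage = {"local_code": 0.9, "subnetwork": 0.6, "city": 0.3}  # toy estimates
print(select_item({"city", "subnetwork"}, damage))            # -> city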


3.7 Data Quality Distortion Based Algorithm

In the distortion based algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting (partially or fully, depending on the strategy adopted) the rule to be hidden. In what we propose, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well known Apriori algorithm [5], which, given a database, extracts all the rules contained in it that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact itemset is empty, it means that no item contained in the rule exists that can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm); then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if items of this type exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and it selects the one with the lowest DQ effect.

¹A Market Basket database is a database in which each transaction represents the products acquired by a customer.

²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


Figure 3.6: Flow chart of the DQDB Algorithm. (The chart identifies a sensitive rule, computes the set of Zero Impact Items and, if it is empty, ranks the items of the rule by DQ impact (Accuracy, Consistency) and selects the best one; a transaction fully supporting the sensitive rule is then repeatedly selected and the item value converted from 1 to 0 until the rule is hidden.)

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of Zero Impact Items discovered

1   Begin
2     Zitem = ∅
3     Rsup = Rh
4     Isup = list of the items contained in Rsup
5     While (Isup ≠ ∅)
6     {
7       res = 0
8       Select item from Isup
9       res = Search(Asset, item)
10      if (res == 0) then Zitem = Zitem + item
11      Isup = Isup − item
12    }
13    return(Zitem)
14  End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Confp = Confa
5     Nstep_if_ant = 0
6     Nstep_if_post = 0
7     While (Confa > Th) do
8     {
9       Nstep_if_ant++
10      Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
11    }
12    While (Confp > Th) do
13    {
14      Nstep_if_post++
15      Confp = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
16    }
17    For each item in Rh do
18    {
19      if (item ∈ ant(Rh)) then N = Nstep_if_ant
20      else N = Nstep_if_post
21      For each AIS ∈ Asset do
22      {
23        node = recover_item(AIS, item)
24        Accur_Cost = node.accuracy_Weight · N
25        Constr_Cost = Constr_surf(N, node)
26        item.impact = item.impact + (Accur_Cost · AIS.Accur_weight) + (Constr_Cost · AIS.Constr_weight)
27      }
28    }
29    sort_by_impact(Items)
30    Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| · O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

1   Begin
2     Zitem = DQDB_Zero_Impact(Asset, Rh)
3     if (Zitem ≠ ∅) then item = random_sel(Zitem)
4     else
5     {
6       Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7       item = best(Impact_set)
8     }
9     Tr = {t ∈ D | t fully supports Rh}
10    While (Rh.Conf > Threshold) do
11      sanitize(Tr, item)
12  End

Figure 3.9: The Distortion Based Sanitization Algorithm
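A compact runnable sketch of this distortion loop (our illustration, with transactions as dictionaries and all names assumed):

def _supp(itemset, db):
    return sum(all(t.get(i) == 1 for i in itemset) for t in db) / len(db)

def _conf(ant, cons, db):
    sa = _supp(ant, db)
    return 0.0 if sa == 0 else _supp(ant | cons, db) / sa

def sanitize_distortion(db, ant, cons, item, threshold):
    # transactions fully supporting the rule ant => cons
    supporters = [t for t in db if all(t.get(i) == 1 for i in ant | cons)]
    # distort the chosen item (1 -> 0) until the confidence drops below threshold
    while supporters and _conf(ant, cons, db) >= threshold:
        supporters.pop()[item] = 0
    return db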

3.8 Data Quality Based Blocking Algorithm

In the blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and to the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

Figure 3.10: Flow chart of the BBDB Algorithm. (It is identical to the flow of Figure 3.6, except that the DQ impact is evaluated on the pair (Completeness, Consistency) and the selected item value is converted to "?".)

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
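Under the same assumptions as the distortion sketch above, the blocking variant only changes the written value and the stopping test (the minimum confidence of the rule, computed with the fuzzification interval semantics):

def sanitize_blocking(db, ant, cons, item, mct, sm):
    def mins(s):   # certain support: only 1's count
        return sum(all(t.get(i) == 1 for i in s) for t in db) / len(db)
    def maxs(s):   # possible support: '?' counts as well
        return sum(all(t.get(i) in (1, "?") for i in s) for t in db) / len(db)
    def minconf():
        m = maxs(ant)
        return 0.0 if m == 0 else mins(ant | cons) / m
    supporters = [t for t in db if all(t.get(i) == 1 for i in ant | cons)]
    # place '?' until the minimum confidence falls below MCT - SM
    while supporters and minconf() >= mct - sm:
        supporters.pop()[item] = "?"
    return db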

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized; D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5     Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6     For each item in Rh do
7     {
8       if (item ∈ ant(Rh)) then N = Nstep_ant
9       else N = Nstep_post
10      For each AIS ∈ Asset do
11      {
12        node = recover_item(AIS, item)
13        Complet_Cost = node.Complet_Weight · N
14        Constr_Cost = Constr_surf(N, node)
15        item.impact = item.impact + (Complet_Cost · AIS.Complet_weight) + (Constr_Cost · AIS.Constr_weight)
16      }
17    }
18    sort_by_impact(Items)
19    Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that in some sanitizations items are modified that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the individual algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1   Begin
2     Rsup = Rs
3     while (Rsup ≠ ∅) do
4     {
5       select r from Rsup
6       for (i = 0; i ≤ transaction_max_size; i++) do
7         if (item[i] ∈ r) then
8         {
9           item[i].count++
10          item[i].list = item[i].list ∪ r
11        }
12      Rsup = Rsup − r
13    }
14    sort(item)
15    Return(item)
    End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

1   Begin
2     Rsup = Rs
3     Isup = list of all the items
4     Zitem = ∅
5     while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6     {
7       select r from Rsup
8       for each (item ∈ r) do
9       {
10        if (item ∈ Isup) then
11        {
12          if (item ∉ Asset) then Zitem = Zitem + item
13          Isup = Isup − item
14        }
15      }
16      Rsup = Rsup − r
17    }
18    return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm

Figure 313 The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by every item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to:

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.
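A compact sketch of both pre-hiding steps (our illustration; rules are modeled as plain itemsets and the Asset as the set of items appearing in some DMG):

from collections import defaultdict

def pre_hiding(sensitive_rules, asset_items):
    count, supported = defaultdict(int), defaultdict(list)
    for r_id, rule in enumerate(sensitive_rules):
        for item in rule:
            count[item] += 1               # occurrences of the item
            supported[item].append(r_id)   # rules supported by the item
    ranked = sorted(count, key=count.get, reverse=True)  # most shared first
    zero_impact = [i for i in ranked if i not in asset_items]
    return ranked, supported, zero_impact

rules = [{"a", "b"}, {"b", "c"}, {"b", "d"}]
print(pre_hiding(rules, asset_items={"a", "c"}))
# "b" supports three rules and is outside the Asset: the best zero impact choice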

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

1   Begin
2     Izm = {item ∈ Itemlist | item ∈ Zitem}
3     for (i = 0; i < |Izm|; i++) do
4     {
5       Rss = {r ∈ Rs | Izm[i] ∈ r}
6       Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7       while (Tss ≠ ∅) and (Rss ≠ ∅) do
8       {
9         select a transaction t from Tss
10        Tss = Tss − t
11        Sanitize(t, Izm[i])
12        Recompute_Sup_Conf(Rss)
13        ∀r ∈ Rss | (r.sup < min_supp) or (r.conf < min_conf) do
14        {
15          Rss = Rss − r
16          Rs = Rs − r
17        }
        }
      }
    End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked. A sketch of the ranking step is given below.
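A sketch of the ranking used here (the names are assumptions, and we assume ascending impact order): every item receives the maximum estimated DQ impact over the rules it supports, and ties are broken in favor of the item supporting more rules:

def rank_low_impact(items, impact, supported):
    # impact[(item, rule_id)] -> estimated DQ damage; supported[item] -> rule ids
    max_impact = {i: max(impact[(i, r)] for r in supported[i]) for i in items}
    return sorted(items, key=lambda i: (max_impact[i], -len(supported[i])))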

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

1   Begin
2     Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3     for each item ∈ Iml do
4     {
5       for each rule ∈ item.list do
6       {
7         N = compute_step(item, thresholds)
8         For each AIS ∈ Asset do
9         {
10          node = recover_item(AIS, item)
11          item.list.rule.cost = item.list.rule.cost + Compute_Cost(node, N)
12        }
13      }
14      item.max_cost = max_cost(item)
15    }
16    Sort Iml by max cost and number of rules supported
17    While (Rs ≠ ∅) do
18    {
19      Rss = {r ∈ Rs | Iml[1] ∈ r}
20      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21      while (Tss ≠ ∅) and (Rss ≠ ∅) do
22      {
23        select a transaction t from Tss
24        Tss = Tss − t
25        Sanitize(t, Iml[1])
26        Recompute_Sup_Conf(Rss)
27        ∀r ∈ Rss | (r.sup < min_supp) or (r.conf < min_conf) do
28        {
29          Rss = Rss − r
30          Rs = Rs − r
31        }
32      }
33      Iml = Iml − Iml[1]
34    }
    End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

1   Begin
2     Item = Item_Occ_Class(Rs, Itemlist)
3     Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4     if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
5     if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
    End

Figure 3.16: The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we have then proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, 1989. ACM Press.

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988


[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, February 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4), pp. 593-608, December 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, pp. 229-238, 1991. AAAI/MIT Press.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, Vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, Vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. In Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October 1948), pp. 379-423 and pp. 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL, 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. In Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers, Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, Vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University, 2003. Available at: http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan, Computer Technology Institute and Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, Chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, November 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. In 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.



(¬∃ sd_{jf} = NULL, f = m + 1, ..., n)

where by "NULL" value we mean a value without meaning in a well defined context.

Let C be the set of constraints defined on the database OD; in what follows we denote by c_{ij} the j-th constraint on attribute i. We assume here constraints on a single attribute, but it is easily possible to extend the measure to complex constraints.

Definition 4

A tuple sd_k is Consistent if ¬∃ c_{ij} ∈ C such that c_{ij}(sd_{ki}) = false, i = 1, ..., n.
Starting from these definitions it is possible to introduce three metrics (reported in Table 1.2) that allow us to measure the lack of accuracy, completeness and consistency. More specifically, we define the lack of accuracy as the proportion of non-accurate items (i.e. the amount of items modified during the sanitization) with respect to the number of items contained in the sanitized database. Similarly, the lack of completeness is defined as the proportion of non-complete items (i.e. the items substituted with a NULL value during the sanitization) with respect to the total number of items contained in the sanitized database. The lack of consistency is simply defined as the number of constraint violations in SD caused by the sanitization phase.

Name                Short Explanation                              Expression
Accuracy Lack       The proportion of non-accurate items in SD     λ_SD = |SD_nacc| / |SD|
Completeness Lack   The proportion of non-complete items in SD     ϑ_SD = |SD_nc| / |SD|
Consistency Lack    The number of constraint violations in SD      ϖ_SD = N_C

Table 1.2: The three parameters of interest in PPDM DQ evaluation. In the expressions, SD_nacc denotes the set of non-accurate items, SD_nc denotes the set of non-complete items and N_C denotes the number of constraint violations.
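To make these metrics concrete, the following minimal Python sketch computes them for a sanitized database represented as a list of tuples (the helper names and the representation of NULL as None are our assumptions, not part of the formal framework):

def accuracy_lack(n_modified_items, n_items):
    # lambda_SD = |SD_nacc| / |SD|: proportion of items modified by the sanitization
    return n_modified_items / n_items

def completeness_lack(sd, n_items):
    # theta_SD = |SD_nc| / |SD|: proportion of items replaced by NULL (None here)
    return sum(1 for tup in sd for v in tup if v is None) / n_items

def consistency_lack(sd, constraints):
    # varpi_SD = N_C: absolute number of constraint violations in SD,
    # with each constraint given as a boolean predicate over a tuple
    return sum(1 for tup in sd for c in constraints if not c(tup))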

Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sanitized database. These parameters are, however, not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are aware neither of the relevance of the information stored in the database nor of the relationships among such information.

It is then necessary to provide a formal description that allows us to magnify the aggregate information of interest for a target database, and the relevance of the DQ properties for each piece of aggregate information (for some aggregate information, not all the DQ properties have the same relevance). Moreover, for each piece of aggregate information it is necessary to provide a structure describing the relevance (in terms of the consistency, completeness and accuracy levels required) at the attribute level, together with the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG), used to represent the attributes involved in an aggregate information and their constraints, and of the Aggregation Information Schema (AIS). Moreover, we give a definition of the Aggregate Information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = < Name, AW, AV, CW, CV, CSV, SLink >

where

• Name is the attribute id.

• AW is the accuracy weight for the target attribute.

• AV is the accuracy value.

• CW is the completeness weight for the target attribute.

• CV is the completeness value.

• CSV is the consistency value.

• SLink is the list of simple constraints.

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have different relevance in an aggregate information. For this reason a set of weights is associated with every attribute class, specifying, for example, whether for a certain attribute a lack of accuracy must be considered as relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient: in order to evaluate the consistency property we also need to take into consideration the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e. whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving more than two attributes at the same time. In what follows the definitions of simple constraint and complex constraint are shown.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = < Name, Constraint, CW, CSV, CLink >

where

• Name is the constraint id.

• Constraint describes the constraint using some logic expression.

• CW is the weight of the constraint. It represents the relevance of this constraint in the AIS.

• CSV is the number of violations of the constraint.

• CLink is the list of complex constraints defined on the SCC.

Definition 7

A Complex Constraint Class is defined as the tuple CCC = < Name, Operator, CW, CSV, SCCLink >, where:

• Name is the complex constraint id.

• Operator is the "merging" operator by which the simple constraints are used to build the complex one.

• CW is the weight of the complex constraint.

• CSV is the number of violations.

• SCCLink is the list of all the SCCs that are related to the CCC.

We have now the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• A set of nodes NA, where each node is an Attribute Class.

• A set of nodes SCC, where each node describes a Simple Constraint Class.

• A set of nodes CCC, where each node describes a Complex Constraint Class.

• A set of directed edges L_{Nj,Nk}, with L_{Nj,Nk} ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).

As in the case of attributes and constraints, even at this level some aggregate information exists that is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple < γ, ξ, λ, ϑ, ϖ, W_AIS >, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS and W_AIS represents the relevance of the AIS in the database.
In a database there is usually more than one aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define the ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.
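As an illustration of how an IQM schema could be encoded, the following Python sketch mirrors the definitions given above (the field names follow the tuples ATC, SCC, CCC and AIS; everything else, starting from the use of dataclasses, is an assumption of ours):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttributeClass:          # ATC = <Name, AW, AV, CW, CV, CSV, SLink>
    name: str
    aw: float                  # accuracy weight
    av: float                  # accuracy value
    cw: float                  # completeness weight
    cv: float                  # completeness value
    csv: int                   # consistency value
    slink: List["SimpleConstraintClass"] = field(default_factory=list)

@dataclass
class SimpleConstraintClass:   # SCC = <Name, Constraint, CW, CSV, CLink>
    name: str
    constraint: Callable[..., bool]  # the logic expression
    cw: float                  # relevance of a violation of this constraint
    csv: int = 0               # number of violations
    clink: List["ComplexConstraintClass"] = field(default_factory=list)

@dataclass
class ComplexConstraintClass:  # CCC = <Name, Operator, CW, CSV, SCCLink>
    name: str
    operator: str              # the "merging" operator, e.g. "AND"
    cw: float
    csv: int = 0
    scclink: List[SimpleConstraintClass] = field(default_factory=list)

@dataclass
class AIS:                     # <gamma, xi, lambda, theta, varpi, W_AIS>
    name: str
    dmg: List[object]          # attribute and constraint nodes of the DMG
    accuracy: float
    completeness: float
    consistency: float
    weight: float              # relevance of the AIS in the database

Asset = List[AIS]              # the ASSET: the collection of all relevant AIS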

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of an IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, to monitor and to remotely manage electrical apparatus (local energy stations, power grid segments, etc.) [62]. In such a context there exist many problems related to system security and to privacy preservation.

Figure 2.1: Example of DMG (data attribute and constraint nodes, annotated with accuracy, completeness and constraint weights, connected through an operator node).

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. These new meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to infer the behavior of the people living in the house (e.g. from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, for example to calculate the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e. a wrong power consumption estimation and, consequently, a blackout).

Figure 2.2: Example of ASSET. To every AIS contained in the ASSET the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.

An IQM Example

The electricity meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the table containing the localization information of the electricity meters and the table containing the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is located.

• City: it contains the name of the city in which the home hosting the meter is located.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per-Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy Consumption per Hour: it is a vector of attributes containing the average energy consumption per hour.

• Max Energy Consumption per Hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. In figure 2.3 two DMG schemas associated with the data contained in the previously described tables are shown. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As shown in figure 2.3, some constraints are associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is therefore high (near to 1).

Figure 2.3: Example of two DMG schemas associated with a National Electric Power Grid database. The first DMG (AIS Loc) links the localization attributes (Local Code, City, Network Number, Subnetwork Code) to constraints such as "Value 1-1000", "Value 0-50", "they correspond to the same city" and "the subnet is contained in the correct network". The second DMG (AIS Cons) links the consumption attributes (Energy Counter Id, Date, Average Energy Consumption, Max Energy Consumption, and the per-hour vectors En 1 am ... En 12 pm and Max 1 am ... Max 12 pm) through the constraints "average energy used equal to" and "max equal to".

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists on the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to adopt a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of the information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
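For instance, the coherence requirements just described could be expressed as executable predicates, as in the following sketch (a hypothetical illustration; the parameter names and the numeric tolerance are assumptions):

def avg_consistent(en_per_hour, avg_energy_consumption, tol=1e-6):
    # complex constraint: the per-hour values may be permuted or modified,
    # provided their mean still matches the registered average consumption
    return abs(sum(en_per_hour) / len(en_per_hour) - avg_energy_consumption) <= tol

def max_consistent(max_per_hour, max_energy_consumption):
    # complex constraint: the per-day maximum equals the largest per-hour maximum
    return max(max_per_hour) == max_energy_consumption

def bounds_consistent(en_per_hour, max_per_hour):
    # simple constraints: each maximum per hour cannot be smaller than the
    # corresponding average per hour
    return all(m >= a for a, m in zip(en_per_hour, max_per_hour))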

By the use of these two schemas we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to exploit these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist that are designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a large range of databases with different characteristics.

• Association Rules seem to be the most appropriate technique to be used in combination with the IQM schema.

• The results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 Codmine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i_1, ..., i_n} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or in an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple < TID, values_of_items, size >, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value in values_of_items is 0. Size is the number of 1 values which appear in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.
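Using the bitmap notation just introduced, support and confidence can be computed as in the following minimal Python sketch (transactions are encoded as 0/1 lists and itemsets as sets of item positions; the function names are ours):

def supports(values_of_items, itemset):
    # a transaction supports an itemset if every item of the itemset has value 1
    return all(values_of_items[i] == 1 for i in itemset)

def support(db, itemset):
    # probability of finding transactions containing all the items of the itemset
    return sum(1 for t in db if supports(t, itemset)) / len(db)

def confidence(db, antecedent, consequent):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(db, antecedent | consequent) / support(db, antecedent)

db = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
print(support(db, {0, 2}))          # 0.75
print(confidence(db, {0}, {2}))     # 1.0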


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D' in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold (MST or MCT), which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is obtained by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1. T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
   foreach transaction t in T'_lr do
2.   t.num_items = |I| − |lr ∩ t.values_of_items|
3. Sort(T'_lr) in descending order of number of items of lr contained
   Repeat until (conf(r) < min_conf) {
4.   t = T'_lr[1]
5.   set_all_ones(t.values_of_items, lr)
6.   N_lr = N_lr + 1
7.   conf(r) = N_r / N_lr
8.   Remove t from T'_lr }
9. Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease (lr and rr denote the antecedent and the consequent of rule r).


Algorithm 2

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1. Tr = {t ∈ D | t fully supports r}
   foreach transaction t in Tr do
2.   t.num_items = count(t)
3. Sort(Tr) in ascending order of transaction size
   Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.   t = Tr[1]
5.   j = choose_item(rr)
6.   set_to_zero(j, t.values_of_items)
7.   N_r = N_r − 1
8.   supp(r) = N_r / |D|
9.   conf(r) = N_r / N_lr
10.  Remove t from Tr }
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1. Tr = {t ∈ D | t fully supports r}
   foreach transaction t in Tr do
2.   t.num_items = count(t)
3. Sort(Tr) in decreasing order of transaction size
   Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.   t = Tr[1]
5.   j = choose_item(r)
6.   set_to_zero(j, t.values_of_items)
7.   N_r = N_r − 1
8.   Recompute N_lr
9.   supp(r) = N_r / |D|
10.  conf(r) = N_r / N_lr
11.  Remove t from Tr }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease.

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) ∗ 100) / maxsup(lr) and maxconf(r) = (maxsup(r) ∗ 100) / minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
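In the presence of question marks the support of an itemset becomes an interval; the following sketch (with "?" encoded as None, an assumption of ours, and supports expressed as percentages) computes the bounds defined above:

def min_support(db, itemset):
    # percentage of transactions that certainly support the itemset (all 1's)
    return 100 * sum(1 for t in db if all(t[i] == 1 for i in itemset)) / len(db)

def max_support(db, itemset):
    # percentage of transactions that support or could support it (1 or ?)
    return 100 * sum(1 for t in db if all(t[i] in (1, None) for i in itemset)) / len(db)

def min_confidence(db, antecedent, consequent):
    # minconf(r) = minsup(r) * 100 / maxsup(lr); assumes maxsup(lr) > 0
    return min_support(db, antecedent | consequent) * 100 / max_support(db, antecedent)

def max_confidence(db, antecedent, consequent):
    # maxconf(r) = maxsup(r) * 100 / minsup(lr); assumes minsup(lr) > 0
    return max_support(db, antecedent | consequent) * 100 / min_support(db, antecedent)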


Algorithm 4

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1. Sort(Lh) in descending order of size and support
2. Th = {T_Z | Z ∈ Lh and ∀t ∈ D, t ∈ T_Z ⇒ t supports Z}
Foreach Z in Lh {
3. Sort(T_Z) in ascending order of transaction size
   Repeat until (supp(Z) < min_supp) {
4.   t = pop_from(T_Z)
5.   a = maximal support item in Z
6.   aS = {X ∈ Lh | a ∈ X}
     Foreach X in aS
7.     if (t ∈ T_X) delete(t, T_X)
8.   delete(a, t, D) }
}
End

Algorithm 5

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1. Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2. i = 0
3. j = 0
4. <T_Z> = sequence of transactions supporting Z
5. <Z> = sequence of items in Z
   Repeat until (supp(Z) < min_supp) {
6.   a = <Z>[i]
7.   t = <T_Z>[j]
8.   aS = {X ∈ Lh | a ∈ X}
     Foreach X in aS
9.     if (t ∈ T_X) then delete(t, T_X)
10.  delete(a, t, D)
11.  i = (i + 1) modulo size(Z)
12.  j = j + 1 }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets.

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the alternative strategies adopted aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and the minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1. Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2. T_Z = {t ∈ D | t supports Z}
3. Sort the transactions in T_Z in ascending order of transaction size
   Repeat until (minsup(Z) < MST − SM) {
4.   Place a ? mark for the item with the largest minimum support of Z in the next transaction in T_Z
5.   Update the supports of the affected itemsets
6.   Update the database D }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction.

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to clarify the problem here, we note that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider a health database. The introduction of artifactual rules may deceive a doctor that, on the basis of these rules, can choose a wrong therapy for a patient.


Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1. Tr = {t ∈ D | t fully supports r}
2. for each t in Tr count the number of items in t
3. sort the transactions in Tr in ascending order of the number of items supported
   Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM) {
4.   Choose the first transaction t ∈ Tr
5.   Choose the item j in rr with the highest minsup
6.   Place a ? mark in the place of j in t
7.   Recompute minsup(r)
8.   Recompute minconf(r)
9.   Recompute the minconf of the other affected rules
10.  Remove t from Tr }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1. T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2. for each t in T'_lr count the number of items of lr in t
3. sort the transactions in T'_lr in descending order of the number of items of lr supported
   Repeat until (minconf(r) < MCT − SM) {
4.   Choose the first transaction t ∈ T'_lr
5.   Place a ? mark in t for the items in lr that are not supported by t
6.   Recompute maxsup(lr)
7.   Recompute minconf(r)
8.   Recompute the minconf of the other affected rules
9.   Remove t from T'_lr }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease.

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way on any type of categorical database, knowing nothing about the use of the database, the relationships between the data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We therefore introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance. In the context of Association Rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the ASSET schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.
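The core of this selection strategy can be condensed in the following skeleton (a sketch: items_of, in_asset and dq_impact stand for the rule inspection, the ASSET lookup and the damage simulation detailed in the next sections):

def select_item(asset, sensitive_rule):
    # steps 1-2: the items involved in the sensitive rule
    candidates = items_of(sensitive_rule)
    # improvement: prefer Zero Impact Items, i.e. items absent from every DMG
    zero_impact = [i for i in candidates if not in_asset(asset, i)]
    if zero_impact:
        return zero_impact[0]
    # steps 3-5: simulate the modification and rank by the simulated DQ damage
    return min(candidates, key=lambda i: dq_impact(asset, sensitive_rule, i))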


3.7 Data Quality Distortion Based Algorithm

In the distortion based algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming to work with categorical databases like the classical Market Basket database^1, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this it is possible to use a well known algorithm, named the APRIORI algorithm [5], that, given a database, extracts all the rules contained in it that have a confidence over a threshold level^2. Details on this algorithm are described in Appendix B. The output of this phase is a set of rules, from which a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item exists in the rule that can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set; if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and it selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

^1 A Market Basket database is a database in which each transaction represents the products acquired by a customer.

^2 This threshold level represents the confidence over which a rule can be identified as relevant and sure.


Figure 3.6: Flow chart of the DQDB algorithm. A sensitive rule is identified and the set of zero-impact items is computed; if the set is not empty an item is selected from it, otherwise every item in the rule is inspected on the ASSET, its DQ impact (Accuracy, Consistency) is evaluated, the items are ranked by quality impact and the best one is selected. A transaction fully supporting the sensitive rule is then selected and the item value is converted from 1 to 0, repeating until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea about its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N ∗ O(lg(M)), where N is the size of the rule and M is the number of different items contained in the ASSET.


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of Zero Impact Items discovered

1 Begin
2   Zitem = ∅
3   Rsup = Rh
4   Isup = list of the items contained in Rsup
5   While (Isup ≠ ∅)
6   {
7     res = 0
8     Select item from Isup
9     res = Search(Asset, item)
10    if (res == 0) then Zitem = Zitem + item
11    Isup = Isup − item
12  }
13  return(Zitem)
14 End

Figure 3.7: Zero Impact Algorithm.

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1 Begin
2   Qitem = ∅
3   Confa = Sup(Rh) / Sup(ant(Rh))
4   Confp = Confa
5   Nstep_if_ant = 0
6   Nstep_if_post = 0
7   While (Confa > Th) do
8   {
9     Nstep_if_ant++
10    Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
11  }
12  While (Confp > Th) do
13  {
14    Nstep_if_post++
15    Confp = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
16  }
17  For each item in Rh do
18  {
19    if (item ∈ ant(Rh)) then N = Nstep_if_ant
20    else N = Nstep_if_post
21    For each AIS ∈ Asset do
22    {
23      node = recover_item(AIS, item)
24      Accur_Cost = node.accuracy_Weight ∗ N
25      Constr_Cost = Constr_surf(N, node)
26      item.impact = item.impact + (Accur_Cost ∗ AIS.Accur_weight) + (Constr_Cost ∗ AIS.Constr_weight)
27    }
28  }
29  sort_by_impact(Items)
30  Return(Items)
End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule).


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the ASSET. This operation is repeated for each item in the rule Rh, and then it takes |Items| ∗ O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| ∗ O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1 Begin
2   Zitem = DQDB_Zero_Impact(Asset, Rh)
3   if (Zitem ≠ ∅) then item = random_sel(Zitem)
4   else
5   {
6     Impact_set = Item_Impact_Rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7     item = best(Impact_set)
8   }
9   Tr = {t ∈ D | t fully supports Rh}
10  While (Rh.Conf > Threshold) do
11    sanitize(Tr, item)
12 End

Figure 3.9: The Distortion Based Sanitization Algorithm.

3.8 Data Quality Based Blocking Algorithm

In the blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the data quality. For this reason the Zero Impact Algorithm is exactly the same as the one presented for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the evaluation of the number of steps (which is of course different) and to the data quality evaluation. In fact, in this case it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

Figure 3.10: Flow chart of the BBDB algorithm. The flow is the same as in Figure 3.6, except that the DQ impact is evaluated in terms of Completeness and Consistency, and the value of the selected item is converted to "?" instead of 0.

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
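The only operational difference with respect to the distortion case is the value written into the supporting transactions, as in this fragment (a sketch; "?" is represented here by None):

def sanitize_blocking(tr, item):
    # blocking: replace one supporting 1 value with the unknown symbol "?"
    # (None here) instead of turning it into a 0 as in the distortion case
    for t in tr:
        if t[item] == 1:
            t[item] = None
            return  # one transaction modified per sanitization step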

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1 Begin
2   Qitem = ∅
3   Confa = Sup(Rh) / Sup(ant(Rh))
4   Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5   Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6   For each item in Rh do
7   {
8     if (item ∈ ant(Rh)) then N = Nstep_ant
9     else N = Nstep_post
10    For each AIS ∈ Asset do
11    {
12      node = recover_item(AIS, item)
13      Complet_Cost = node.Complet_Weight ∗ N
14      Constr_Cost = Constr_surf(N, node)
15      item.impact = item.impact + (Complet_Cost ∗ AIS.Complet_weight) + (Constr_Cost ∗ AIS.Constr_weight)
16    }
17  }
18  sort_by_impact(Items)
19  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm.

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement of our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden (a toy illustration of the item classification and selection is sketched below).
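As a small worked example of steps 1-4 (the rules are invented, not taken from the tests in the report), the following Python fragment counts how many sensitive rules each item supports and selects the most shared item first:

# Toy example: four sensitive rules, each modeled as a set of items.
sensitive_rules = [
    {"a", "b"}, {"b", "c"}, {"b", "d"}, {"c", "d"},
]

# Step 1: count, for every item, the number of sensitive rules containing it.
occurrences = {}
for rule in sensitive_rules:
    for item in rule:
        occurrences[item] = occurrences.get(item, 0) + 1

# Steps 3-4: sanitize first the items supporting the most rules
# (here we assume all of them are Zero Impact, i.e. outside the Asset).
for item, count in sorted(occurrences.items(), key=lambda kv: -kv[1]):
    print(item, count)   # b:3, then c:2, d:2, a:1 -> start from "b"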

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction_max_size; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ r
11       }
12     Rsup = Rsup − r
13   }
14   sort(item)
15   Return(item)
   End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11         if (item ∉ Asset) then Zitem = Zitem + item
12       Isup = Isup − item
13     }
14     Rsup = Rsup − r
15   }
16   return(Zitem)
   End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by every item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
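A compact Python rendering of these two pre-hiding steps may help. It is a sketch under our own data model (rules as frozensets of items, the Asset reduced to the set of items appearing in some AIS), not the project code:

def classify_occurrences(sensitive_rules, items):
    # Figure 3.12: for every item, count the supported sensitive rules
    # and remember which rules they are.
    count = {i: 0 for i in items}
    rules_of = {i: [] for i in items}
    for r in sensitive_rules:
        for i in items:
            if i in r:
                count[i] += 1
                rules_of[i].append(r)
    ranked = sorted(items, key=lambda i: -count[i])
    return ranked, count, rules_of

def zero_impact_items(sensitive_rules, asset_items):
    # Figure 3.13: an item supporting a sensitive rule has zero DQ impact
    # when it does not appear in any AIS of the Asset (our reading).
    zitem = set()
    for r in sensitive_rules:
        for i in r:
            if i not in asset_items:
                zitem.add(i)
    return zitem

rules = [frozenset({"a", "b"}), frozenset({"b", "c"})]
ranked, count, rules_of = classify_occurrences(rules, {"a", "b", "c"})
print(ranked, zero_impact_items(rules, asset_items={"a"}))

The nested loop in classify_occurrences makes the |sensitive set| × |Item List| cost stated above directly visible.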

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule in the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present in what follows the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i ≤ |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       forall r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
14       {
15         Rss = Rss − r
16         Rs = Rs − r
17       }
       }
     }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
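The loop of Figure 3.14 can be rendered in Python as follows. This is a simplified sketch of ours (transactions as mutable item sets, rules as (antecedent, consequent) pairs of frozensets, and a distortion-style sanitize that simply removes the item), not the report's implementation:

def supp_conf(db, rule):
    # Support and confidence of a rule over a list of transaction sets.
    ant, cons = rule
    full = sum(1 for t in db if ant | cons <= t)
    ant_n = sum(1 for t in db if ant <= t)
    return full / len(db), (full / ant_n if ant_n else 0.0)

def multi_rule_zero_impact_hiding(db, rules, zitems, minsup, minconf):
    rules = list(rules)
    for z in zitems:
        rss = [r for r in rules if z in r[0] | r[1]]
        tss = [t for t in db if any(r[0] | r[1] <= t for r in rss)]
        while tss and rss:
            t = tss.pop()
            t.discard(z)                 # distortion-style sanitize
            still_visible = []
            for r in rss:
                sup, conf = supp_conf(db, r)
                if sup >= minsup and conf >= minconf:
                    still_visible.append(r)
                else:
                    rules.remove(r)      # hidden: drop from Rs as well
            rss = still_visible
    return db, rules                     # remaining rules = not yet hidden

db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
rules = [(frozenset({"a"}), frozenset({"b"}))]
print(multi_rule_zero_impact_hiding(db, rules, zitems={"b"},
                                    minsup=0.3, minconf=0.6))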

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item contains the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all the rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.list.rule.cost = item.list.rule.cost + ComputeCost(node, N)
12       }
13       item.maxcost = maxcost(item)
14     }
15     Sort Iml by max cost and rules supported
   }
16   While (Rs ≠ ∅) do
17   {
18     Rss = {r ∈ Rs | Iml[1] ∈ r}
19     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
20     while (Tss ≠ ∅) and (Rss ≠ ∅) do
21     {
22       select a transaction t from Tss
23       Tss = Tss − t
24       Sanitize(t, Iml[1])
25       Recompute_Sup_Conf(Rss)
26       forall r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
27       {
28         Rss = Rss − r
29         Rs = Rs − r
30       }
31     }
32     Iml = Iml − Iml[1]
33   }
34   End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
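Read as plain code, Figure 3.16 is a two-stage dispatch. The sketch below is our own Python rendering: it reuses the classify_occurrences, zero_impact_items and multi_rule_zero_impact_hiding helpers sketched earlier in this chapter, and assumes a multi_rule_low_impact_hiding function with an analogous signature for the Figure 3.15 loop.

def sanitize_database(db, sensitive_rules, asset_items, minsup, minconf):
    # Pre-hiding phase: occurrence ranking and zero-impact identification.
    flat = [r[0] | r[1] for r in sensitive_rules]     # rules as item sets
    items = set().union(*flat)
    ranked, count, rules_of = classify_occurrences(flat, items)
    zitems = zero_impact_items(flat, asset_items)
    # Hide at zero DQ cost whatever the zero-impact items allow.
    if zitems:
        db, sensitive_rules = multi_rule_zero_impact_hiding(
            db, sensitive_rules, zitems, minsup, minconf)
    # Remaining rules go through the low-impact loop of Figure 3.15.
    if sensitive_rules:
        db, sensitive_rules = multi_rule_low_impact_hiding(
            db, sensitive_rules, minsup, minconf)
    return db

Because the sanitize step inside both hiding loops is a black box, swapping the distortion primitive for the blocking one changes nothing in this orchestration, which is exactly the independence property claimed above.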

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem; in fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM Algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principles of Database Systems pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference on Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in Multi Input Multi Output Information Systems Management Science Vol 31 Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset counting and implication rules for market basket data In Proc of the ACM SIGMOD International Conference on Management of Data year 1997 ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theory and Results Advances in Knowledge Discovery and Data Mining AAAI Press/MIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM Trans Database Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control J Am Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support In proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Computer Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databases Computer 16 (7) pp 69-82 year 1983 (July) IEEE Press

[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Reading Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowledge from organizational information systems Information Systems Volume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclosure Control Methods for Microdata Confidentiality Disclosure and Data Access Theory and Practical Applications for Statistical Agencies pp 113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Holland year 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for the optimality of the simple Bayesian classifier Proceedings of the Thirteenth International Conference on Machine Learning pp 105-112 San Francisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly 1999

[30] P O Duda and P E Hart Pattern Classification and Scene Analysis Wiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vs data utility The R-U confidentiality map Tech Rep No 121 National Institute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in vertically partitioned databases In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizes shapes and densities in noisy high dimensional data In Proceedings of the SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data Mining SIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACM Press

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architecture Advances in Neural Information Processing Systems 2 pp 524-532 Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectures on Physics v I Reading Massachusetts Addison-Wesley Publishing Company year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines Proc Tenth ACM Symposium on Theory of Computing (1978) pp 114-118 ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discovery in Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturing testing IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEE Press

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithm for large databases In Proceedings of the ACM SIGMOD Conference pp 73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithm for categorical attributes In Proceedings of the 15th ICDE pp 512-521 Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Jim Gray Series Editor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidate generation In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data Dallas Texas USA May 2000 ACM Press

[48] M A Hanson and R L Brekke Workload management expert system - combining neural networks and rule-based programming in an operational application In Proceedings Instrument Society of America pp 1721-1726 year 1988

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algorithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large multimedia databases with noise In Proceedings of the 4th ACM SIGKDD pp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator httpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery pp 24-31 year 2002 IEEE Educational Activities Department

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clustering algorithm using dynamic modeling COMPUTER 32 pp 68-75 year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introduction to Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and Reality North Holland New York year 1978

[57] S L Lauritzen The EM algorithm for graphical association models with missing data Computational Statistics and Data Analysis 19 (2) pp 191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion Detection In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implications and prescriptions Sloan Management Review Cambridge Vol 40 Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal of Cryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journal of the American Society for Information Science 48 (3) pp 254-269 year 1997

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure 29th ESReDA Seminar "Systems Analysis for a More Secure World" year 2005

[63] G McLachlan and K Basford Mixture Models Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruning In Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of Process Ancient Light on Modern Information and Communication Theory and Technology PhD thesis Library and Information Studies Rutgers New Brunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system for information downgrading In Proceedings of the 5th Joint Conference on Information Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in Privacy Preserving Data Mining ACM SIGKDD 3rd Workshop on Data Mining Standards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent Itemset Mining Proceedings of the IEEE International Conference on Privacy Security and Data Mining pp 43-54 year 2002 Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering by Data Transformation In Proceedings of the 18th Brazilian Symposium on Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41 Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology for compromise of confidential information in statistical databases ACM Trans Database Syst 12 4 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithm for Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strong rules Knowledge Discovery in Databases pp 229-238 AAAI/MIT Press year 1991

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C4.5 Programs for Machine Learning Morgan Kaufmann year 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp 81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier that Integrates Building and Pruning Data Mining and Knowledge Discovery vol 4 n 4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Categorical and Numeric Attributes Proc of International Conference on Data Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in Association Rule Mining In Proceedings of the 28th International Conference on Very Large Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning in probabilistic networks with hidden variables In International Joint Conference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kaufmann

[83] D E Rumelhart G E Hinton and R J Williams Learning internal representations by error propagation Parallel Distributed Processing Explorations in the Microstructure of Cognition Vol 1 Foundations pp 318-362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to preserve confidentiality of business statistics In Proceedings of the 2nd International Workshop on Statistical Database Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm for mining association rules in large databases In Proceedings of the Conference on Very Large Databases Zurich Switzerland September 1995 Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases Comput J 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 (July and October) 1948 pp 379-423 and pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communication University of Illinois Press Urbana Ill 1949

[89] A Shoshani Statistical databases characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach to Rule Induction from databases IEEE Transactions On Knowledge And Data Engineering vol 3 n 4 August 1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Proceedings of the 21st International Conference on Very Large Data Bases pp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACM Vol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to Data Disclosure Problems Research in Official Statistics vol 4 pp 7-22 year 2001

[96] M Trottini Decision models for data disclosure limitation Carnegie Mellon University Available at httpwwwnissorgdgiiTRThesisTrottini -finalpdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University Codmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Mining in Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin Y Theodoridis State-of-the-art in Privacy Preserving Data Mining SIGMOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E Dasseni Association Rule Hiding IEEE Transactions on Knowledge and Data Engineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snob program In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence pp 37-44 UNE World Scientific Publishing Co Armidale Australia 1994

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cambridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A Philosophical Analysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in Ontological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95 Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data Quality Means to Data Consumers Journal of Management Information Systems Vol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure control Lecture Notes in Statistics Vol 155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signature Recognition 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press


Chapter 2

The Data Quality Scheme

In the previous section we have presented a way to measure DQ in the sanitized database. These parameters are however not sufficient to help us in the construction of a new PPDM algorithm based on the preservation of DQ. As explained in the introduction of this report, the main problem of current PPDM algorithms is that they are not aware of the relevance of the information stored in the database, nor of the relationships among these pieces of information.

It is necessary, then, to provide a formal description that allows us to magnify the aggregate information of interest for a target database and the relevance of the DQ properties for each aggregate information (for some aggregate information, not all the DQ properties have the same relevance). Moreover, for each aggregate information it is necessary to provide a structure that describes the relevance (in terms of the consistency, completeness and accuracy level required) at the attribute level, and the constraints the attributes must satisfy. The Information Quality Model (IQM) proposed here addresses this requirement.

In the following we give a formal definition of the Data Model Graph (DMG) (used to represent the attributes involved in an aggregate information and their constraints) and of the Aggregation Information Schema (AIS). Moreover, we give a definition for an Aggregate information Schema Set (ASSET). Before giving the definitions of DMG, AIS and ASSET, we introduce some preliminary concepts.

Definition 5

An Attribute Class is defined as the tuple ATC = ⟨name, AW, AV, CW, CV, CSV, Slink⟩

where

• Name is the attribute id.

• AW is the accuracy weight for the target attribute.

• AV is the accuracy value.

• CW is the completeness weight for the target attribute.

• CV is the completeness value.

• CSV is the consistency value.

• Slink is the list of simple constraints.
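As a purely illustrative data structure (not part of the formal model), an Attribute Class can be rendered as a small Python dataclass whose fields follow the tuple above:

from dataclasses import dataclass, field
from typing import List

@dataclass
class AttributeClass:          # ATC = <name, AW, AV, CW, CV, CSV, Slink>
    name: str                  # attribute id
    aw: float                  # accuracy weight
    av: float = 0.0            # accuracy value
    cw: float = 0.0            # completeness weight
    cv: float = 0.0            # completeness value
    csv: float = 0.0           # consistency value
    slink: List[str] = field(default_factory=list)  # ids of simple constraints

local_code = AttributeClass(name="Local_Code", aw=0.9, cw=0.9)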

An attribute class represents an attribute of a target database schema involved in the construction of a certain aggregate information for which we want to evaluate the data quality. The attributes, however, have a different relevance in an aggregate information. For this reason a set of weights is associated to every attribute class, specifying for example whether, for a certain attribute, a lack of accuracy must be considered as relevant damage or not.

The attributes are the bricks of our structure. However, a simple list of attributes is not sufficient. In fact, in order to evaluate the consistency property we need to take into consideration also the relationships between the attributes. The relationships can be represented by logical constraints, and the validation of these constraints gives us a measure of the consistency of an aggregate information. As in the case of the attributes, also for the constraints we need to specify their relevance (i.e., whether a violation of a constraint causes negligible or severe damage). Moreover, there exist constraints involving at the same time more than two attributes. In what follows, the definitions of simple constraint and complex constraint are given.

Definition 6

A Simple Constraint Class is defined as the tuple SCC = ⟨name, Constraint, CW, Clink, CSV⟩

where

• Name is the constraint id.

• Constraint describes the constraint using some logic expression.

• CW is the weight of the constraint. It represents the relevance of this constraint in the AIS.

• CSV is the number of violations of the constraint.

• Clink is the list of complex constraints defined on the SCC.

Definition 7

A Complex Constraint Class is defined as the tuple CCC = ⟨name, Operator, CW, CSV, SCC_link⟩ where

• Name is the complex constraint id.

• Operator is the "merging" operator by which the simple constraints are used to build the complex one.

• CW is the weight of the complex constraint.

• CSV is the number of violations.

• SCC_link is the list of all the SCCs that are related to the CCC.

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints allowing one to describe an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• A set of nodes NA, where each node is an Attribute Class.

• A set of nodes SCC, where each node describes a Simple Constraint Class.

• A set of nodes CCC, where each node describes a Complex Constraint Class.

• A set of directed edges LNj,Nk, with LNj,Nk ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).
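Continuing the illustrative Python model started above (field choices are our own), the constraint classes and the DMG can be kept as plain dataclasses plus an edge list restricted to the four admissible node-pair types:

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class SimpleConstraint:        # SCC = <name, Constraint, CW, Clink, CSV>
    name: str
    constraint: Callable[[dict], bool]   # logic expression over a tuple
    cw: float                            # constraint weight
    csv: int = 0                         # number of violations
    clink: List[str] = field(default_factory=list)

@dataclass
class ComplexConstraint:       # CCC = <name, Operator, CW, CSV, SCC_link>
    name: str
    operator: str                        # merging operator, e.g. "AND"
    cw: float
    csv: int = 0
    scc_link: List[str] = field(default_factory=list)

@dataclass
class DMG:
    attributes: List[str] = field(default_factory=list)   # NA node ids
    sccs: List[SimpleConstraint] = field(default_factory=list)
    cccs: List[ComplexConstraint] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)
    # each edge must connect NA->SCC, SCC->CCC, SCC->NA or CCC->NA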

As in the case of attributes and constraints, even at this level there exists aggregate information more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, ϖ, WAIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, ϖ is the consistency of the AIS, and WAIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define an ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" that is used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to characterize the target database in a correct way.

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of IQM schema.

The Example Context

Modern power grids are starting today to use the Internet, and more generally public networks, in order to monitor and to manage remotely electric apparatus (local


[Figure 2.1: Example of DMG. The graph links data attribute nodes (annotated with accuracy and completeness weights) to constraint nodes, constraint weights and merging operators.]

energy stations, power grid segments, etc.) [62]. In such a context there exist a lot of problems related to system security and to privacy preservation.

More in detail, we take as an example a very current privacy problem caused by the new generation of home electricity meters. In fact, these new meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can be easily classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behaviors of the people living in the house (e.g., from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, for example to calculate the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of privacy preserving techniques without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e., wrong power consumption estimation and, consequently, blackouts).


[Figure 2.2: Example ASSET. To every AIS contained in the ASSET the dimensions of accuracy, completeness and consistency are associated. Moreover, to every AIS the proper DMG is associated; to every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.]


An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter and the information related to the per day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes. Localization Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is located.

• City: it contains the name of the city in which the home hosting the meter is located.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy Consumption per Hour: it is a vector of attributes containing the average energy consumption per hour.

• Max Energy Consumption per Hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. In Figure 2.3, two DMG schemas associated to the data contained in the previously described tables are shown. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc schema is mainly related to the data contained in the localization table. As it is possible to see in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
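For illustration only (the dictionary layout and the 0.95 weight are our own choices consistent with the discussion above), the local-code rule can be written as a simple constraint that counts its violations over the sanitized tuples:

# Illustrative instance of the local-code constraint (range 1-1000,
# high weight because violations would expose the sanitization).
def local_code_in_range(row):
    return 1 <= row["Local_Code"] <= 1000

local_code_constraint = {
    "name": "local_code_range",
    "check": local_code_in_range,
    "weight": 0.95,          # near 1, as discussed above
    "violations": 0,
}

sanitized_rows = [{"Local_Code": 42}, {"Local_Code": 1350}]
for row in sanitized_rows:
    if not local_code_constraint["check"](row):
        local_code_constraint["violations"] += 1
print(local_code_constraint["violations"])   # 1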

[Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. The AIS Loc schema links Local Code (constrained to the value range 1-1000), City, Network Number and Subnetwork Code (value range 0-50), with constraints stating that code and city must correspond to the same city and that the subnetwork must be contained in the correct network. The AIS Cons schema links the Energy Meter ID, the Date, the hourly consumption attributes (En 1 am ... En 12 pm), the hourly maxima (Max 1 am ... Max 12 pm), the Average Energy Consumption (constrained to equal the average of the hourly values) and the Max Energy Consumption (constrained to equal the maximum of the hourly maxima).]

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to think about a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, while avoiding the exact localization of the electric meter (i.e., "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy Consumption per Hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation with the maximum consumption per hour. Finally, there exists a relation between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to make the database consistent is to maintain at least the individuality of the electric meters.
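For illustration (field names and tolerances are invented), the two kinds of constraints described for AIS Cons can be checked as follows: the complex constraint ties the hourly vector to the stored daily average, and a set of simple constraints requires each hourly maximum to be at least the corresponding hourly average:

def consistent_average(hourly, daily_avg, tol=1e-6):
    # Complex constraint: hourly values may be permuted or modified,
    # but their mean must stay equal to the stored daily average.
    return abs(sum(hourly) / len(hourly) - daily_avg) <= tol

def consistent_maxima(hourly_avg, hourly_max):
    # Simple constraints: max consumption per hour cannot be smaller
    # than the average consumption in the same hour.
    return all(m >= a for a, m in zip(hourly_avg, hourly_max))

hourly = [0.2, 0.1, 0.4, 0.3]                            # toy 4-hour day
print(consistent_average(hourly, 0.25))                  # True
print(consistent_maxima(hourly, [0.3, 0.2, 0.5, 0.4]))   # True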

By the use of these two schemas we have described some characteristics of the information stored in the database, related with their meaning and their usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular data mining technique. Therefore, algorithms exist designed to limit the malicious use of clustering algorithms, of classification algorithms, and of association rule algorithms.

In this report we focus on association rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association rule data mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association rule data mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of databases with different characteristics.

• Association rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations



about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 Codmine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.
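A small worked example may clarify the definitions (the database and thresholds are invented): with the four transactions below, the rule {bread} ⇒ {butter} has support 2/4 = 0.5 and confidence 2/3 ≈ 0.67, so with min_supp = 0.4 and min_conf = 0.7 it is frequent but not strong.

# Toy transactional database in bitmap spirit: each transaction is the
# set of items it supports (value 1); absent items have value 0.
db = [{"bread", "butter"}, {"bread", "butter", "milk"},
      {"bread"}, {"milk"}]

def support(db, itemset):
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, x, y):
    return support(db, x | y) / support(db, x)

x, y = {"bread"}, {"butter"}
print(support(db, x | y))     # 0.5
print(confidence(db, x, y))   # 0.666...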

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X)

of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1  T'lr = {t ∈ D | t does not fully support rr and partially supports lr}
   foreach transaction t in T'lr do
2    t.num_items = |I| − |lr ∩ t.values_of_items|
3  Sort(T'lr) in descending order of number of items of lr contained
   Repeat until (conf(r) < min_conf)
   {
4    t = T'lr[1]
5    set_all_ones(t.values_of_items, lr)
6    Nlr = Nlr + 1
7    conf(r) = Nr / Nlr
8    Remove t from T'lr
   }
9  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease
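A minimal Python transcription of the idea behind Algorithm 1 may help; it is our own reading under a simplified data model (transactions as mutable item sets), not the Codmine code. Turning partial supporters of the antecedent into full supporters raises Nlr while leaving Nr fixed, so the confidence drops:

def confidence(db, ant, cons):
    n_ant = sum(1 for t in db if ant <= t)
    n_rule = sum(1 for t in db if ant | cons <= t)
    return n_rule / n_ant if n_ant else 0.0

def hide_by_confidence_decrease(db, ant, cons, min_conf):
    # Candidates: partially support ant, do not fully support the
    # consequent (so completing ant adds no rule support).
    cands = [t for t in db
             if (t & ant) and not ant <= t and not cons <= t]
    # Prefer transactions already containing most of ant.
    cands.sort(key=lambda t: len(t & ant), reverse=True)
    while confidence(db, ant, cons) >= min_conf and cands:
        t = cands.pop(0)
        t |= ant                # set to 1 all items of the antecedent
    return db

db = [{"a", "b", "x"}, {"a", "b", "x"}, {"a"}, {"b"}]
hide_by_confidence_decrease(db, ant={"a", "b"}, cons={"x"}, min_conf=0.6)
print(confidence(db, {"a", "b"}, {"x"}))   # 0.5, below the threshold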


Algorithm 2:

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1  Tr = {t ∈ D | t fully supports r}
   foreach transaction t in Tr do
2    t.num_items = count(t)
3  Sort(Tr) in ascending order of transaction size
   Repeat until (supp(r) < min_sup) or (conf(r) < min_conf)
   {
4    t = Tr[1]
5    j = choose_item(rr)
6    set_to_zero(j, t.values_of_items)
7    Nr = Nr − 1
8    supp(r) = Nr / |D|
9    conf(r) = Nr / Nlr
10   Remove t from Tr
   }
11 Remove r from Rh
}
End

Algorithm 3:

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1  Tr = {t ∈ D | t fully supports r}
   foreach transaction t in Tr do
2    t.num_items = count(t)
3  Sort(Tr) in decreasing order of transaction size
   Repeat until (supp(r) < min_sup) or (conf(r) < min_conf)
   {
4    t = Tr[1]
5    j = choose_item(r)
6    set_to_zero(j, t.values_of_items)
7    Nr = Nr − 1
8    Recompute Nlr
9    supp(r) = Nr / |D|
10   conf(r) = Nr / Nlr
11   Remove t from Tr
   }
12 Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r:

minconf(r) = (minsup(r) * 100) / maxsup(lr)  and  maxconf(r) = (maxsup(r) * 100) / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
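With the third symbol in play, the bounds can be computed directly. The sketch below is our own formulation: it represents a transaction as a dict from item to 0, 1 or "?" and derives minsup/maxsup and the confidence bounds given above (it assumes the antecedent supports are non-zero, as in the toy data):

def min_max_support(db, itemset):
    certain = sum(1 for t in db if all(t.get(i) == 1 for i in itemset))
    possible = sum(1 for t in db
                   if all(t.get(i) in (1, "?") for i in itemset))
    n = len(db)
    return certain / n, possible / n       # (minsup, maxsup)

def min_max_confidence(db, ant, cons):
    minsup_r, maxsup_r = min_max_support(db, ant | cons)
    minsup_l, maxsup_l = min_max_support(db, ant)
    return (minsup_r * 100 / maxsup_l,     # minconf(r)
            maxsup_r * 100 / minsup_l)     # maxconf(r)

db = [{"a": 1, "b": 1}, {"a": 1, "b": "?"},
      {"a": 1, "b": 0}, {"a": 0, "b": 1}]
print(min_max_support(db, {"a", "b"}))         # (0.25, 0.5)
print(min_max_confidence(db, {"a"}, {"b"}))    # (33.3..., 66.6...)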


INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.   Sort(Lh) in descending order of support and size
     Foreach Z in Lh do {
2.     i = 0
3.     j = 0
4.     <TZ> = sequence of transactions supporting Z
5.     <Z> = sequence of items in Z
       Repeat until (supp(Z) < min_sup) {
6.       a = <Z>[i]
7.       t = <TZ>[j]
8.       aS = {X ∈ Lh | a ∈ X}
         Foreach X in aS {
9.         if (t ∈ TX) then delete(t, TX) }
10.      delete(a, t, D)
11.      i = (i+1) modulo size(Z)
12.      j = j+1 } }
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one instead increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.
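As an informal sketch of Algorithm 6's marking loop (assuming the min_sup helper sketched earlier, and a hypothetical pick_item helper that returns the item of Z with the largest minimum support):

def fuzzify_support(db, Z, MST, SM, pick_item):
    # place '?' marks in the shortest supporting transactions
    # until minsup(Z) drops below MST - SM
    supporters = [t for t in db if all(t.get(i) == 1 for i in Z)]
    supporters.sort(key=lambda t: sum(1 for v in t.values() if v == 1))
    k = 0
    while min_sup(db, Z) >= MST - SM and k < len(supporters):
        supporters[k][pick_item(db, Z)] = "?"  # no longer a certain supporter
        k += 1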


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.   Sort Lh in descending order of size and minimum support of the large itemsets
     Foreach Z in Lh {
2.     TZ = {t ∈ D | t supports Z}
3.     Sort the transactions in TZ in ascending order of transaction size
       Repeat until (minsup(Z) < MST - SM) {
4.       Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.       Update the supports of the affected itemsets
6.       Update the database D } }
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms (with the exception of the first one) are able in any case to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. To illustrate the problem, during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, may choose a wrong therapy for a patient. In


INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.   Tr = {t ∈ D | t fully supports r}
2.   for each t in Tr count the number of items in t
3.   sort the transactions in Tr in ascending order of the number of items supported
     Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM) {
4.     Choose the first transaction t ∈ Tr
5.     Choose the item j in rr with the highest minsup
6.     Place a "?" mark for the place of j in t
7.     Recompute minsup(r)
8.     Recompute minconf(r)
9.     Recompute the minconf of other affected rules
10.    Remove t from Tr }
11.  Remove r from Rh }
End

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.   T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.   for each t in T'_lr count the number of items of lr in t
3.   sort the transactions in T'_lr in descending order of the number of items of lr supported
     Repeat until (minconf(r) < MCT - SM) {
4.     Choose the first transaction t ∈ T'_lr
5.     Place a "?" mark in t for the items in lr that are not supported by t
6.     Recompute maxsup(lr)
7.     Recompute minconf(r)
8.     Recompute the minconf of other affected rules
9.     Remove t from T'_lr }
10.  Remove r from Rh }
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease (lr and rr denote the rule antecedent and consequent)

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, knowing nothing about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms,


and more specifically in the family of Perturbation Algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving an attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that the Asset schema related to a target database is available, our Data Hiding strategy can be described as follows:

1. A sensitive rule s is identified;

2. The items involved in s are identified;

3. By inspecting the Asset schema, the DQ damage is computed by simulating the modification of these items;

4. An item ranking is built considering the results of the previous operation;

5. The item with the lowest impact is selected;

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database Asset. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the Asset description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection, and we use one of these Zero Impact Items in order to hide the target rule.
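A minimal sketch of the resulting selection step follows (illustrative names: asset_items is the set of items appearing in some DMG of the Asset, dq_damage a callable simulating the DQ loss of modifying an item):

def select_item(rule_items, asset_items, dq_damage):
    # prefer Zero Impact Items; only if none exists, inspect the Asset
    # and rank the rule's items by simulated DQ damage
    zero_impact = [i for i in rule_items if i not in asset_items]
    if zero_impact:
        return zero_impact[0]      # no Asset inspection needed
    ranking = sorted(rule_items, key=dq_damage)
    return ranking[0]              # item with the lowest DQ impact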


3.7 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (which support it partially or fully, depending on the strategy adopted). In our proposal, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As can be seen, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well-known Apriori algorithm [5], which, given a database, extracts all the contained rules that have a confidence over a threshold level². Details on this algorithm are given in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm); then we compute how many steps are needed in order to bring the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is a small difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule; a small sketch of this computation is given after Figure 3.6). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The Sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden;

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect;

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

¹ A Market Basket database is a database in which each transaction represents the products acquired by a customer.

² This threshold level represents the confidence above which a rule can be identified as relevant and sure.


[Flow chart: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, an item is selected from the zero-impact set, otherwise each item in the selected rule is inspected on the Asset, its DQ impact (Accuracy, Consistency) is evaluated, the items are ranked by quality impact and the best item is selected; then a transaction fully supporting the sensitive rule is selected and its value converted from 1 to 0, repeating until the rule is hidden.]

Figure 3.6: Flow Chart of the DQDB Algorithm

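The step computation mentioned above can be sketched as follows; this is a hypothetical helper mirroring the two cases of Figure 3.8, where removing an antecedent item decreases both the rule and the antecedent support, while removing a consequent item decreases the rule support only.

def steps_to_hide(sup_r, sup_ant, th, item_in_antecedent):
    # sup_r, sup_ant: absolute support counts of the rule and its antecedent
    # (if conf(r) = 1 the antecedent case never converges,
    #  so a consequent item must be chosen in that situation)
    steps = 0
    conf = sup_r / sup_ant
    while conf > th:
        steps += 1
        if item_in_antecedent:
            conf = (sup_r - steps) / (sup_ant - steps)
        else:
            conf = (sup_r - steps) / sup_ant
    return steps

For instance, steps_to_hide(40, 50, 0.6, False) returns 10: ten transactions must lose the consequent item before the confidence reaches the 0.6 threshold.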

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness, we give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(log(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the (possibly empty) set Zitem of Zero Impact Items discovered

1    Begin
2      Zitem = ∅
3      Rsup = Rh
4      Isup = list of the items contained in Rsup
5      While (Isup ≠ ∅)
6      {
7        res = 0
8        Select item from Isup
9        res = Search(Asset, item)
10       if (res == 0) then Zitem = Zitem + item
11       Isup = Isup - item
12     }
13     return(Zitem)
14   End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1    Begin
2      Qitem = ∅
3      Confa = Sup(Rh) / Sup(ant(Rh))
4      Confb = Confa
5      Nstep_if_ant = 0
6      Nstep_if_post = 0
7      While (Confa > Th) do
8      {
9        Nstep_if_ant++
10       Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
11     }
12     While (Confb > Th) do
13     {
14       Nstep_if_post++
15       Confb = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
16     }
17     For each item in Rh do
18     {
19       if (item ∈ ant(Rh)) then N = Nstep_if_ant
20       else N = Nstep_if_post
21       For each AIS ∈ Asset do
22       {
23         node = recover_item(AIS, item)
24         Accur_Cost = node.accuracy_weight * N
25         Constr_Cost = Constr_surf(N, node)
26         item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
27       }
28     }
29     sort_by_impact(Items)
30     Return(Items)
     End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| · O(log(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1    Begin
2      Zitem = DQDB_Zero_Impact(Asset, Rh)
3      if (Zitem ≠ ∅) then item = random_sel(Zitem)
4      else
5      {
6        Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7        item = best(Impact_set)
8      }
9      Tr = {t ∈ D | t fully supports Rh}
10     While (Rh.Conf > Threshold) do
11       sanitize(Tr, item)
12   End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?" The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps Evaluation (which is of course different) and the Data Quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy loss, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.
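The change can be sketched as follows (illustrative field names, not the exact Asset schema; compare the accuracy-based cost of Figure 3.8):

def blocking_item_cost(node, ais, n_steps, constr_surf):
    # blocking evaluates (Completeness, Consistency): a '?' value is
    # missing rather than wrong, so the completeness weight replaces
    # the accuracy weight used in the distortion case
    completeness_cost = node["completeness_weight"] * n_steps
    consistency_cost = constr_surf(n_steps, node)
    return (completeness_cost * ais["completeness_weight"]
            + consistency_cost * ais["consistency_weight"])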

[Flow chart: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, an item is selected from the zero-impact set, otherwise each item in the selected rule is inspected on the Asset, its DQ impact (Completeness, Consistency) is evaluated, the items are ranked by quality impact and the best item is selected; then a transaction fully supporting the sensitive rule is selected and its value converted to "?", repeating until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized; D contains a set of four sensitive rules. In order to protect these rules we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules;

2. We apply our algorithm (DQDB or BBDB);

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh);


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1    Begin
2      Qitem = ∅
3      Confa = Sup(Rh) / Sup(ant(Rh))
4      Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5      Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6      For each item in Rh do
7      {
8        if (item ∈ ant(Rh)) then N = Nstep_ant
9        else N = Nstep_post
10       For each AIS ∈ Asset do
11       {
12         node = recover_item(AIS, item)
13         Complet_Cost = node.completeness_weight * N
14         Constr_Cost = Constr_surf(N, node)
15         item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16       }
17     }
18     sort_by_impact(Items)
19     Return(Items)
     End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules);

2. We identify, given the sensitive set, the Global Set of Zero Impact Items;

3. We order the Zero Impact Set by considering the previously counted occurrences;

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules;

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden;

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed;

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen;

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase;

• Multi Zero Impact Hiding;

• Multi Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1    Begin
2      Rsup = Rs
3      while (Rsup ≠ ∅) do
4      {
5        select r from Rsup
6        for (i = 0; i ≤ transaction_max_size; i++) do
7          if (item[i] ∈ r) then
8          {
9            item[i].count++
10           item[i].list = item[i].list ∪ r
11         }
12       Rsup = Rsup - r
13     }
14     sort(item)
15     Return(item)
     End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1    Begin
2      Rsup = Rs
3      Isup = list of all the items
4      Zitem = ∅
5      while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6      {
7        select r from Rsup
8        for each (item ∈ r) do
9        {
10         if (item ∈ Isup) then
11           if (item ∉ Asset) then Zitem = Zitem + item
12         Isup = Isup - item
13       }
14       Rsup = Rsup - r
15     }
16     return(Zitem)
     End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rule set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.
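A compact Python sketch of the classification of Figure 3.12 (rules modeled as sets of items keyed by an identifier; illustrative structures only):

def classify_items(sensitive_rules, items):
    # count, for every item, how many sensitive rules it supports,
    # and remember which rules those are
    count = {i: 0 for i in items}
    rules_of = {i: [] for i in items}
    for rule_id, rule_items in sensitive_rules.items():
        for i in items:
            if i in rule_items:
                count[i] += 1
                rules_of[i].append(rule_id)
    # most supported items first, as required by the burst strategy
    ranked = sorted(items, key=lambda i: count[i], reverse=True)
    return ranked, rules_of

The double loop makes the |sensitive set| × |Item List| cost explicit.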

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• the items supporting a rule of the sensitive rule set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set);

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);

3. We then select a transaction in Tss and perform the sanitization on this transaction (by using either blocking or distortion);


4. We recompute the support and the confidence values for the rules contained in the Rss set, and if one of the rules results hidden, it is removed from the Rss set and from the Rs set;

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive Rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1    Begin
2      Izm = {item ∈ Item_list | item ∈ Zitem}
3      for (i = 0; i < |Izm|; i++) do
4      {
5        Rss = {r ∈ Rs | Izm[i] ∈ r}
6        Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7        while (Tss ≠ ∅) and (Rss ≠ ∅) do
8        {
9          select a transaction t from Tss
10         Tss = Tss - t
11         Sanitize(t, Izm[i])
12         Recompute_Sup_Conf(Rss)
13         forall r ∈ Rss | (r.sup < min_sup) Or (r.conf < min_conf) do
14         {
15           Rss = Rss - r
16           Rs = Rs - r
17         }
       }
     }
     End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rule set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rule set. Each item carries the list of the supported rules;

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item;

3. The items are sorted by maximum impact and number of supported rules;

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);

5. We then select a transaction in Tss and perform the sanitization on this transaction (blocking or distortion);

6. We recompute the support and the confidence values for the rules contained in the Rss set, and if one of the rules results hidden, it is removed from the Rss set and from the Rs set;

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items, and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.
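The ordering of steps 2-3 of the strategy can be sketched as follows (dq_cost is a hypothetical estimator of the DQ impact of hiding a given rule via a given item; ties on impact are broken in favor of the item supporting more rules, as in step 7):

def rank_low_impact(items, rules_of, dq_cost):
    # associate to every item the maximum estimated DQ impact over the
    # rules it supports, then sort by (impact asc, supported rules desc)
    max_cost = {i: max(dq_cost(i, r) for r in rules_of[i]) for i in items}
    return sorted(items, key=lambda i: (max_cost[i], -len(rules_of[i])))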

INPUT: the Asset Schema A associated to the target database, the sensitive Rules set Rs to be hidden, the Thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1    Begin
2      Iml = {item ∈ item_list | item supports a rule r ∈ Rs}
3      for each item ∈ Iml do
4      {
5        for each rule ∈ item.list do
6        {
7          N = compute_step(item, thresholds)
8          For each AIS ∈ Asset do
9          {
10           node = recover_item(AIS, item)
11           item.list.rule.cost = item.list.rule.cost + Compute_Cost(node, N)
12         }
13       }
14       item.max_cost = max_cost(item)
15     }
16     Sort Iml by max cost and number of rules supported
17     While (Rs ≠ ∅) do
18     {
19       Rss = {r ∈ Rs | Iml[1] ∈ r}
20       Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21       while (Tss ≠ ∅) and (Rss ≠ ∅) do
22       {
23         select a transaction t from Tss
24         Tss = Tss - t
25         Sanitize(t, Iml[1])
26         Recompute_Sup_Conf(Rss)
27         forall r ∈ Rss | (r.sup < min_sup) Or (r.conf < min_conf) do
28         {
29           Rss = Rss - r
30           Rs = Rs - r
31         }
32       }
33       Iml = Iml - Iml[1]
34     }
     End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive Rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1    Begin
2      Item = Item_Occ_Class(Rs, Item_list)
3      Zitem = Multi_Rule_Zero_Imp(Rs, Item_list)
4      if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5      if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
     End

Figure 3.16: The General Algorithm
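As a minimal end-to-end sketch of this general algorithm (the two hiding passes are hypothetical wrappers around Figures 3.14 and 3.15, passed in as black boxes, and classify_items is the sketch from Section 3.10):

def sanitize_all(db, sensitive_rules, asset_items, thresholds,
                 mr_zero_impact_hiding, mr_low_impact_hiding):
    # classify items, hide through Zero Impact Items first, then fall
    # back to the Low Impact multi-rule pass for the surviving rules
    items = sorted({i for t in db for i, v in t.items() if v == 1})
    ranked, rules_of = classify_items(sensitive_rules, items)
    zero = [i for i in ranked if i not in asset_items]
    if zero:
        mr_zero_impact_hiding(db, sensitive_rules, zero, thresholds)
    if sensitive_rules:  # some rules are still visible
        mr_low_impact_hiding(db, sensitive_rules, ranked, thresholds)
    return db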

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data, and thus indirectly to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.


Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, year 1989. ACM Press.

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, year 1999. IEEE Computer Society.

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955



[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, year 1984.

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, year 2001. Springer-Verlag.

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, year 2002. Elsevier Ltd.

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, year 1984. IEEE Press.

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988


[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997


[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October), 1948, pp. 379-423 and 623-656.

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, year 2002. World Scientific Publishing Co., Inc.

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.



12 CHAPTER 2 THE DATA QUALITY SCHEME

bull CSV is the consistency value

bull Slink is list of simple constraints

An attribute class represents an attribute of a target database schema in-volved in the construction of a certain aggregate information for which we wantto evaluate the data quality The attributes however have a different relevancein an aggregate information For this reason a set of weights is associated toevery attribute class specifying for example if for a certain attribute a lack ofaccuracy must be considered as relevant damage or not

The attributes are the bricks of our structure However a simple list of at-tributes in not sufficient In fact in order to evaluate the consistency propertywe need to take in consideration also the relationships between the attributesThe relationships can be represented by logical constraints and the validationof these constraints give us a measure of the consistency of an aggregate infor-mation As in the case of the attributes also for the constraints we need tospecify their relevance (ie a violation of a constraint cause a negligible or asevere damage) Moreover there exists constraints involving at the same timemore then two attributes In what follows the definitions of simple constraintand complex constraint are showed

Definition 6

A Simple Constraint Class is defined as the tuple SCC =lt nameConstr CWClink CSV gt

where

bull Name is the constraint id

bull Constraint describes the constraint using some logic expression

bull CW is the weigh of the constraint It represents the relevance of thisconstraint in the AIS

bull CSV is the number of violations to the constraint

bull Clink it is the list of complex constraints defined on SCC

Definition 7

A Complex Constraint Class is defined as the tuple CCC =lt nameOperator

CWCSV SCC link gt where

bull Name is the Complex Constraint id

bull Operator is the ldquoMergingrdquo operator by which the simple constraints areused to build the complex one

bull CW is the weigh of the complex constraint

bull CSV is the number of violations

13

bull SCC link is the list of all the SCC that are related to the CCC

We have now the bricks and the glue of our structure A DMG is a collectionof attributes and constraints allowing one to describe an aggregate informationMore formally let D be a database

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features

bull A set of nodes NA where each node is an Attribute Class

bull A set of nodes SCC where each node describes a Simple Constraint Class

bull A set of nodes CCC where each node describes a Complex Constraint Class

bull A set of direct edges LNjNk LNjNk isin ((NAXSCC) cup (SCCXCCC) cup(SCCXNA) cup (CCCXNA))

As is the case of attributes and constraints even at this level there exists ag-gregate information more relevant than other We define then an AIS as follows

Definition 9

An AIS φ is defined as a tuple lt γ ξ λ ϑ$WAIS gt where γ is a name ξ

is a DMG λ is the accuracy of AIS ϑ is the completeness of AIS $ is theconsistency of AIS and WAIS represent the relevance of AIS in the databaseIn a database there are usually more than an aggregate information for whichone is interested to measure the quality

Definition 9

With respect to D we define ASSET (Aggregate information Schema Set) asthe collection of all the relevant AIS of the database

The DMG completely describes the relations between the different data itemsof a given AIS and the relevance of each of these data respect to the data qualityparameter It is the ldquoroad maprdquo that is used to evaluate the quality of a sanitizedAIS

As it is probably obvious the design of a good IQM schema and more ingeneral of a good ASSET it is fundamental to characterize in a correct way thetarget database

201 An IQM Example

In order to magnify which is the real meaning of a data quality model we givein this section a brief description of a possible context of application and wethen describe an example of IQM Schema

The Example Context

Modern power grid starting today to use Internet and more generally publicnetwork in order to monitor and to manage remotely electric apparels (local

14 CHAPTER 2 THE DATA QUALITY SCHEME

Constraint13

Data Attribute13

Accuracy13

Completeness13

Operator13

Constraint13Weight13

DMG13

Figure 21 Example of DMG

energy stations power grid segments etc) [62] In such a context there exist alot of problems related to system security and to the privacy preservation

More in details we take as example a very actual privacy problem causedby the new generation of home electricity meters In fact such a type of newmeters are able to collect a big number of information related to the homeenergy consumption Moreover they are able to send the collected informationto a central remote database The information stored in such database can beeasily classified as sensitive

For example the electricity meters can register the energy consumption perhour and per day in a target house From the analysis of this information itis easy to retrieve some behavior of the people living in this house (ie fromthe 9 am to 1 pm the house is empty from the 1 pm to 230 pm some-one is at home from the 8 pm the whole family is at home etc) Thistype of information is obviously sensitive and must be protected On the otherhand information like the maximum energy consumption per family the av-erage consumption per day the subnetwork of pertinence and so on have anhigh relevance for the analysts who want to correctly analyze the power gridto calculate for example the amount of power needed in order to guarantee anacceptable service over a certain energy subnetwork

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance of the data contained in the database could cause relevant damage (i.e., a wrong power consumption estimation and, consequently, a blackout).



Figure 2.2: Example of ASSET. To every AIS contained in the ASSET the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.


An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the table containing the localization information of the electric meter and the table containing the per day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: the ID of the energy meter.

• Local Code: the code of the city/region in which the meter is located.

• City: the name of the city in which the home hosting the meter is located.

• Network: the code of the electric network.

• Subnetwork: the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: the ID of the energy meter.

• Average Energy Consumption: the average energy used by the target home during a day.

• Date: the date of registration.

• Energy Consumption per Hour: a vector of attributes containing the average energy consumption per hour.

• Max Energy Consumption per Hour: a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. Figure 2.3 shows two DMG schemas associated with the data contained in the previously described tables. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As can be seen in Figure 2.3, some constraints are associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is therefore high (near to 1).


Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database. The DMG of AIS Loc links the Local Code (value in 1-1000), the City, the Network Number and the Subnetwork Code (value in 0-50) through constraints such as "they correspond to the same city" and "the subnet is contained in the correct network". The DMG of AIS Cons links the Energy Counter Id, the Date, the per hour consumption attributes (En 1 am, En 2 am, ..., En 12 pm) and the per hour maxima (Max 1 am, ..., Max 12 pm) through the constraints "average energy used equal to" and "max equal to".

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assume a decreasing level of accuracy and completeness. In this way we try to preserve the information about the area of pertinence while avoiding the exact localization of the electric meter (i.e., "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy Consumption per Hour vector and the average energy consumption per day. To this end a complex constraint is used. In this way we express the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the maximum consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because maintaining at least the individuality of the electric meters is a necessary condition for the consistency of the database.
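As an illustration of the constraints just described, the following sketch (the field names and the tolerance eps are our own assumptions, not a specification from the report) shows how the range constraint of AIS Loc and the coherence constraints of AIS Cons could be checked:

    def check_local_code(row):
        """Simple constraint on AIS Loc: codes outside 1..1000 are meaningless
        and would reveal to a malicious user that the tuple was sanitized."""
        return 1 <= row["local_code"] <= 1000

    def check_consumption_coherence(row, eps=1e-6):
        """Complex constraint on AIS Cons: the per-hour values may be permuted
        or modified as long as the daily average stays consistent; the simple
        constraints require max per hour >= average per hour."""
        hourly = row["energy_per_hour"]
        avg_ok = abs(sum(hourly) / len(hourly) - row["avg_energy"]) <= eps
        max_ok = all(m >= a for m, a in zip(row["max_per_hour"], hourly))
        return avg_ok and max_ok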

By the use of these two schemas we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to exploit these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist that are designed to limit the malicious use of Clustering algorithms, of Classification algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can easily be synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of databases with different characteristics.

• Association Rules seem to be the most appropriate technique to be used in combination with the IQM schema.

• The results of classification data mining algorithms can easily be converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.




3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value is 0. Size is the number of 1 values appearing in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose support is bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence is bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to the rules that are both frequent and strong. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule is reduced below min_supp and min_conf, respectively.
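As a concrete illustration of these definitions, the following sketch (our own example, using the bitmap notation above) computes the support and confidence of a rule:

    def supp(itemset, transactions):
        """Fraction of transactions whose bitmap has value 1 for every item."""
        n = sum(all(t.get(i, 0) == 1 for i in itemset) for t in transactions)
        return n / len(transactions)

    def conf(antecedent, consequent, transactions):
        """conf(X => Y) = supp(X u Y) / supp(X)."""
        return supp(antecedent | consequent, transactions) / supp(antecedent, transactions)

    # Example: three transactions over the items a, b, c.
    D = [{"a": 1, "b": 1, "c": 0},
         {"a": 1, "b": 1, "c": 1},
         {"a": 1, "b": 0, "c": 0}]
    print(supp({"a", "b"}, D))    # 0.666...: two of three transactions support {a, b}
    print(conf({"a"}, {"b"}, D))  # 0.666...: supp({a, b}) / supp({a})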


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds (MST or MCT), which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is obtained by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X)

of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T′_lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T′_lr do {
2.      t.num_items = |I| − |lr ∩ t.values_of_items|
    }
3.  Sort(T′_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.      t = T′_lr[1]
5.      set_all_ones(t.values_of_items, lr)
6.      N_lr = N_lr + 1
7.      conf(r) = N_r / N_lr
8.      Remove t from T′_lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease


Algorithm 2:
INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T_r = {t ∈ D | t fully supports r}
    foreach transaction t in T_r do {
2.      t.num_items = count(t)
    }
3.  Sort(T_r) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.      t = T_r[1]
5.      j = choose_item(rr)
6.      set_to_zero(j, t.values_of_items)
7.      N_r = N_r − 1
8.      supp(r) = N_r / |D|
9.      conf(r) = N_r / N_lr
10.     Remove t from T_r
    }
11. Remove r from Rh
}
End

Algorithm 3:
INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T_r = {t ∈ D | t fully supports r}
    foreach transaction t in T_r do {
2.      t.num_items = count(t)
    }
3.  Sort(T_r) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.      t = T_r[1]
5.      j = choose_item(r)
6.      set_to_zero(j, t.values_of_items)
7.      N_r = N_r − 1
8.      Recompute N_lr
9.      supp(r) = N_r / |D|
10.     conf(r) = N_r / N_lr
11.     Remove t from T_r
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r,

minconf(r) = (minsup(r) × 100) / maxsup(lr)   and   maxconf(r) = (maxsup(r) × 100) / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
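The following sketch (our own illustration of the definitions above, not the report's code) computes the minimum and maximum support of an itemset over fuzzified transactions:

    def min_max_support(itemset, transactions):
        """minsup counts transactions that certainly support the itemset (all 1s);
        maxsup also counts those that could support it (1s and ?s)."""
        n = len(transactions)
        certain = sum(all(t.get(i, 0) == 1 for i in itemset)
                      for t in transactions)
        possible = sum(all(t.get(i, 0) in (1, "?") for i in itemset)
                       for t in transactions)
        return 100.0 * certain / n, 100.0 * possible / n

    # minconf(r) = minsup(r) * 100 / maxsup(lr);  maxconf(r) = maxsup(r) * 100 / minsup(lr)
    D = [{"a": 1, "b": "?"}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
    print(min_max_support({"a", "b"}, D))  # (33.3..., 66.6...)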


Algorithm 4:
INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {T_Z | Z ∈ Lh and ∀t ∈ D, t ∈ T_Z ⇒ t supports Z}
    Foreach Z in Lh {
3.      Sort(T_Z) in ascending order of transaction size
        Repeat until (supp(Z) < min_supp) {
4.          t = pop_from(T_Z)
5.          a = the item of Z with maximal support
6.          aS = {X ∈ Lh | a ∈ X}
            Foreach X in aS {
7.              if (t ∈ T_X) delete(t, T_X)
            }
8.          delete(a, t, D)
        }
    }
End

Algorithm 5:
INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
    Foreach Z in Lh do {
2.      i = 0
3.      j = 0
4.      ⟨T_Z⟩ = the sequence of transactions supporting Z
5.      ⟨Z⟩ = the sequence of items in Z
        Repeat until (supp(Z) < min_supp) {
6.          a = ⟨Z⟩[i]
7.          t = ⟨T_Z⟩[j]
8.          aS = {X ∈ Lh | a ∈ X}
            Foreach X in aS {
9.              if (t ∈ T_X) then delete(t, T_X)
            }
10.         delete(a, t, D)
11.         i = (i + 1) modulo size(Z)
12.         j = j + 1
        }
    }
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rule to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and the minimum confidence threshold (MCT), respectively, by a certain safety margin (SM) fixed by the user.

Algorithm 6, which reduces the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of the items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
    Foreach Z in Lh {
2.      T_Z = {t ∈ D | t supports Z}
3.      Sort the transactions in T_Z in ascending order of transaction size
        Repeat until (minsup(Z) < MST − SM) {
4.          Place a ? mark for the item with the largest minimum support of Z in the next transaction in T_Z
5.          Update the supports of the affected itemsets
6.          Update the database D
        }
    }
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. To illustrate the problem, we note that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider a health database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, could choose a wrong therapy for a patient. In


Algorithm 7:
INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T_r = {t ∈ D | t fully supports r}
2.  for each t in T_r count the number of items in t
3.  sort the transactions in T_r in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM) {
4.      Choose the first transaction t ∈ T_r
5.      Choose the item j in rr with the highest minsup
6.      Place a ? mark in place of j in t
7.      Recompute minsup(r)
8.      Recompute minconf(r)
9.      Recompute the minconf of the other affected rules
10.     Remove t from T_r
    }
11. Remove r from Rh
}
End

Algorithm 8:
INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T′_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T′_lr count the number of items of lr in t
3.  sort the transactions in T′_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM) {
4.      Choose the first transaction t ∈ T′_lr
5.      Place a ? mark in t for the items in lr that are not supported by t
6.      Recompute maxsup(lr)
7.      Recompute minconf(r)
8.      Recompute the minconf of the other affected rules
9.      Remove t from T′_lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, possibly generating unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, knowing nothing about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database to be modified. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms,


and more specifically in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that the ASSET schema related to a target database is available, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.
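A minimal sketch of this selection logic (our own illustration; the dq_damage function, which simulates the modification of an item and returns the estimated DQ loss, is an assumption) is the following:

    def select_item(rule_items, asset_items, dq_damage):
        """rule_items: items of the sensitive rule; asset_items: items referenced
        by some DMG of the ASSET; dq_damage: item -> simulated DQ loss."""
        # Zero Impact shortcut: an item of the rule not referenced by any DMG
        # can be modified without affecting the DQ of the database.
        zero_impact = [i for i in rule_items if i not in asset_items]
        if zero_impact:
            return zero_impact[0]
        # Otherwise rank the rule items by simulated DQ damage, lowest first.
        return min(rule_items, key=dq_damage)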


3.7 Data Quality Distortion Based Algorithm

In distortion based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by distorting some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In our proposal, assuming we work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule falls under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As shown there, the first operation required is the identification of the rule to hide. For this purpose it is possible to use the well known Apriori algorithm [5], which, given a database, extracts all the contained rules having a confidence over a threshold level². Details on this algorithm are given in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule is identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase is the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be irrelevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non zero impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency (a small sketch of this step computation is given below, after Figure 3.6). Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set; if such items exist, it randomly selects one of them. If the set is empty, it computes the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

¹A Market Basket database is a database in which each transaction represents the products acquired by a customer.

²This threshold represents the confidence over which a rule can be identified as relevant and reliable.


Figure 3.6: Flow chart of the DQDB algorithm: identify a sensitive rule; compute the set of Zero Impact items; if the set is not empty select an item from it, otherwise inspect the ASSET for each item of the rule, evaluate the DQ impact (Accuracy, Consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item value from 1 to 0 until the rule is hidden.


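A small sketch of the step computation mentioned above (our own illustration, following the two cases of Figure 3.8; the function name is an assumption) is the following:

    def steps_needed(sup_rule, sup_ant, th, item_in_antecedent):
        """Number of transaction modifications needed before conf(r) < th."""
        n = 0
        conf = sup_rule / sup_ant
        while conf >= th:
            n += 1
            if item_in_antecedent:
                # zeroing an antecedent item removes the transaction from
                # both the rule support and the antecedent support
                conf = (sup_rule - n) / (sup_ant - n)
            else:
                # zeroing a consequent item only reduces the rule support
                conf = (sup_rule - n) / sup_ant
        return n

    print(steps_needed(40, 50, 0.6, item_in_antecedent=False))  # 11 (29/50 = 0.58)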

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea of its complexity.

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(lg(M)), where N is the size of the rule and M is the number of different items contained in the ASSET.


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of Zero Impact items discovered

1.  Begin
2.    Zitem = ∅
3.    Rsup = Rh
4.    Isup = list of the items contained in Rsup
5.    While (Isup ≠ ∅) {
6.      res = 0
7.      Select item from Isup
8.      res = Search(Asset, item)
9.      if (res == 0) then Zitem = Zitem + item
10.     Isup = Isup − item
11.   }
12.   return(Zitem)
13. End

Figure 3.7: Zero Impact Algorithm

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1.  Begin
2.    Qitem = ∅
3.    Conf_a = Sup(Rh) / Sup(ant(Rh))
4.    Conf_p = Conf_a
5.    Nstep_if_ant = 0
6.    Nstep_if_post = 0
7.    While (Conf_a > Th) do {
8.      Nstep_if_ant++
9.      Conf_a = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
10.   }
11.   While (Conf_p > Th) do {
12.     Nstep_if_post++
13.     Conf_p = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
14.   }
15.   For each item in Rh do {
16.     if (item ∈ ant(Rh)) then N = Nstep_if_ant
17.     else N = Nstep_if_post
18.     For each AIS ∈ Asset do {
19.       node = recover_item(AIS, item)
20.       Accur_Cost = node.accuracy_Weight × N
21.       Constr_Cost = Constr_surf(N, node)
22.       item.impact = item.impact + (Accur_Cost × AIS.Accur_weight) + (Constr_Cost × AIS.Constr_weight)
23.     }
24.   }
25.   sort_by_impact(Items)
26.   Return(Items)
27. End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (recall that ant(Rh) denotes the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step count is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the ASSET. This operation is repeated for each item in the rule Rh and thus takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, whose complexity is |itemset| · O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1.  Begin
2.    Zitem = DQDB_Zero_Impact(Asset, Rh)
3.    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4.    else {
5.      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6.      item = best(Impact_set)
7.    }
8.    Tr = {t ∈ D | t fully supports Rh}
9.    While (Rh.Conf > Threshold) do
10.     sanitize(Tr, item)
11. End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as in the Distortion Based Algorithm. The strategy we propose here, shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the number-of-steps evaluation (which is of course different) and to the Data Quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

Figure 3.10: Flow chart of the BBDB algorithm: identical to Figure 3.6, except that the DQ impact is evaluated in terms of Completeness and Consistency and the selected item value is converted to "?" instead of 0.

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the ASSET schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the thresholds Th
OUTPUT: the set Qitem ordered by Quality Impact

1.  Begin
2.    Qitem = ∅
3.    Conf_a = Sup(Rh) / Sup(ant(Rh))
4.    Nstep_ant = count_step_ant(Conf_a, Thresholds(Min, Max))
5.    Nstep_post = count_step_post(Conf_a, Thresholds(Min, Max))
6.    For each item in Rh do {
7.      if (item ∈ ant(Rh)) then N = Nstep_ant
8.      else N = Nstep_post
9.      For each AIS ∈ Asset do {
10.       node = recover_item(AIS, item)
11.       Complet_Cost = node.Complet_Weight × N
12.       Constr_Cost = Constr_surf(N, node)
13.       item.impact = item.impact + (Complet_Cost × AIS.Complet_weight) + (Constr_Cost × AIS.Constr_weight)
14.     }
15.   }
16.   sort_by_impact(Items)
17.   Return(Items)
18. End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).



2. Given the sensitive set, we identify the global set of Zero Impact Items.

3. We order the Zero Impact set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. All the rules supported by the chosen item are sanitized simultaneously and removed from the sensitive set when they result hidden.

6. If the Zero Impact set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase;

• Multi Zero Impact Hiding;

• Multi Low Impact Hiding.

This strategy can easily be applied to the algorithms presented before. For this reason we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1.  Begin
2.    Rsup = Rs
3.    while (Rsup ≠ ∅) do {
4.      select r from Rsup
5.      for (i = 0; i ≤ transaction_max_size; i++) do
6.        if (item[i] ∈ r) then {
7.          item[i].count++
8.          item[i].list = item[i].list ∪ r
9.        }
10.     Rsup = Rsup − r
11.   }
12.   sort(item)
13.   Return(item)
14. End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of Zero Impact Items

1.  Begin
2.    Rsup = Rs
3.    Isup = list of all the items
4.    Zitem = ∅
5.    while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6.      select r from Rsup
7.      for each (item ∈ r) do {
8.        if (item ∈ Isup) then {
9.          if (item ∉ Asset) then Zitem = Zitem + item
10.         Isup = Isup − item
11.       }
12.     }
13.     Rsup = Rsup − r
14.   }
15.   return(Zitem)
16. End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to count the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether each item in the item list supports the rule. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
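A compact, self-contained sketch of this pre-hiding phase (our own illustration of Figures 3.12 and 3.13, not the report's code) is the following:

    def pre_hiding(sensitive_rules, asset_items):
        """sensitive_rules: list of sets of items; asset_items: items referenced
        by some DMG of the ASSET. Returns (occurrences, zero_impact), where
        occurrences maps each item to the number of sensitive rules it supports
        and zero_impact lists the unreferenced items, most-supporting first."""
        occurrences = {}
        for rule in sensitive_rules:
            for item in rule:
                occurrences[item] = occurrences.get(item, 0) + 1
        zero_impact = sorted((i for i in occurrences if i not in asset_items),
                             key=lambda i: -occurrences[i])
        return occurrences, zero_impact

    rules = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
    occ, zi = pre_hiding(rules, asset_items={"a", "d"})
    print(occ)  # {'a': 1, 'b': 2, 'c': 2, 'd': 1}
    print(zi)   # ['b', 'c']: b and c are Zero Impact, most-supporting first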

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented above returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by each of these items;

• the items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information we are then able to start the sanitization process for the rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1.  Begin
2.    Izm = {item ∈ Itemlist | item ∈ Zitem}
3.    for (i = 0; i < |Izm|; i++) do {
4.      Rss = {r ∈ Rs | Izm[i] ∈ r}
5.      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
6.      while (Tss ≠ ∅) and (Rss ≠ ∅) do {
7.        select a transaction t from Tss
8.        Tss = Tss − t
9.        Sanitize(t, Izm[i])
10.       Recompute_Sup_Conf(Rss)
11.       ∀r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
12.         Rss = Rss − r
13.         Rs = Rs − r
14.       }
15.     }
16.   }
17. End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact algorithm

In this section we consider the case in which a database must be sanitized when its sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the ASSET description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.

INPUT: the ASSET schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1.  Begin
2.    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3.    for each item ∈ Iml do {
4.      for each rule ∈ item.list do {
5.        N = compute_step(item, thresholds)
6.        For each AIS ∈ Asset do {
7.          node = recover_item(AIS, item)
8.          item.rule.cost = item.rule.cost + ComputeCost(node, N)
9.        }
10.     }
11.     item.maxcost = maxcost(item)
12.   }
13.   Sort Iml by max cost and number of rules supported
14.   While (Rs ≠ ∅) do {
15.     Rss = {r ∈ Rs | Iml[1] ∈ r}
16.     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
17.     while (Tss ≠ ∅) and (Rss ≠ ∅) do {
18.       select a transaction t from Tss
19.       Tss = Tss − t
20.       Sanitize(t, Iml[1])
21.       Recompute_Sup_Conf(Rss)
22.       ∀r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
23.         Rss = Rss − r
24.         Rs = Rs − r
25.       }
26.     }
27.     Iml = Iml − Iml[1]
28.   }
29. End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the ASSET, the thresholds
OUTPUT: a sanitized database PS

1.  Begin
2.    Item = Item_Occ_Class(Rs, Itemlist)
3.    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4.    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5.    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
6.  End

Figure 3.16: The General Algorithm

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input, Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz ed., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

BIBLIOGRAPHY 47

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19 (2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48 (3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino, R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proceedings of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. In Proceedings of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October), pp. 379-423 and pp. 623-656, 1948.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. In Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi, D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. CODMINE: IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin, Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm. EUR - Scientific and Technical Research series - ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from the EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.


• SCC link: the list of all the SCCs that are related to the CCC.

We now have the bricks and the glue of our structure. A DMG is a collection of attributes and constraints that describes an aggregate information. More formally, let D be a database.

Definition 8

A DMG (Data Model Graph) is an oriented graph with the following features:

• A set of nodes NA, where each node is an Attribute Class;

• A set of nodes SCC, where each node describes a Simple Constraint Class;

• A set of nodes CCC, where each node describes a Complex Constraint Class;

• A set of directed edges L(Nj, Nk), where L(Nj, Nk) ∈ ((NA × SCC) ∪ (SCC × CCC) ∪ (SCC × NA) ∪ (CCC × NA)).
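To make the structure concrete, the following minimal sketch (ours, not part of the original work; all class and field names are hypothetical) shows how a DMG could be represented in Python:

from dataclasses import dataclass, field

@dataclass
class AttributeNode:            # an Attribute Class (NA node)
    name: str
    accuracy_weight: float      # relevance of the attribute's accuracy
    completeness_weight: float  # relevance of the attribute's completeness

@dataclass
class ConstraintNode:           # an SCC or CCC node
    description: str
    weight: float               # relevance of a violation of the constraint
    is_complex: bool = False    # True for Complex Constraint Classes

@dataclass
class DMG:                      # oriented graph: nodes plus directed edges
    attributes: list = field(default_factory=list)   # NA nodes
    constraints: list = field(default_factory=list)  # SCC and CCC nodes
    # edges drawn from (NA x SCC) U (SCC x CCC) U (SCC x NA) U (CCC x NA)
    edges: list = field(default_factory=list)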

As is the case for attributes and constraints, even at this level some aggregate information is more relevant than other. We then define an AIS as follows.

Definition 9

An AIS φ is defined as a tuple ⟨γ, ξ, λ, ϑ, $, WAIS⟩, where γ is a name, ξ is a DMG, λ is the accuracy of the AIS, ϑ is the completeness of the AIS, $ is the consistency of the AIS and WAIS represents the relevance of the AIS in the database. In a database there is usually more than one aggregate information for which one is interested in measuring the quality.

Definition 10

With respect to D, we define the ASSET (Aggregate information Schema Set) as the collection of all the relevant AIS of the database.

The DMG completely describes the relations between the different data items of a given AIS and the relevance of each of these data with respect to the data quality parameters. It is the "road map" used to evaluate the quality of a sanitized AIS.

As is probably obvious, the design of a good IQM schema, and more in general of a good ASSET, is fundamental to correctly characterize the target database.

2.0.1 An IQM Example

In order to clarify the real meaning of a data quality model, we give in this section a brief description of a possible context of application, and we then describe an example of IQM schema.

The Example Context

Modern power grids are starting to use the Internet, and more generally public networks, to monitor and manage remotely electric apparatus (local


Figure 2.1: Example of DMG (Data Attribute and Constraint nodes, annotated with Accuracy, Completeness, Operator and Constraint Weight labels).

energy stations, power grid segments, etc.) [62]. In such a context there exist a lot of problems related to system security and to privacy preservation.

More in detail, we take as example a very current privacy problem caused by the new generation of home electricity meters. Such meters are able to collect a large amount of information related to home energy consumption. Moreover, they are able to send the collected information to a central remote database. The information stored in such a database can easily be classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behaviors of the people living in the house (i.e. from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc.). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, to calculate, for example, the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of a privacy preserving technique without knowledge about the meaning and the relevance


Figure 2.2: Example ASSET. To every AIS contained in the ASSET, the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG, the three local DQ parameters are associated.

of the data contained in the database could cause relevant damage (i.e. wrong power consumption estimation and, consequently, blackouts).

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter and to the per-day consumption registered by a target electric meter.


More in detail, the two tables we consider have the following attributes.

Localization Table

• Energy Meter ID: contains the ID of the energy meter.

• Local Code: the code of the city/region in which the meter is located.

• City: contains the name of the city in which the home hosting the meter is located.

• Network: the code of the electric network.

• Subnetwork: the code of the electric subnetwork.

Per-Day Data Consumption Table

• Energy Meter ID: contains the ID of the energy meter.

• Average Energy Consumption: contains the average energy used by the target home during a day.

• Date: contains the date of registration.

• Energy consumption per hour: a vector of attributes containing the average energy consumption per hour.

• Max energy consumption per hour: a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive facts related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. In Figure 2.3, two DMG schemas associated with the data contained in the previously described tables are shown. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As can be seen in Figure 2.3, some constraints are associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple has been sanitized). The weight of this constraint is then high (near to 1).
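As an illustration, a minimal sketch (ours; the function and the weight value are hypothetical) of how such a weighted range constraint could be encoded and checked:

def make_range_constraint(lo, hi, weight):
    # returns (check, weight): check(v) is True when the constraint holds
    return (lambda value: lo <= value <= hi), weight

# the Local Code constraint of AIS Loc: values must lie in [1, 1000],
# and its violation is considered highly relevant (weight near 1)
local_code_check, local_code_weight = make_range_constraint(1, 1000, 0.95)

assert local_code_check(421)        # a meaningful local code
assert not local_code_check(1304)   # out of range: reveals the sanitization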


Figure 2.3: Example of the two DMG schemas associated to a National Electric Power Grid database. AIS Loc contains the attributes Local Code (range constraint: value in 1-1000), City, Network Number and Subnetwork Code (value in 0-50), with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons contains Energy Meter ID, Date, the per-hour energy vector (En 1 am ... En 12 pm), the per-hour maximum vector (Max 1 am ... Max 12 pm), the Average Energy Consumption (constraint: "average energy used equal to") and the Max Energy Consumption (constraint: "max equal to").

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assume a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
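A minimal sketch (ours; the field names are hypothetical) of how these simple and complex constraints could be checked over one per-day consumption record:

def check_consumption_record(hourly_avg, hourly_max, daily_avg, daily_max,
                             tol=1e-6):
    ok = True
    # complex constraint: the per-hour averages must stay coherent
    # with the declared average consumption per day
    ok &= abs(sum(hourly_avg) / len(hourly_avg) - daily_avg) <= tol
    # complex constraint: the daily maximum must equal the largest
    # per-hour maximum
    ok &= abs(max(hourly_max) - daily_max) <= tol
    # simple constraints: each per-hour maximum cannot be smaller
    # than the corresponding per-hour average
    ok &= all(m >= a for m, a in zip(hourly_max, hourly_avg))
    return ok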

By the use of these two schemas we described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms realized in order to use these additional schemas to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist that are designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we present first some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, not to the algorithm development). We then present some considerations


about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X

and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.
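To fix the ideas, a small self-contained sketch (ours; the items and transactions are invented for illustration) of support and confidence over a bitmap-style transactional database:

D = [
    {"TID": 1, "items": {"bread": 1, "milk": 1, "beer": 0}},
    {"TID": 2, "items": {"bread": 1, "milk": 1, "beer": 1}},
    {"TID": 3, "items": {"bread": 0, "milk": 1, "beer": 1}},
    {"TID": 4, "items": {"bread": 1, "milk": 1, "beer": 0}},
]

def support(itemset, db):
    # fraction of transactions supporting every item in the itemset
    n = sum(all(t["items"][i] == 1 for i in itemset) for t in db)
    return n / len(db)

def confidence(antecedent, consequent, db):
    # Supp(X u Y) / Supp(X) for the rule X => Y
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"bread", "milk"}, D))       # 0.75
print(confidence({"bread"}, {"milk"}, D))  # 1.0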


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D, and a subset Rh of R, we want to transform D into a database D' in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold (MST or MCT), fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

Conf(X ⇒ Y) = Supp(X ∪ Y) / Supp(X)

of the rule X ⇒ Y can instead be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item, {0, 1, ?}, is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D; the number |D| of transactions in D; a set Rh of rules to hide; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'_lr do
2.    t.num_items = |I| - |lr ∩ t.values_of_items|
3.  Sort(T'_lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf)
    {
4.    t = T'_lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    N_lr = N_lr + 1
7.    conf(r) = N_r / N_lr
8.    Remove t from T'_lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease (lr and rr denote the antecedent and the consequent of r, respectively)


Algorithm 2

INPUT: the source database D; the size |D| of the database; a set Rh of rules to hide; the min_supp threshold; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
    {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    N_r = N_r - 1
8.    supp(r) = N_r / |D|
9.    conf(r) = N_r / N_lr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D; the size |D| of the database; a set Rh of rules to hide; the min_supp threshold; the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2.    t.num_items = count(t)
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf)
    {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    N_r = N_r - 1
8.    Recompute N_lr
9.    supp(r) = N_r / |D|
10.   conf(r) = N_r / N_lr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r,

minconf(r) = (minsup(r) × 100) / maxsup(lr)   and   maxconf(r) = (maxsup(r) × 100) / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
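A small sketch (ours; the toy transactions are invented) of how the minimum and maximum support and confidence can be computed when transactions may carry the uncertainty symbol "?":

D = [
    {"bread": 1,   "milk": 1},
    {"bread": "?", "milk": 1},
    {"bread": 1,   "milk": "?"},
    {"bread": 0,   "milk": 1},
]

def minsup(itemset, db):   # transactions that certainly support the itemset
    return sum(all(t[i] == 1 for i in itemset) for t in db) / len(db)

def maxsup(itemset, db):   # transactions that support or could support it
    return sum(all(t[i] in (1, "?") for i in itemset) for t in db) / len(db)

def minconf(ant, cons, db):
    return minsup(ant | cons, db) * 100 / maxsup(ant, db)

def maxconf(ant, cons, db):
    return maxsup(ant | cons, db) * 100 / minsup(ant, db)

print(minsup({"bread", "milk"}, D), maxsup({"bread", "milk"}, D))  # 0.25 0.75
print(minconf({"bread"}, {"milk"}, D))                             # 33.33...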


Algorithm 4

INPUT: the source database D; the size |D| of the database; a set L of large itemsets; the set Lh of large itemsets to hide; the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh
{
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp)
    {
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
7.      if (t ∈ TX) delete(t, TX)
8.    delete(a, t, D)
    }
}
End

Algorithm 5

INPUT: the source database D; the size |D| of the database; a set L of large itemsets; the set Lh of large itemsets to hide; the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do
{
2.  i = 0
3.  j = 0
4.  <TZ> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp)
    {
6.    a = <Z>[i]
7.    t = <TZ>[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS
9.      if (t ∈ TX) then delete(t, TX)
10.   delete(a, t, D)
11.   i = (i+1) modulo size(Z)
12.   j = j+1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or decreasing its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's with "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D; a set L of large itemsets; the set Lh of large itemsets to hide; MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh
{
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST - SM)
    {
4.    Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem here, we note that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider a health database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient. In


Algorithm 7

INPUT: the source database D; a set Rh of rules to hide; MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM)
    {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a "?" mark in place of j in t
7.    Recompute minsup(r)
8.    Recompute minconf(r)
9.    Recompute the minconf of the other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D; a set Rh of rules to hide; MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1.  T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T'_lr count the number of items of lr in t
3.  sort the transactions in T'_lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT - SM)
    {
4.    Choose the first transaction t ∈ T'_lr
5.    Place a "?" mark in t for the items in lr that are not supported by t
6.    Recompute maxsup(lr)
7.    Recompute minconf(r)
8.    Recompute the minconf of the other affected rules
9.    Remove t from T'_lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, knowing nothing about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms,


and more in detail in the family of Perturbation Algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that the ASSET schema related to a target database is available, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified;

2. The items involved in s are identified;

3. By inspecting the ASSET schema, the DQ damage is computed by simulating the modification of these items;

4. An item ranking is built considering the results of the previous operation;

5. The item with the lowest impact is selected;

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database ASSET. We call this type of item Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because in the ASSET description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the ASSET inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the ASSET inspection and we use one of these Zero Impact Items in order to hide the target rule.
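In code, this improved selection step can be sketched as follows (ours; the asset object with its dmgs, attribute_names() and simulate_damage() members is hypothetical, standing in for an implementation of the ASSET schema):

def select_item(rule_items, asset):
    # Zero Impact Items: items of the rule not contained in any DMG of the ASSET
    zero_impact = [i for i in rule_items
                   if not any(i in dmg.attribute_names() for dmg in asset.dmgs)]
    if zero_impact:
        return zero_impact[0]   # modifying it causes no DQ damage at all
    # otherwise simulate the modification of each item and rank by DQ damage
    ranked = sorted(rule_items, key=lambda i: asset.simulate_damage(i))
    return ranked[0]            # the item with the lowest DQ impact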


3.7 Data Quality Distortion Based Algorithm

In Distortion Based Algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming we work with categorical databases like the classical Market Basket database (a database in which each transaction represents the products acquired by a customer), the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As can be seen, the first operation required is the identification of the rule to hide. In order to do this it is possible to use a well known algorithm, named the APRIORI algorithm [5], that, given a database, extracts all the contained rules that have a confidence over a threshold level (this threshold represents the confidence over which a rule can be identified as relevant and sure). Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the ASSET, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden;

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect;



Figure 3.6: Flow chart of the DQDB algorithm. A sensitive rule is identified and the set of zero-impact items is computed; if the set is not empty, an item is selected from it, otherwise each item of the rule is inspected on the ASSET, its DQ impact (Accuracy-Consistency) is evaluated, the items are ranked by quality impact and the best one is selected. Then transactions fully supporting the sensitive rule are selected and the item value is converted from 1 to 0 until the rule is hidden.


3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea of its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N · O(log(M)), where N is the size of the rule and M is the number of different items contained in the ASSET;


INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the (possibly empty) set Zitem of zero impact items discovered

1   Begin
2     Zitem = ∅
3     Rsup = Rh
4     Isup = list of the items contained in Rsup
5     While (Isup ≠ ∅)
6     {
7       res = 0
8       Select item from Isup
9       res = Search(Asset, item)
10      if (res == 0) then Zitem = Zitem + item
11      Isup = Isup - item
12    }
13    return(Zitem)
14  End

Figure 3.7: Zero Impact Algorithm

INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Confp = Confa
5     Nstep_if_ant = 0
6     Nstep_if_post = 0
7     While (Confa > Th) do
8     {
9       Nstep_if_ant++
10      Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
11    }
12    While (Confp > Th) do
13    {
14      Nstep_if_post++
15      Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
16    }
17    For each item in Rh do
18    {
19      if (item ∈ ant(Rh)) then N = Nstep_if_ant
20      else N = Nstep_if_post
21      For each AIS ∈ Asset do
22      {
23        node = recover_item(AIS, item)
24        Accur_Cost = node.accuracy_weight * N
25        Constr_Cost = Constr_surf(N, node)
26        item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
27      }
28    }
29    sort_by_impact(Items)
30    Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the ASSET. This operation is repeated for each item in the rule Rh and then takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| · O(log(|itemset|));

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

1   Begin
2     Zitem = DQDB_Zero_Impact(Asset, Rh)
3     if (Zitem ≠ ∅) then item = random_sel(Zitem)
4     else
5     {
6       Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7       item = best(Impact_set)
8     }
9     Tr = {t ∈ D | t fully supports Rh}
10    While (Rh.conf > Threshold) do
11      sanitize(Tr, item)
12  End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how the impact on the data quality is computed. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the number-of-steps evaluation (which is of course different) and to the data quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Item Impact Rank Algorithm.

Figure 3.10: Flow chart of the BBDB algorithm. The structure is the same as in Figure 3.6, but the DQ impact is evaluated on the pair (Completeness, Consistency) and the value of the selected item is converted to "?" instead of 0.

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules;

2. We apply our algorithm (DQDB or BBDQ);

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh);


INPUT: the ASSET schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1   Begin
2     Qitem = ∅
3     Confa = Sup(Rh) / Sup(ant(Rh))
4     Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5     Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6     For each item in Rh do
7     {
8       if (item ∈ ant(Rh)) then N = Nstep_ant
9       else N = Nstep_post
10      For each AIS ∈ Asset do
11      {
12        node = recover_item(AIS, item)
13        Complet_Cost = node.Complet_Weight * N
14        Constr_Cost = Constr_surf(N, node)
15        item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16      }
17    }
18    sort_by_impact(Items)
19    Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation for this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp, having for example some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that in some sanitizations items are modified that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows (a compact sketch is given after the list):

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules);

2. Given the sensitive set, we identify the Global Set of Zero Impact Items;

3. We order the Zero Impact Set by considering the previously counted occurrences;

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules;

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden;

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed;

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen;

8. The process continues until all the rules are hidden.
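The following compact sketch (ours; the helpers asset.contains(), asset.simulate_damage() and sanitize_until_hidden() are hypothetical) summarizes the burst strategy:

def multi_rule_hide(sensitive_rules, asset, db):
    # 1) count, for every item, the sensitive rules in which it occurs
    occurrences = {}
    for r in sensitive_rules:
        for item in r.items:
            occurrences.setdefault(item, []).append(r)

    remaining = set(sensitive_rules)

    def hide_with(item):
        for r in [r for r in occurrences[item] if r in remaining]:
            sanitize_until_hidden(db, r, item)  # blocking or distortion
            remaining.discard(r)

    # 2) zero-impact items first, those supporting the most rules on top
    zero = sorted((i for i in occurrences if not asset.contains(i)),
                  key=lambda i: len(occurrences[i]), reverse=True)
    for item in zero:
        hide_with(item)

    # 3) fall back on the remaining items: lowest DQ impact first,
    #    ties broken by the number of supported rules
    others = sorted((i for i in occurrences if asset.contains(i)),
                    key=lambda i: (asset.simulate_damage(i),
                                   -len(occurrences[i])))
    for item in others:
        if not remaining:
            break
        hide_with(item)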

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase;

• Multi Zero Impact Hiding;

• Multi Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1   Begin
2     Rsup = Rs
3     while (Rsup ≠ ∅) do
4     {
5       select r from Rsup
6       for (i = 0; i <= transaction_max_size; i++) do
7         if (item[i] ∈ r) then
8         {
9           item[i].count++
10          item[i].list = item[i].list ∪ {r}
11        }
12      Rsup = Rsup - r
13    }
14    sort(item)
15    Return(item)
    End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

1   Begin
2     Rsup = Rs
3     Isup = list of all the items
4     Zitem = ∅
5     while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6     {
7       select r from Rsup
8       for each (item ∈ r) do
9       {
10        if (item ∈ Isup) then
11          if (item ∉ Asset) then Zitem = Zitem + item
12        Isup = Isup - item
13      }
14      Rsup = Rsup - r
15    }
16    return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm

This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check which items of the item list it contains. For each such item, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive Rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact items for all the sensitive rules.
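To make the two pre-hiding steps concrete, what follows is a minimal Python sketch of ours (not part of the original algorithms): rules are modelled as sets of items, and the Asset is reduced to the plain set of items appearing in some DMG. The names sensitive_rules and asset_items are illustrative.

    from collections import defaultdict

    def classify_occurrences(sensitive_rules):
        # Analogue of Figure 3.12: count, for every item, how many
        # sensitive rules contain it, and remember which ones.
        count = defaultdict(int)
        supported = defaultdict(list)
        for rule_id, items in sensitive_rules.items():
            for item in items:
                count[item] += 1
                supported[item].append(rule_id)
        # Items supporting the most rules come first.
        ranking = sorted(count, key=count.get, reverse=True)
        return ranking, supported

    def zero_impact_items(sensitive_rules, asset_items):
        # Analogue of Figure 3.13: an item is zero-impact if it appears
        # in some sensitive rule but in no DMG of the Asset.
        used = set().union(*sensitive_rules.values())
        return used - asset_items

    # Toy usage: item 'a' supports two rules and is outside the Asset.
    rules = {"r1": {"a", "b"}, "r2": {"a", "c"}}
    print(classify_occurrences(rules)[0][0])     # 'a'
    print(zero_impact_items(rules, {"b", "c"}))  # {'a'}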

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule ∈ sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.

INPUT: the sensitive Rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially Sanitized Database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − {t}
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       forall r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
14       {
15         Rss = Rss − {r}
16         Rs = Rs − {r}
17       }
       }
     }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive Rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive Rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated to every item.

3. The items are sorted by Max Impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the Izm set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive Rules set Rs to be hidden, the Thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.rule.cost = item.rule.cost + ComputeCost(node, N)
12       }
13       item.maxcost = maxcost(item)
14     }
     }
15   Sort Iml by max cost and rules supported
16   While (Rs ≠ ∅) do
17   {
18     Rss = {r ∈ Rs | Iml[1] ∈ r}
19     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
20     while (Tss ≠ ∅) and (Rss ≠ ∅) do
21     {
22       select a transaction t from Tss
23       Tss = Tss − {t}
24       Sanitize(t, Iml[1])
25       Recompute_Sup_Conf(Rss)
26       forall r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
27       {
28         Rss = Rss − {r}
29         Rs = Rs − {r}
30       }
31     }
32     Iml = Iml − Iml[1]
33   }
34 End

Figure 3.15: Low Data Quality Impact Algorithm

INPUT: the sensitive Rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a Sanitized Database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
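As a companion to Figure 3.16, a compact sketch of the overall driver in the same Python style could look as follows; it reuses the two helpers sketched in the Pre-Hiding section, while mr_zero_impact_hiding and mr_low_impact_hiding are assumed stand-ins for the phases of Figures 3.14 and 3.15 that remove from rs every rule they manage to hide.

    def sanitize(db, rs, asset_items, thresholds):
        # Pre-hiding phase: occurrence ranking and zero-impact set.
        ranking, supported = classify_occurrences(rs)
        zitems = zero_impact_items(rs, asset_items)
        if zitems:                 # hide what we can at no DQ cost
            mr_zero_impact_hiding(db, rs, ranking, zitems, thresholds)
        if rs:                     # some rules had no zero-impact item
            mr_low_impact_hiding(db, rs, ranking, thresholds)
        return db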

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem; in fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM algorithm.


Bibliography

[1] N Adam and J Worthmann, Security-control methods for statistical databases: a comparative study, ACM Comput Surv, Volume 21(4), pp 515-556, year 1989, ACM Press

[2] D Agrawal and C C Aggarwal, On the Design and Quantification of Privacy Preserving Data Mining Algorithms, In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp 247-255, year 2001, ACM Press

[3] R Agrawal, T Imielinski and A Swami, Mining Association Rules between Sets of Items in Large Databases, Proceedings of ACM SIGMOD, pp 207-216, May 1993, ACM Press

[4] R Agrawal and R Srikant, Privacy Preserving Data Mining, In Proceedings of the ACM SIGMOD Conference on Management of Data, pp 439-450, year 2000, ACM Press

[5] R Agrawal and R Srikant, Fast algorithms for mining association rules, In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, Morgan Kaufmann

[6] G M Amdahl, Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities, AFIPS Conference Proceedings (April 1967), pp 483-485, Morgan Kaufmann Publishers Inc

[7] M J Atallah, E Bertino, A K Elmagarmid, M Ibrahim and V S Verykios, Disclosure Limitation of Sensitive Rules, In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp 45-52, year 1999, IEEE Computer Society

[8] D P Ballou and H L Pazer, Modelling Data and Process Quality in Multi Input Multi Output Information Systems, Management Science, Vol 31, Issue 2, pp 150-162 (1985)

[9] Y Bar-Hillel, An examination of information theory, Philosophy of Science, volume 22, pp 86-105, year 1955


[10] E Bertino, I Nai Fovino and L Parasiliti Provenza, A Framework for Evaluating Privacy Preserving Data Mining Algorithms, Data Mining and Knowledge Discovery Journal, year 2005, Kluwer

[11] E Bertino and I Nai Fovino, Information Driven Evaluation of Data Hiding Algorithms, 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005, Springer-Verlag

[12] N M Blachman, The amount of information that y gives about X, IEEE Trans Inform Theory, vol IT-14, pp 27-31, Jan 1968, IEEE Press

[13] L Breiman, J Friedman, R Olshen and C Stone, Classification and Regression Trees, Wadsworth International Group, year 1984

[14] S Brin, R Motwani, J D Ullman and S Tsur, Dynamic itemset counting and implication rules for market basket data, In Proc of the ACM SIGMOD International Conference on Management of Data, year 1997, ACM Press

[15] L Chang and I S Moskowitz, Parsimonious downgrading and decision trees applied to the inference problem, In Proceedings of the 1998 New Security Paradigms Workshop, pp 82-89, year 1998, ACM Press

[16] P Cheeseman and J Stutz, Bayesian Classification (AutoClass): Theory and Results, Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, year 1996

[17] M S Chen, J Han and P S Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, vol 8 (6), pp 866-883, year 1996, IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu, Auditing and inference control in statistical databases, IEEE Trans Softw Eng, SE-8, 6 (Apr), pp 574-582, year 1982, IEEE Press

[19] F Y Chin and G Ozsoyoglu, Statistical database design, ACM Trans Database Syst, 6, 1 (Mar), pp 113-139, year 1981, ACM Press

[20] L H Cox, Suppression methodology and statistical disclosure control, J Am Stat Assoc, 75, 370 (June), pp 377-385, year 1980

[21] E Dasseni, V S Verykios, A K Elmagarmid and E Bertino, Hiding Association Rules by using Confidence and Support, in Proceedings of the 4th Information Hiding Workshop, pp 369-383, year 2001, Springer-Verlag

[22] D Defays, An efficient algorithm for a complete link method, The Computer Journal, 20, pp 364-366, 1977

[23] D E Denning and J Schlorer, Inference control for statistical databases, Computer, 16 (7), pp 69-82, year 1983 (July), IEEE Press

[24] D Denning, Secure statistical databases with random sample queries, ACM TODS, 5, 3, pp 291-315, year 1980

[25] D E Denning, Cryptography and Data Security, Addison-Wesley, Reading, Mass, 1982

[26] V Dhar, Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems, Information Systems, Volume 23, Number 7, pp 423-437(15), year 1998

[27] J Domingo-Ferrer and V Torra, A Quantitative Comparison of Disclosure Control Methods for Microdata, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp 113-134, P Doyle, J Lane, J Theeuwes, L Zayatz ed, North-Holland, year 2002

[28] P Domingos and M Pazzani, Beyond independence: Conditions for the optimality of the simple Bayesian classifier, Proceedings of the Thirteenth International Conference on Machine Learning, pp 105-112, San Francisco, CA, year 1996, Morgan Kaufmann

[29] P Drucker, Beyond the Information Revolution, The Atlantic Monthly, 1999

[30] P O Duda and P E Hart, Pattern Classification and Scene Analysis, Wiley, year 1973, New York

[31] G T Duncan, S A Keller-McNulty and S L Stokes, Disclosure risks vs data utility: The R-U confidentiality map, Tech Rep No 121, National Institute of Statistical Sciences, 2001

[32] C Dwork and K Nissim, Privacy preserving data mining in vertically partitioned databases, In Crypto 2004, Vol 3152, pp 528-544

[33] D L Eager, J Zahorjan and E D Lazowska, Speedup Versus Efficiency in Parallel Systems, IEEE Trans on Computers, C-38, 3 (March 1989), pp 408-423, IEEE Press

[34] L Ertoz, M Steinbach and V Kumar, Finding clusters of different sizes, shapes and densities in noisy high dimensional data, In Proceedings of the SIAM International Conference on Data Mining, year 2003

[35] M Ester, H P Kriegel, J Sander and X Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proceedings of the 2nd ACM SIGKDD, pp 226-231, Portland, Oregon, year 1996, AAAI Press

[36] A Evfimievski, Randomization in Privacy Preserving Data Mining, SIGKDD Explor Newsl, vol 4, number 2, year 2002, pp 43-48, ACM Press

[37] A Evfimievski, R Srikant, R Agrawal and J Gehrke, Privacy Preserving Mining of Association Rules, In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, year 2002, Elsevier Ltd

[38] S E Fahlman and C Lebiere, The cascade-correlation learning architecture, Advances in Neural Information Processing Systems 2, pp 524-532, Morgan Kaufmann, year 1990

[39] R P Feynman, R B Leighton and M Sands, The Feynman Lectures on Physics, v I, Reading, Massachusetts, Addison-Wesley Publishing Company, year 1963

[40] S Fortune and J Wyllie, Parallelism in Random Access Machines, Proc Tenth ACM Symposium on Theory of Computing (1978), pp 114-118, ACM Press

[41] W Frawley, G Piatetsky-Shapiro and C Matheus, Knowledge Discovery in Databases: An Overview, AI Magazine, pp 213-228, year 1992

[42] S P Ghosh, An application of statistical databases in manufacturing testing, IEEE Trans Software Eng, 1985, SE-11, 7, pp 591-596, IEEE Press

[43] S P Ghosh, An application of statistical databases in manufacturing testing, In Proceedings of IEEE COMPDEC Conference, pp 96-103, year 1984, IEEE Press

[44] S Guha, R Rastogi and K Shim, CURE: An efficient clustering algorithm for large databases, In Proceedings of the ACM SIGMOD Conference, pp 73-84, Seattle, WA, 1998, ACM Press

[45] S Guha, R Rastogi and K Shim, ROCK: A robust clustering algorithm for categorical attributes, In Proceedings of the 15th ICDE, pp 512-521, Sydney, Australia, year 1999, IEEE Computer Society

[46] J Han and M Kamber, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor, Morgan Kaufmann Publishers, August 2000

[47] J Han, J Pei and Y Yin, Mining frequent patterns without candidate generation, In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000, ACM Press

[48] M A Hanson and R L Brekke, Workload management expert system - combining neural networks and rule-based programming in an operational application, In Proceedings Instrument Society of America, pp 1721-1726, year 1988

[49] J Hartigan and M Wong, Algorithm AS136: A k-means clustering algorithm, Applied Statistics, 28, pp 100-108, year 1979

[50] A Hinneburg and D Keim, An efficient approach to clustering large multimedia databases with noise, In Proceedings of the 4th ACM SIGKDD, pp 58-65, New York, year 1998, AAAI Press

[51] T Hsu, C Liau and D Wang, A Logical Model for Privacy Protection, Lecture Notes in Computer Science, Volume 2200, Jan 2001, pp 110-124, Springer-Verlag

[52] IBM, Synthetic Data Generator, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton, Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp 24-31, year 2002, IEEE Educational Activities Department

[54] G Karypis, E Han and V Kumar, CHAMELEON: A hierarchical clustering algorithm using dynamic modeling, COMPUTER, 32, pp 68-75, year 1999

[55] L Kaufman and P Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, year 1990

[56] W Kent, Data and reality, North Holland, New York, year 1978

[57] S L Lauritzen, The EM algorithm for graphical association models with missing data, Computational Statistics and Data Analysis, 19 (2), pp 191-201, year 1995, Elsevier Science Publishers B V

[58] W Lee and S Stolfo, Data Mining Approaches for Intrusion Detection, In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998

[59] A V Levitin and T C Redman, Data as resource: properties, implications and prescriptions, Sloan Management Review, Cambridge, Vol 40, Issue 1, pp 89-101, year 1998

[60] Y Lindell and B Pinkas, Privacy Preserving Data Mining, Journal of Cryptology, vol 15, pp 177-206, year 2002, Springer Verlag

[61] R M Losee, A Discipline Independent Definition of Information, Journal of the American Society for Information Science, 48 (3), pp 254-269, year 1997

[62] M Masera, I Nai Fovino and R Sgnaolin, A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure, 29th ESReDA Seminar "Systems Analysis for a More Secure World", year 2005

[63] G McLachlan and K Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, year 1988

[64] M Mehta, J Rissanen and R Agrawal, MDL-based decision tree pruning, In Proc of KDD, year 1995, AAAI Press

[65] G L Miller, Resonance Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology, PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987

[66] I S Moskowitz and L Chang, A decision theoretical based system for information downgrading, In Proceedings of the 5th Joint Conference on Information Sciences, year 2000, ACM Press

[67] S R M Oliveira and O R Zaiane, Toward Standardization in Privacy Preserving Data Mining, ACM SIGKDD 3rd Workshop on Data Mining Standards, pp 7-17, year 2004, ACM Press

[68] S R M Oliveira and O R Zaiane, Privacy Preserving Frequent Itemset Mining, Proceedings of the IEEE international conference on Privacy, security and data mining, pp 43-54, year 2002, Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane, Privacy Preserving Clustering by Data Transformation, In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp 304-318, year 2003

[70] K Orr, Data Quality and System Theory, Comm of the ACM, Vol 41, Issue 2, pp 66-71, Feb 1998, ACM Press

[71] M A Palley and J S Simonoff, The use of regression methodology for compromise of confidential information in statistical databases, ACM Trans Database Syst, 12, 4 (Dec), pp 593-608, year 1987

[72] J S Park, M S Chen and P S Yu, An Effective Hash Based Algorithm for Mining Association Rules, Proceedings of ACM SIGMOD, pp 175-186, May 1995, ACM Press

[73] Z Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991

[74] G Piatetsky-Shapiro, Discovery, analysis and presentation of strong rules, Knowledge Discovery in Databases, pp 229-238, AAAI/MIT Press, year 1991

[75] A D Pratt, The Information of the Image, Ablex, Norwood, NJ, 1982

[76] J R Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, year 1993

[77] J R Quinlan, Induction of decision trees, Machine Learning, vol 1, pp 81-106, year 1986, Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Data Mining and Knowledge Discovery, vol 4, n 4, pp 315-344, year 2000

[79] R Rastogi and K Shim, Mining Optimized Association Rules with Categorical and Numeric Attributes, Proc of International Conference on Data Engineering, pp 503-512, year 1998

[80] H L Resnikoff, The Illusion of Reality, Springer-Verlag, New York, 1989

[81] S J Rizvi and J R Haritsa, Maintaining Data Privacy in Association Rule Mining, In Proceedings of the 28th International Conference on Very Large Databases, year 2003, Morgan Kaufmann

[82] S J Russell, J Binder, D Koller and K Kanazawa, Local learning in probabilistic networks with hidden variables, In International Joint Conference on Artificial Intelligence, pp 1146-1152, year 1995, Morgan Kaufmann

[83] D E Rumelhart, G E Hinton and R J Williams, Learning internal representations by error propagation, Parallel distributed processing: Explorations in the microstructure of cognition, Vol 1: Foundations, pp 318-362, Cambridge, MA, MIT Press, year 1986

[84] G Sande, Automated cell suppression to preserve confidentiality of business statistics, In Proceedings of the 2nd International Workshop on Statistical Database Management, pp 346-353, year 1983

[85] A Savasere, E Omiecinski and S Navathe, An efficient algorithm for mining association rules in large databases, In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995, Morgan Kaufmann

[86] J Schlorer, Information loss in partitioned statistical databases, Comput J, 26, 3, pp 218-223, year 1983, British Computer Society

[87] C E Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, vol 27 (July and October), 1948, pp 379-423 and pp 623-656

[88] C E Shannon and W Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, Ill, 1949

[89] A Shoshani, Statistical databases: characteristics, problems and some solutions, Proceedings of the Conference on Very Large Databases (VLDB), pp 208-222, year 1982, Morgan Kaufmann Publishers Inc

[90] R Sibson, SLINK: An optimally efficient algorithm for the single link cluster method, Computer Journal, 16, pp 30-34, year 1973

[91] P Smyth and R M Goodman, An information theoretic Approach to Rule Induction from databases, IEEE Transactions On Knowledge And Data Engineering, vol 3, n 4, August 1992, pp 301-316, IEEE Press

[92] L Sweeney, Achieving k-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp 571-588, year 2002, World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal, Mining Generalized Association Rules, Proceedings of the 21st International Conference on Very Large Data Bases, pp 407-419, September 1995, Morgan Kaufmann

[94] G K Tayi and D P Ballou, Examining Data Quality, Comm of the ACM, Vol 41, Issue 2, pp 54-58, year 1998, ACM Press

[95] M Trottini, A Decision-Theoretic Approach to data Disclosure Problems, Research in Official Statistics, vol 4, pp 7-22, year 2001

[96] M Trottini, Decision models for data disclosure limitation, Carnegie Mellon University, available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf, year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University, Codmine, IST project, 2002-2003

[98] J Vaidya and C Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 639-644, year 2002, ACM Press

[99] V S Verykios, E Bertino, I Nai Fovino, L Parasiliti, Y Saygin and Y Theodoridis, State-of-the-art in Privacy Preserving Data Mining, SIGMOD Record, 33(1), pp 50-57, year 2004, ACM Press

[100] V S Verykios, A K Elmagarmid, E Bertino, Y Saygin and E Dasseni, Association Rule Hiding, IEEE Transactions on Knowledge and Data Engineering, year 2003, IEEE Educational Activities Department

[101] C Wallace and D Dowe, Intrinsic classification by MML: The Snob program, In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp 37-44, UNE, World Scientific Publishing Co, Armidale, Australia, 1994

[102] G J Walters, Philosophical Dimensions of Privacy: An Anthology, Cambridge University Press, year 1984

[103] G J Walters, Human Rights in an Information Age: A Philosophical Analysis, chapter 5, University of Toronto Press, year 2001

[104] Y Wand and R Y Wang, Anchoring Data Quality Dimensions in Ontological Foundations, Comm of the ACM, Vol 39, Issue 11, pp 86-95, Nov 1996, ACM Press

[105] R Y Wang and D M Strong, Beyond Accuracy: what Data Quality Means to Data Consumers, Journal of Management Information Systems, Vol 12, Issue 4, pp 5-34, year 1996

[106] L Willenborg and T De Waal, Elements of statistical disclosure control, Lecture Notes in Statistics, Vol 155, Springer Verlag, New York

[107] N Ye and X Li, A Scalable Clustering Technique for Intrusion Signature Recognition, 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, year 2001, IEEE Press

[108] M J Zaki, S Parthasarathy, M Ogihara and W Li, New algorithms for fast discovery of association rules, In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997, AAAI Press

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (eg association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations, while preserving the data quality of the underlying database.


[Figure 2.1 shows an example DMG: Data Attribute and Constraint nodes connected through an Operator node, with Accuracy, Completeness and Constraint Weight labels attached.]

Figure 2.1: Example of DMG

energy stations, power grid segments, etc) [62]. In such a context there exist a lot of problems related to system security and to privacy preservation.

More in detail, we take as example a very current privacy problem caused by the new generation of home electricity meters. In fact, this new type of meter is able to collect a large amount of information related to home energy consumption. Moreover, these meters are able to send the collected information to a central remote database. The information stored in such a database can be easily classified as sensitive.

For example, the electricity meters can register the energy consumption per hour and per day in a target house. From the analysis of this information it is easy to retrieve some behavior of the people living in this house (ie from 9 am to 1 pm the house is empty, from 1 pm to 2.30 pm someone is at home, from 8 pm the whole family is at home, etc). This type of information is obviously sensitive and must be protected. On the other hand, information like the maximum energy consumption per family, the average consumption per day, the subnetwork of pertinence and so on has a high relevance for the analysts who want to correctly analyze the power grid, to calculate for example the amount of power needed in order to guarantee an acceptable service over a certain energy subnetwork.

This is a good example of a situation in which the application of privacy preserving techniques without knowledge about the meaning and the relevance

[Figure 2.2 shows an example ASSET composed of several AIS, labelled with the dimensions Accuracy, Completeness and Consistency.]

Figure 2.2: Example ASSET. To every AIS contained in the Asset the dimensions of Accuracy, Completeness and Consistency are associated. Moreover, to every AIS the proper DMG is associated. To every attribute and to every constraint contained in the DMG the three local DQ parameters are associated.

of the data contained in the database could cause a relevant damage (ie a wrong power consumption estimation and consequently a blackout).

An IQM Example

The electric meters database contains several pieces of information related to energy consumption, average electric frequency, fault events, etc. It is not our purpose to describe here the whole database. In order to give an example of an IQM schema, we describe two tables of this database and two DMGs of a possible IQM schema.

In our example we consider the tables related to the localization information of the electric meter and the information related to the per-day consumption registered by a target electric meter.

More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Local Code: it is the code of the city/region in which the meter is.

• City: it contains the name of the city in which the home hosting the meter is.

• Network: it is the code of the electric network.

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter.

• Average Energy Consumption: it contains the average energy used by the target home during a day.

• Date: it contains the date of registration.

• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour.

• Max Energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour.

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive pieces of information related to the behavior of the people living in a target house controlled by an electric meter. For this reason it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of data can be extremely dangerous. In figure 2.3 two DMG schemas associated to the data contained in the previously described tables are shown. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As it is possible to see in figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple is sanitized). The weight of this constraint is then high (near to 1).

[Figure 2.3 shows the two DMG schemas. AIS Loc contains the attributes Local Code (constraint: value 1-1000), City, Network Number and Subnetwork code (constraint: value 0-50), with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". AIS Cons contains the attributes Energy Meter ID, Date, Average energy consumption, the hourly consumption vector (En 1 am ... En 12 pm), the hourly maxima vector (Max 1 am ... Max 12 pm) and Max energy consumption, with the constraints "average energy used equal to" and "max equal to".]

Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists even over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to think of a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, while avoiding the exact localization of the electric meter (ie "I know that it is in a certain city, but I do not know the quarter in which it is").

AIS Cons

Figure 2.3 also shows a DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy consumption per hour vector and the average energy consumption per day. In order to do this, a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc) while keeping the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation with the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to make the database consistent is to maintain at least the individuality of the electric meters.
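To illustrate, a hedged Python sketch of how the constraints just described could be checked on a single (possibly sanitized) tuple follows; the field names are our own shorthand for the attributes listed above, and a small tolerance eps is assumed for the average test.

    def consistent(row, eps=1e-6):
        hourly = row["energy_per_hour"]          # 24 average values
        max_hourly = row["max_energy_per_hour"]  # 24 maximum values
        # Complex constraint: hourly values may be permuted or modified,
        # but their mean must stay equal to the stored daily average.
        if abs(sum(hourly) / len(hourly) - row["avg_consumption"]) > eps:
            return False
        # Complex constraint: the daily maximum must equal the largest
        # hourly maximum.
        if max(max_hourly) != row["max_consumption"]:
            return False
        # Simple constraints: each hourly maximum cannot be smaller
        # than the corresponding hourly average.
        return all(m >= a for m, a in zip(max_hourly, hourly))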

By the use of these two schemas we described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms realized in order to use these additional schemas to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We choose to study a new solution for this family of algorithms for the following motivations:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason, a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categoric databases. A categoric database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a big range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained in the association rule PPDM family to the classification PPDM family without much effort.

In what follows we present first some algorithms developed in the context of the Codmine project [97] (we remember here that our contribution in the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations


about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple ⟨TID, values_of_items, size⟩, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to both frequent and strong rules. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.

The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.
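To fix the notation with a small example of ours, the following Python lines compute the support and confidence of a rule X ⇒ Y over a set-based view of the transactional database.

    def support(db, itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(itemset <= t for t in db) / len(db)

    def confidence(db, x, y):
        # Conf(X => Y) = Supp(X u Y) / Supp(X).
        return support(db, x | y) / support(db, x)

    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    x, y = {"a"}, {"b"}
    print(support(db, x | y))    # 0.5: 2 of 4 transactions contain {a, b}
    print(confidence(db, x, y))  # 0.666...: 2 of the 3 transactions with 'a'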

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.
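As a tiny numeric illustration of the two levers (our example, not from the original report): suppose the rule is supported by Nr = 10 transactions, its antecedent by Nlr = 12, and MCT = 0.5. Removing the consequent from 5 supporting transactions, or adding the antecedent to 9 further transactions, both push the confidence below the threshold.

    n_r, n_lr, mct = 10, 12, 0.5
    removals = 0   # lever 1: delete the consequent from rule-supporting transactions
    while (n_r - removals) / n_lr >= mct:
        removals += 1
    additions = 0  # lever 2: add the antecedent to transactions partially supporting it
    while n_r / (n_lr + additions) >= mct:
        additions += 1
    print(removals, additions)   # 5 9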

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the ith value in the list of values is 1 if the transaction supports the ith item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case, the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level. The minimum support of an itemset is defined as the

INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1   T′lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T′lr do
2     t.num_items = |I| − |lr ∩ t.values_of_items|
3   Sort(T′lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf)
    {
4     t = T′lr[1]
5     set_all_ones(t.values_of_items, lr)
6     Nlr = Nlr + 1
7     conf(r) = Nr / Nlr
8     Remove t from T′lr
    }
9   Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease
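Under our reading of the pseudocode, Algorithm 1 translates into Python roughly as follows; lr and rr are the antecedent and the consequent, transactions are mutable sets of items, and the helper names are ours. This is a sketch, not the original implementation.

    def hide_by_confidence_decrease(db, lr, rr, min_conf):
        n_r = sum((lr | rr) <= t for t in db)   # transactions fully supporting the rule
        n_lr = sum(lr <= t for t in db)         # transactions supporting the antecedent
        # Candidates: partially support lr and do not fully support rr.
        cands = [t for t in db if t & lr and not lr <= t and not rr <= t]
        # Favor transactions already containing many items of lr.
        cands.sort(key=lambda t: len(t & lr), reverse=True)
        while cands and n_r / n_lr >= min_conf:
            t = cands.pop(0)
            t |= lr          # "set_all_ones": turn on every item of lr in t
            n_lr += 1        # the rule support Nr stays fixed
        return db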

Algorithm 2

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1   Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2     t.num_items = count(t)
3   Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_sup) or (conf(r) < min_conf)
    {
4     t = Tr[1]
5     j = choose_item(rr)
6     set_to_zero(j, t.values_of_items)
7     Nr = Nr − 1
8     supp(r) = Nr / |D|
9     conf(r) = Nr / Nlr
10    Remove t from Tr
    }
11  Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do
{
1   Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do
2     t.num_items = count(t)
3   Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_sup) or (conf(r) < min_conf)
    {
4     t = Tr[1]
5     j = choose_item(r)
6     set_to_zero(j, t.values_of_items)
7     Nr = Nr − 1
8     Recompute Nlr
9     supp(r) = Nr / |D|
10    conf(r) = Nr / Nlr
11    Remove t from Tr
    }
12  Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r, minconf(r) = (minsup(r) × 100) / maxsup(lr) and maxconf(r) = (maxsup(r) × 100) / minsup(lr), where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
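The definitions above can be made concrete with a short sketch of ours, in which each transaction is a dict mapping items to 1, 0 or '?':

    def sup_bounds(db, itemset):
        # minsup: transactions that certainly support the itemset;
        # maxsup: transactions that support or could support it.
        certain = sum(all(t[i] == 1 for i in itemset) for t in db)
        possible = sum(all(t[i] in (1, "?") for i in itemset) for t in db)
        return certain / len(db), possible / len(db)

    def visibility(db, itemset, mst):
        minsup, maxsup = sup_bounds(db, itemset)
        if maxsup < mst:
            return "hidden"
        if minsup >= mst:
            return "visible"
        return "visible with uncertainty"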

Algorithm 4

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1 Sort(Lh) in descending order of size and support
2 Th = {TZ | Z ∈ Lh and ∀t ∈ D, t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh do
{
3   Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_sup)
    {
4     t = pop_from(TZ)
5     a = maximal support item in Z
6     aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS do
7       if (t ∈ TX) then delete(t, TX)
8     delete(a, t, D)
    }
}
End

Algorithm 5

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1 Sort(Lh) in descending order of support and size
Foreach Z in Lh do
{
2   i = 0
3   j = 0
4   ⟨TZ⟩ = sequence of transactions supporting Z
5   ⟨Z⟩ = sequence of items in Z
    Repeat until (supp(Z) < min_sup)
    {
6     a = ⟨Z⟩[i]
7     t = ⟨TZ⟩[j]
8     aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS do
9       if (t ∈ TX) then delete(t, TX)
10    delete(a, t, D)
11    i = (i+1) modulo size(Z)
12    j = j+1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent by unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.

INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of large itemsets in Lh

Begin
1 Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh do
{
2   TZ = {t ∈ D | t supports Z}
3   Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST − SM)
    {
4     Place a "?" for the item with the largest minimum support of Z in the next transaction in TZ
5     Update the supports of the affected itemsets
6     Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to understand here what the problem is, we say that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). This behavior in a critical database could have a relevant impact. For example, as usual, let us consider the Health Database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient. In

Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1   Tr = {t ∈ D | t fully supports r}
2   for each t in Tr count the number of items in t
3   sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST − SM) or (minconf(r) < MCT − SM)
    {
4     Choose the first transaction t ∈ Tr
5     Choose the item j in rr with the highest minsup
6     Place a "?" in the place of j in t
7     Recompute the minsup(r)
8     Recompute the minconf(r)
9     Recompute the minconf of other affected rules
10    Remove t from Tr
    }
11  Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do
{
1   T′lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2   for each t in T′lr count the number of items of lr in t
3   sort the transactions in T′lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT − SM)
    {
4     Choose the first transaction t ∈ T′lr
5     Place a "?" in t for the items in lr that are not supported by t
6     Recompute the maxsup(lr)
7     Recompute the minconf(r)
8     Recompute the minconf of other affected rules
9     Remove t from T′lr
    }
10  Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.
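Both side effects can be quantified by comparing the rule sets mined before and after the sanitization. A minimal sketch of ours, assuming a mine(db) function (eg an Apriori implementation) that returns the set of strong rules:

    def side_effects(original_db, sanitized_db, sensitive_rules, mine):
        before = mine(original_db)
        after = mine(sanitized_db)
        artifactual = after - before                # rules created by the sanitization
        ghost = (before - sensitive_rules) - after  # non-sensitive rules that were lost
        not_hidden = sensitive_rules & after        # sensitive rules that survived
        return artifactual, ghost, not_hidden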

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance. In the context of Association Rule PPDM algorithms,

and more in detail in the family of Perturbation Algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the Global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our Data Hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the Asset schema, the DQ damage is computed, simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications in the case of very complex information could be onerous in terms of performance. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database Asset. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because they were judged, in the Asset description phase, not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection, and we use one of these Zero Impact Items in order to hide the target rule.
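A minimal sketch of the resulting selection step follows, assuming a hypothetical dq_damage(item, asset) helper that inspects the Asset and returns the simulated accuracy/consistency loss caused by modifying an item (the report leaves its internals to the Asset schema):

    def choose_item(rule_items, asset_items, asset, dq_damage):
        # Zero-impact shortcut: items outside every DMG cost nothing.
        zero_impact = [i for i in rule_items if i not in asset_items]
        if zero_impact:
            return zero_impact[0]
        # Otherwise rank by simulated DQ damage and pick the cheapest.
        return min(rule_items, key=lambda i: dq_damage(i, asset))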


3.7 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming to work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well-known APRIORI algorithm [5], which, given a database, extracts all the rules contained that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set, a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item exists in the rule that can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the item selected is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The Sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates

¹ A Market Basket database is a database in which each transaction represents the products acquired by a customer.

² This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Flow chart: identification of a sensitive rule → identification of the set of Zero Impact Items → if the set is not empty, select an item from the Zero Impact set; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Accuracy, Consistency), rank the items by quality impact and select the best item → select a transaction fully supporting the sensitive rule → convert its value from 1 to 0 → repeat until the rule is hidden → END]

Figure 3.6: Flow chart of the DQDB Algorithm

the ordered set of Impact Items and selects the one with the lowest DQ effect.

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. For the sake of completeness, we give an idea of its complexity:

• The search for all Zero Impact Items, assuming the items to be ordered according to a lexicographic order, can be executed in N ∗ O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden.
OUTPUT: the set (possibly empty) Zitem of Zero Impact Items discovered.

Begin
  Zitem = ∅
  Rsup = Rh
  Isup = list of the items contained in Rsup
  While (Isup ≠ ∅) do
    res = 0
    Select item from Isup
    res = Search(Asset, item)
    if (res == 0) then Zitem = Zitem + item
    Isup = Isup − item
  return(Zitem)
End

Figure 3.7: Zero Impact Algorithm
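For illustration, a possible Python rendering of the Zero Impact Algorithm follows; it assumes, consistently with the complexity analysis above, that the Asset items are kept in a lexicographically sorted list, so that each membership test is a binary search.

from bisect import bisect_left

def zero_impact_items(rule_items, asset_items_sorted):
    # Items of the rule that appear in no DMG of the Asset (N * O(lg(M))).
    zitem = []
    for item in rule_items:
        pos = bisect_left(asset_items_sorted, item)
        if pos == len(asset_items_sorted) or asset_items_sorted[pos] != item:
            zitem.append(item)
    return zitem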

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by Quality Impact.

Begin
  Qitem = ∅
  Confa = Sup(Rh) / Sup(ant(Rh))
  Confb = Confa
  Nstep_if_ant = 0
  Nstep_if_post = 0
  While (Confa > Th) do
    Nstep_if_ant++
    Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
  While (Confb > Th) do
    Nstep_if_post++
    Confb = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
  For each item in Rh do
    if (item ∈ ant(Rh)) then N = Nstep_if_ant
    else N = Nstep_if_post
    For each AIS ∈ Asset do
      node = recover_item(AIS, item)
      Accur_Cost = node.accuracy_Weight ∗ N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Accur_Cost ∗ AIS.Accur_weight) + (Constr_Cost ∗ AIS.Constr_weight)
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (recall that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| ∗ O(NC). Finally, in order to sort the itemset, we can use the MergeSort algorithm, and then the complexity is |itemset| ∗ O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.
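As an illustration of the two While loops of Figure 3.8, the following Python sketch counts the sanitization steps for both placements of the chosen item. In closed form, for the consequent case the number of steps is the smallest n such that (Sup(Rh) − n)/Sup(ant(Rh)) ≤ Th; the antecedent case is analogous, with the denominator decreasing as well. The sketch assumes integer supports and a rule confidence strictly below 1.

def steps_to_hide(sup_rule, sup_ant, threshold):
    assert sup_rule < sup_ant, "assumes rule confidence strictly below 1"
    # Item in the antecedent: each step removes the transaction from the
    # support of both the rule and its antecedent.
    n_ant, conf = 0, sup_rule / sup_ant
    while conf > threshold:
        n_ant += 1
        conf = (sup_rule - n_ant) / (sup_ant - n_ant)
    # Item in the consequent: each step removes the transaction from the
    # support of the rule only.
    n_post, conf = 0, sup_rule / sup_ant
    while conf > threshold:
        n_post += 1
        conf = (sup_rule - n_post) / sup_ant
    return n_ant, n_post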

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D.
OUTPUT: the sanitized database.

Begin
  Zitem = DQDB_Zero_Impact(Asset, Rh)
  if (Zitem ≠ ∅) then item = random_sel(Zitem)
  else
    Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
    item = best(Impact_set)
  Tr = {t ∈ D | t fully supports Rh}
  While (Rh.Conf > Threshold) do
    sanitize(Tr, item)
End

Figure 3.9: The Distortion Based Sanitization Algorithm
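A compact Python sketch tying the pieces of Figure 3.9 together is reported below. It reuses zero_impact_items from the sketch above; dq_impact and confidence are assumed helpers standing in for the Item Impact Rank computation and for the recomputation of the rule confidence, and the database is modeled, bitmap-style, as a list of item → {0, 1} dictionaries.

def dqdb_sanitize(db, rule, asset, threshold):
    zitem = zero_impact_items(rule.items, asset.sorted_items)
    if zitem:
        item = zitem[0]  # a Zero Impact Item costs nothing in DQ terms
    else:
        # Rank the items of the rule by their estimated DQ effect.
        item = min(rule.items, key=lambda i: dq_impact(asset, rule, i))
    supporting = [t for t in db if all(t.get(i) == 1 for i in rule.items)]
    for t in supporting:
        if confidence(db, rule) <= threshold:
            break
        t[item] = 0  # distortion: flip the supporting value from 1 to 0
    return db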

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way, we introduce an uncertainty related to the question "How to interpret this value?". The final effect is then the same as the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps Evaluation (which is of course different) and the Data Quality evaluation. In fact, in this case it does not make


sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Flow chart: identification of a sensitive rule → identification of the set of Zero Impact Items → if the set is not empty, select an item from the Zero Impact set; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Completeness, Consistency), rank the items by quality impact and select the best item → select a transaction fully supporting the sensitive rule → convert its value to "?" → repeat until the rule is hidden → END]

Figure 3.10: Flow chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
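A short sketch of the two points where the blocking variant differs may help: the sanitization step inserts the uncertainty symbol instead of 0, and with "?" values the support of an itemset becomes an interval, spanning the transactions that certainly support it and those that possibly do. Names and data layout are illustrative, as in the previous sketches.

def bbdb_sanitize_step(transaction, item):
    # Blocking: replace the value with the uncertainty symbol instead of 0.
    transaction[item] = '?'

def min_max_support(db, itemset):
    # With '?' values the support is an interval [minsup, maxsup].
    certain = sum(all(t.get(i) == 1 for i in itemset) for t in db)
    possible = sum(all(t.get(i) in (1, '?') for i in itemset) for t in db)
    return certain / len(db), possible / len(db)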

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by Quality Impact.

Begin
  Qitem = ∅
  Confa = Sup(Rh) / Sup(ant(Rh))
  Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
  Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
  For each item in Rh do
    if (item ∈ ant(Rh)) then N = Nstep_ant
    else N = Nstep_post
    For each AIS ∈ Asset do
      node = recover_item(AIS, item)
      Complet_Cost = node.Complet_Weight ∗ N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost ∗ AIS.Complet_weight) + (Constr_Cost ∗ AIS.Constr_weight)
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that even if, after every sanitization, the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms and is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the Global DQ degradation is due to the fact that in some sanitizations, items are modified that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items.
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences.

Begin
  Rsup = Rs
  while (Rsup ≠ ∅) do
    select r from Rsup
    for (i = 0; i ≤ transaction.maxsize; i++) do
      if (item[i] ∈ r) then
        item[i].count++
        item[i].list = item[i].list ∪ r
    Rsup = Rsup − r
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences Classification Algorithm
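A compact Python counterpart of Figure 3.12 is sketched below. Rules are modeled as objects whose items attribute is the set of items of the rule (an assumption of this sketch); a dictionary-based counter avoids scanning the full item list for every rule, but produces the same ranking.

from collections import defaultdict

def classify_items(sensitive_rules):
    # For every item, count how many sensitive rules it supports and
    # remember which ones (used by the subsequent hiding phases).
    count = defaultdict(int)
    rules_of = defaultdict(list)
    for r in sensitive_rules:
        for item in r.items:
            count[item] += 1
            rules_of[item].append(r)
    ranked = sorted(count, key=count.get, reverse=True)  # most shared first
    return ranked, rules_of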

INPUT: the sensitive rules set Rs; the list of the database items.
OUTPUT: the list of the Zero Impact Items.

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = ∅
  while (Rsup ≠ ∅) and (Isup ≠ ∅) do
    select r from Rsup
    for each (item ∈ r) do
      if (item ∈ Isup) then
        if (item ∉ Asset) then Zitem = Zitem + item
        Isup = Isup − item
    Rsup = Rsup − r
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check if the rule is supported by every item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| ∗ |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule ∈ sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case, the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D.
OUTPUT: a partially sanitized database PS.

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
        Rss = Rss − r
        Rs = Rs − r
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
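For illustration, a Python sketch of the hiding loop of Figure 3.14 follows. Here hidden(r) is an assumed predicate that recomputes the support and confidence of r against the current database and checks them against the minimum thresholds; the sanitization is shown in its distortion form.

def multi_rule_zero_impact_hiding(db, rs, zero_impact_ranked, hidden):
    for item in zero_impact_ranked:  # most shared Zero Impact Items first
        rss = [r for r in rs if item in r.items]
        tss = [t for t in db
               if any(all(t.get(i) == 1 for i in r.items) for r in rss)]
        while tss and rss:
            t = tss.pop()
            t[item] = 0  # or '?' in the blocking variant
            rss = [r for r in rss if not hidden(r)]
            rs = [r for r in rs if not hidden(r)]
    return db, rs  # rs now contains the rules still to be hidden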

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the Data Quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the Izm set until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify if the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.
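To make the black-box nature of the sanitization operation explicit, the two variants can be written as interchangeable functions and passed as a parameter, along the lines of the following sketch (names are illustrative):

def distortion_sanitize(transaction, item):
    transaction[item] = 0    # DQDB: flip the supporting value to 0

def blocking_sanitize(transaction, item):
    transaction[item] = '?'  # BBDB: insert the uncertainty symbol

# A hiding routine can then receive the primitive as an argument, e.g.
# low_impact_hiding(db, rules, sanitize=blocking_sanitize).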

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items.
OUTPUT: the sanitized database SDB.

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
    for each rule ∈ item.list do
      N = compute_step(item, thresholds)
      For each AIS ∈ Asset do
        node = recover_item(AIS, item)
        item.list.rule.cost = item.list.rule.cost + Compute_Cost(node, N)
    item.maxcost = maxcost(item)
  Sort Iml by max cost and number of rules supported
  While (Rs ≠ ∅) do
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
        Rss = Rss − r
        Rs = Rs − r
    Iml = Iml − Iml[1]
End

Figure 3.15: The Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds.
OUTPUT: a sanitized database PS.

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
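Finally, a possible Python wiring of the general algorithm of Figure 3.16, reusing the previous sketches; multi_rule_low_impact_hiding and asset.contains are assumed helpers, and hidden is the same threshold predicate as before.

def general_sanitization(db, rs, asset, thresholds, sanitize, hidden):
    ranked, _ = classify_items(rs)                         # Figure 3.12
    zitems = [i for i in ranked if not asset.contains(i)]  # Figure 3.13
    if zitems:                                             # Figure 3.14
        db, rs = multi_rule_zero_impact_hiding(db, rs, zitems, hidden)
    if rs:                                                 # Figure 3.15
        db = multi_rule_low_impact_hiding(db, rs, asset, thresholds, sanitize)
    return db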

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, Vol. 22, pp. 86-105, 1955.



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification of Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, Jan. 2001, pp. 110-124. Springer-Verlag.

[52] IBM Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations, while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 17: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

15

13

ASSET13

AIS13

AIS13AIS13

AIS13

Accuracy13Completeness13

Consistency13

Figure 22 Example ASSET to every AIS contained in the Asset the dimensionsof Accuracy Completeness and Consistency are associated Moreover to everyAIS the proper DMG is associated To every attribute and to every constraintcontained in the DMG the three local DQ parameters are associated

of the data contained in the database could cause a relevant damage (ie wrongpower consumption estimation and consequently blackout)

An IQM Example

The electric meters database contains several information related to energy con-sumption average electric frequency fault events etc It is not our purpose todescribe here the whole database In order to give an example of IQM schemawe describe two tables of this database and two DMG of a possible IQM schema

In our example we consider the tables related to the localization informationof the electric meter and the information related to the per day consumptionregistered by a target electric meter

16 CHAPTER 2 THE DATA QUALITY SCHEME

More in details the two table we consider have the following attributesLocalization Table

bull Energy Meter ID it contains the ID of the energy meter

bull Local Code it is the code of the cityregion in which the meter is

bull City it contains the name of the city in which the home hosting the meteris

bull Network it is the code of the electric network

bull Subnetwork it is the code of the electric subnetwork

Per Day Data consumption table

bull Energy Meter ID it contains the ID of the energy meter

bull Average Energy Consumption it contains the average energy used bythe target home during a day

bull Date It contains the date of registration

bull Energy consumption per hour it is a vector of attributes containing theaverage energy consumption per hour

bull Max Energy consumption per hour it is a vector of attributes contain-ing the maximum energy consumption per hour

bull Max Energy Consumption it contains the maximum energy consump-tion per day measured for a certain house

As we have already underlined from this information it is possible to guessseveral sensitive information related to the behavior of the people living in atarget house controlled by an electric meter For this reason it is necessary toperform a sanitization process in order to guarantee a minimum level of privacy

However a sanitization without knowledge about the relevance and themeaning of data can be extremely dangerous In figure 23 two DMG schemasassociated to the data contained in the previously described tables are showedIn what follows we give a brief description about the two schema

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table Asit is possible to see in figure 23 there exist some constraints associated with thedifferent items contained in this table For example we require the local code bein a range of values between 1 and 1000 (every different code is meaningless andcannot be interpreted moreover a value out of this range reveals to malicioususer that the target tuple is sanitized) The weight of this constraint is thenhigh (near to 1)

17

Value131-100013

Local Code13

City13

Subnetwork code13

Value 0-5013

They13correspond13to the same13

city13

The subnet is13contained in the13correct network13

Network13Number13

Average13energy13

consumption13

Average energy13used equal to13

Energy Counter13Id13

En131_am13

En132_am13

En133_am13

En1310_pm13

En1311_pm13

En1312_pm13

hellip13

Max energy13consumption13

Energy Counter13Id13

Max equal to13

A13I13S13

L13o13c13A13I13S13 13

C13o13n13s13

ASSET13A13I13S13 A13I13S13

A13 I13S13

DMG13DMG13

Max131_am13

Max132_am13

Max133_am13

Max1310_pm13

Max1311_pm13

Max1312_pm13

hellip13

Date13

Figure 23 Example of two DMG schemas associated to a National ElectricPower Grid database

Moreover due to the relevance of the information about the regional local-ization of the electric meter even the weights associated with the completenessand the accuracy of the local code have an high value

A similar constraint exists even over the subnetwork code With regard tothe completeness and the accuracy of this item a too high weight may compro-mise the ability to hide sensitive information In other words considering thethree main localization items local code network number subnetwork numberit is reasonable to think about a decreasing level of accuracy and completenessIn this way we try to guarantee the preservation of information about the areaof pertinence avoiding the exact localization of the electric meter (ie ldquoI knowthat it is in a certain city but I do not know the quarter in which it isrdquo)

18 CHAPTER 2 THE DATA QUALITY SCHEME

AIS Cons

In figure 23 is showed a DMG (AIS Cons) related to the energy consump-tion table In this case the value of the information stored in this table ismainly related to the maximum energy consumption and the average energyconsumption For this reason their accuracy and completeness weights mustbe high (to magnify their relevance in the sanitization phase) However itis even necessary to preserve the coherence between the values stored in theEnergy consumption per hour vector and the average energy consumption perday In order to do this a complex constraint is used In this way we describethe possibility to permute or to modify the values of energy consumption perhour (in the figure En 1 am En 2 am etc) maintaining however the averageenergy consumption consistent

At the same way it is possible to use another complex constraint in orderto express the same concept for the maximum consumption per day in relationwith the maximum consumption per hour Exist finally a relation between themaximum energy consumption per hour and the correspondent average con-sumption per hour In fact the max consumption per hour cannot be smallerthan the average consumption per hour This concept is represented by a setof simple constraints Finally the weight of accuracy and completeness of theEnergy Meter ID are equal to 1 because a condition needed to make consistentthe database is to maintain at least the individuality of the electric meters

By the use of these two schemas we described some characteristics of theinformation stored in the database related with their meaning and their usageIn the following chapters we show some PPDM algorithms realized in order touse these additional schema in order to perform a softer (from the DQ point ofview) sanitization

Chapter 3

Data Quality BasedAlgorithms

A large number of Privacy Preserving Data Mining algorithms have been pro-posed The goal of these algorithms is to prevent privacy breaches arising fromthe use of a particular Data Mining Technique Therefore algorithms existdesigned to limit the malicious use of Clustering Algorithms of ClassificationAlgorithms of Association Rule algorithms

In this report we focus on Association Rule Algorithms We choose to studya new solution for this family of algorithms for the following motivations

bull Association Rule Data Mining techniques has been investigated for about20 years For this reason exists a large number of DM algorithms andapplication examples exist that we take as a starting point to study howto contrast the malicious use of this technique

bull Association Rule Data Mining can be applied in a very intuitive way tocategoric databases A categoric database can be easily synthesized bythe use of automatic tools allowing us to quickly obtain for our tests a bigrange of different type of databases with different characteristics

bull Association Rules seem to be the most appropriate to be used in combi-nation with the IQM schema

bull Results of classification data mining algorithms can be easily convertedinto association rules This it is in prospective very important because itallows us to transfer the experience in association rule PPDM family tothe classification PPDM family without much effort

In what follows we present before some algorithms developed in the context ofCodmine project [97] (we remember here that our contribution in the CodmineProject was related to the algorithm evaluation and to the testing frameworkand not with the Algorithm development) We present then some considerations

19

20 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

about these algorithms and finally we present our new approach to associationrule PPDM algorithms

31 ConMine Algorithms

We present here a short description of the privacy preserving data mining algo-rithms which have been developed in the context of Codmine project In theassociation rule hiding context two different heuristic-based approaches havebeen proposed one is based on data deletion according to which the originaldatabase is modified by deleting the data values the other is based on datafuzzification thus hiding the sensitive information by inserting unknown valuesto the database Before giving more details concerning the proposed approacheswe briefly recall the problem formulation of association rule hiding

32 Problem Formulation

Let D be a transactional database and I = i1 in be a set of items whichrepresents the set of products that can be sold by the database owner Eachtransaction can be considered as subset of items in I We assume that the itemsin a transaction or an itemset are sorted in lexicographic order Accordingto a bitmap notation each transaction t in the database D is represented asa triple lt TID values of items size gt where TID is the identifier of thetransaction t values of items is a list of values one value for each item in Iassociated with transaction t An item is supported by a transaction t if itsvalue in the values of items is 1 and it is not supported by t if its value inthe values of items is 0 Size is the number of 1 values which appear in thevalues of items that is the number of items supported by the transaction Anassociation rule is an implication of the form X rArr Y between disjoint itemsets X

and Y in I Each rule is assigned both to a support and to a confidence valueThe first one represents the probability to find in the database transactionscontaining all the items in X cup Y whereas the confidence is the probabilityto find transactions containing all the items in X cup Y once we know that theycontain X Note that while the support is a measure of a frequency of a rule theconfidence is a measure of the strength of the relation between the antecedentX and the consequent Y of the rule Association rule mining process consistsof two steps 1) the identification of all the frequent itemsets that is all theitemsets whose supports are bigger than a pre-determined minimum supportthreshold min supp 2) the generation of strong association rules from thefrequent itemsets that is those frequent rules whose confidence values are biggerthan a minimum confidence threshold min conf Along with the confidenceand the support a sensitivity level is assigned only to both frequent and strongrules If a strong and frequent rule is above a certain sensitivity level the hidingprocess should be applied in such a way that either the frequency or the strengthof the rule will be reduced below the min supp and the min conf correspondingly

33 RULE HIDING ALGORITHMS BASED ON DATA DELETION 21

The problem of association rule hiding can be stated as follows given a databaseD a set R of relevant rules that are mined from D and a subset Rh of R wewant to transform D into a database Dprime in such a way that the rules in R canstill be mined except for the rules in Rh

33 Rule hiding algorithms based on data dele-tion

Rule hiding approach based on data deletion operates by deleting the datavalues in order to reduce either the support or the confidence of the sensi-tive rule below the corresponding minimum support or confidence thresholdsMST or MCT which are fixed by the user in such a way that the changesintroduced in the database are minimal The decrease in the support of a ruleX rArr Y is done by decreasing the support of its generating itemset X cup Y that is the support of either the rule antecedent X or the rule consequent Y through transactions that fully support the rule The decrease in the confidence

Conf(X rArr Y ) = Supp(XrArrY )Supp(X) of the rule X rArr Y instead can be accomplished

1) by decreasing the support of the rule consequent Y through transactionsthat support both X and Y 2) by increasing the support of the rule antecedentthrough transactions that do not fully support Y and partially support X Fivedata deletion algorithms have been developed for privacy preserving data min-ing whose pseudocodes are presented in Figures 31 32 and 33 Algorithm1 decreases the confidence of a sensitive rule below the min conf threshold byincreasing the support of the rule antecedent while keeping the rule supportfixed Algorithms 2 reduces the rule support by decreasing the frequency of theconsequent through transactions that support the rule Algorithm 3 operates asthe previous one with the only exception that the choice of the item to removeis done according to a minimum impact criterion Finally Algorithms 4 and 5decrease the frequency of large itemsets from which sensitive rules can be mined

34 Rule hiding algorithms based on data fuzzi-fication

According to the data fuzzification approach a sequence of symbols in the newalphabet of an item 01 is associated with each transaction where one sym-bol is associated with each item in the set of items I As before the ith valuein the list of values is 1 if the transaction supports the ith item the value is 0otherwise The novelty of this approach is the insertion of an uncertainty sym-bol a question mark in a given position of the list of values which means thatthere is no information on whether the transaction supports the correspondingitem In this case the confidence and the support of an association rule maybe not unequivocally determined but they can range between a minimum anda maximum level The minimum support of an itemset is defined as the per-


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
  1. T'_lr = {t ∈ D | t does not fully support rr and partially supports lr}
     foreach transaction t in T'_lr do {
  2.   t.num_items = |I| - |lr ∩ t.values_of_items|
     }
  3. Sort(T'_lr) in descending order of number of items of lr contained
     Repeat until (conf(r) < min_conf) {
  4.   t = T'_lr[1]
  5.   set_all_ones(t.values_of_items, lr)
  6.   N_lr = N_lr + 1
  7.   conf(r) = N_r / N_lr
  8.   Remove t from T'_lr
     }
  9. Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease (lr and rr denote the antecedent and the consequent of the rule r; N_r and N_lr are the numbers of transactions supporting r and lr respectively)
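To make the mechanics of Figure 3.1 concrete, the following hypothetical Python rendering (ours; transactions are modeled as dicts from item to 0/1, and the helper names are our own) raises the antecedent support in partially supporting transactions until the confidence drops below the threshold:

# Illustrative sketch of Algorithm 1 (confidence decrease): set to 1 the
# antecedent items of partially supporting transactions, so that Supp(lr)
# grows and conf(r) = Supp(r)/Supp(lr) shrinks. Names are hypothetical.
def fully_supports(t, items):
    return all(t.get(i, 0) == 1 for i in items)

def partially_supports(t, items):
    return any(t.get(i, 0) == 1 for i in items) and not fully_supports(t, items)

def hide_by_confidence_decrease(D, lr, rr, min_conf):
    rule = lr | rr
    n_r = sum(1 for t in D if fully_supports(t, rule))
    n_lr = sum(1 for t in D if fully_supports(t, lr))
    cands = [t for t in D if partially_supports(t, lr) and not fully_supports(t, rr)]
    cands.sort(key=lambda t: -sum(t.get(i, 0) for i in lr))  # fewest changes first
    for t in cands:
        if n_lr and n_r / n_lr < min_conf:
            break
        for i in lr:
            t[i] = 1          # set_all_ones(t.values_of_items, lr)
        n_lr += 1
    return D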


INPUT: the source database D, the size |D| of the database, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin (Algorithm 2)
Foreach rule r in Rh do {
  1. T_r = {t ∈ D | t fully supports r}
     foreach transaction t in T_r do {
  2.   t.num_items = count(t)
     }
  3. Sort(T_r) in ascending order of transaction size
     Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
  4.   t = T_r[1]
  5.   j = choose_item(rr)
  6.   set_to_zero(j, t.values_of_items)
  7.   N_r = N_r - 1
  8.   supp(r) = N_r / |D|
  9.   conf(r) = N_r / N_lr
  10.  Remove t from T_r
     }
  11. Remove r from Rh
}
End

Begin (Algorithm 3)
Foreach rule r in Rh do {
  1. T_r = {t ∈ D | t fully supports r}
     foreach transaction t in T_r do {
  2.   t.num_items = count(t)
     }
  3. Sort(T_r) in decreasing order of transaction size
     Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
  4.   t = T_r[1]
  5.   j = choose_item(r)
  6.   set_to_zero(j, t.values_of_items)
  7.   N_r = N_r - 1
  8.   Recompute N_lr
  9.   supp(r) = N_r / |D|
  10.  conf(r) = N_r / N_lr
  11.  Remove t from T_r
     }
  12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r:

min_conf(r) = (min_sup(r) * 100) / max_sup(lr)   and   max_conf(r) = (max_sup(r) * 100) / min_sup(lr)

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when max_sup(A) is smaller than MST;

• A is visible with an uncertainty level when min_sup(A) ≤ MST ≤ max_sup(A);

• A is visible if min_sup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
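A minimal sketch of these interval computations (our own illustration; transactions are dicts whose values are 1, 0 or "?" for the uncertainty symbol):

# Illustrative sketch: min/max support of an itemset over fuzzified
# transactions (values 1, 0 or "?"), and visibility w.r.t. a threshold MST.
def min_max_support(itemset, transactions):
    certain = sum(1 for t in transactions
                  if all(t.get(i, 0) == 1 for i in itemset))
    possible = sum(1 for t in transactions
                   if all(t.get(i, 0) in (1, "?") for i in itemset))
    n = len(transactions)
    return 100.0 * certain / n, 100.0 * possible / n   # (min_sup, max_sup) in %

def visibility(itemset, transactions, mst):
    min_sup, max_sup = min_max_support(itemset, transactions)
    if max_sup < mst:
        return "hidden"
    if min_sup <= mst <= max_sup:
        return "visible with uncertainty"
    return "visible"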


INPUT: the source database D, the size |D| of the database, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin (Algorithm 4)
1. Sort(Lh) in descending order of size and support
2. Th = {T_Z | Z ∈ Lh and ∀t ∈ D, t ∈ T_Z ⇒ t supports Z}
Foreach Z in Lh {
  3. Sort(T_Z) in ascending order of transaction size
     Repeat until (supp(Z) < min_supp) {
  4.   t = pop_from(T_Z)
  5.   a = maximal support item in Z
  6.   aS = {X ∈ Lh | a ∈ X}
       Foreach X in aS {
  7.     if (t ∈ T_X) delete(t, T_X)
       }
  8.   delete(a, t, D)
     }
}
End

Begin (Algorithm 5)
1. Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
  2. i = 0
  3. j = 0
  4. <T_Z> = sequence of transactions supporting Z
  5. <Z> = sequence of items in Z
     Repeat until (supp(Z) < min_supp) {
  6.   a = <Z>[i]
  7.   t = <T_Z>[j]
  8.   aS = {X ∈ Lh | a ∈ X}
       Foreach X in aS {
  9.     if (t ∈ T_X) then delete(t, T_X)
       }
  10.  delete(a, t, D)
  11.  i = (i+1) modulo size(Z)
  12.  j = j+1
     }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1. Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
  2. T_Z = {t ∈ D | t supports Z}
  3. Sort the transactions in T_Z in ascending order of transaction size
     Repeat until (min_sup(Z) < MST - SM) {
  4.   Place a "?" mark for the item with the largest minimum support of Z in the next transaction in T_Z
  5.   Update the supports of the affected itemsets
  6.   Update the database D
     }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is it necessary to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to give an idea of the problem here, during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased by the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor that, on the basis of these rules, can choose a wrong therapy for a patient. In


INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin (Algorithm 7)
Foreach r in Rh do {
  1. T_r = {t ∈ D | t fully supports r}
  2. for each t in T_r count the number of items in t
  3. sort the transactions in T_r in ascending order of the number of items supported
     Repeat until (min_sup(r) < MST - SM) or (min_conf(r) < MCT - SM) {
  4.   Choose the first transaction t ∈ T_r
  5.   Choose the item j in rr with the highest min_sup
  6.   Place a "?" mark in the place of j in t
  7.   Recompute min_sup(r)
  8.   Recompute min_conf(r)
  9.   Recompute the min_conf of the other affected rules
  10.  Remove t from T_r
     }
  11. Remove r from Rh
}
End

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin (Algorithm 8)
Foreach r in Rh do {
  1. T'_lr = {t ∈ D | t partially supports lr and t does not fully support rr}
  2. for each t in T'_lr count the number of items of lr in t
  3. sort the transactions in T'_lr in descending order of the number of items of lr supported
     Repeat until (min_conf(r) < MCT - SM) {
  4.   Choose the first transaction t ∈ T'_lr
  5.   Place a "?" mark in t for the items in lr that are not supported by t
  6.   Recompute max_sup(lr)
  7.   Recompute min_conf(r)
  8.   Recompute the min_conf of the other affected rules
  9.   Remove t from T'_lr
     }
  10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the presented algorithms act in the same way on any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance. In the context of Association Rules PPDM algorithms,


and more in detail in the family of perturbation algorithms described in the previous section, we note that the point in which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to be modified in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified;

2. The items involved in s are identified;

3. By inspecting the Asset schema, the DQ damage is computed, simulating the modification of these items;

4. An item ranking is built considering the results of the previous operation;

5. The item with the lowest impact is selected;

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could be onerous in terms of performance in the case of very complex information. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database Asset. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the Asset description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection and we use one of these Zero Impact Items in order to hide the target rule.


3.7 Data Quality Distortion Based Algorithm

In the distortion based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming to work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use a well known algorithm, named the APRIORI algorithm [5], that, given a database, extracts all the rules contained that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item exists contained in the rule that can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we can adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the Apriori algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden;

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect;

¹A Market Basket database is a database in which each transaction represents the products acquired by the customers.

²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


Figure 3.6: Flow chart of the DQDB algorithm. (The flow is: identify a sensitive rule; compute the set of zero-impact items; if the set is not empty, select an item from it, otherwise inspect the Asset for each item in the rule, evaluate the DQ impact in terms of accuracy and consistency, rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item value from 1 to 0, until the rule is hidden.)


3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness we try to give an idea about its complexity:

• The search for all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of Zero Impact Items discovered

1  Begin
2    Zitem = ∅
3    Rsup = Rh
4    Isup = list of the items contained in Rsup
5    While (Isup ≠ ∅) {
6      res = 0
7      Select item from Isup
8      res = Search(Asset, item)
9      if (res == 0) then Zitem = Zitem + item
10     Isup = Isup - item
11   }
12   return(Zitem)
13 End

Figure 3.7: Zero Impact Algorithm
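As an illustration (our own hypothetical representation: the Asset is abstracted as the set of items referenced by at least one DMG), the zero-impact search amounts to a set difference:

# Illustrative sketch of the Zero Impact search: an item of the rule is
# zero-impact when it appears in no DMG of the Asset. The item names
# below are hypothetical.
def zero_impact_items(rule_items, asset_items):
    return set(rule_items) - set(asset_items)

asset_items = {"local_code", "network", "avg_consumption"}
rule_items = {"avg_consumption", "promo_flag"}
print(zero_impact_items(rule_items, asset_items))   # {'promo_flag'}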

INPUT: the Asset schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1  Begin
2    Qitem = ∅
3    Conf_a = Sup(Rh) / Sup(ant(Rh))
4    Conf_p = Conf_a
5    Nstep_if_ant = 0
6    Nstep_if_post = 0
7    While (Conf_a > Th) do {
8      Nstep_if_ant++
9      Conf_a = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
10   }
11   While (Conf_p > Th) do {
12     Nstep_if_post++
13     Conf_p = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
14   }
15   For each item in Rh do {
16     if (item ∈ ant(Rh)) then N = Nstep_if_ant
17     else N = Nstep_if_post
18     For each AIS ∈ Asset do {
19       node = recover_item(AIS, item)
20       Accur_Cost = node.accuracy_Weight * N
21       Constr_Cost = Constr_surf(N, node)
22       item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
23     }
24   }
25   sort_by_impact(Items)
26   Return(Items)
27 End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule)
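The two while loops of Figure 3.8 admit a simple closed form when supports are absolute transaction counts; the following is our own illustration (not part of the original report) of how the step counts can be obtained:

import math

# Illustrative closed-form for the step counts of Figure 3.8 (our sketch).
# Supports are absolute transaction counts, th is the confidence threshold.
def steps_if_consequent(sup_r, sup_ant, th):
    # smallest n with (sup_r - n) / sup_ant < th
    return max(0, math.floor(sup_r - th * sup_ant) + 1)

def steps_if_antecedent(sup_r, sup_ant, th):
    # smallest n with (sup_r - n) / (sup_ant - n) < th: both supports shrink
    n = 0
    while sup_ant - n > 0 and (sup_r - n) / (sup_ant - n) >= th:
        n += 1
    return n

print(steps_if_consequent(40, 50, 0.6))   # 11
print(steps_if_antecedent(40, 50, 0.6))   # 26: confidence falls more slowly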


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity of O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1  Begin
2    Zitem = DQDB_Zero_Impact(Asset, Rh)
3    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4    else {
5      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6      item = best(Impact_set)
7    }
8    Tr = {t ∈ D | t fully supports Rh}
9    While (Rh.Conf > Threshold) do
10     sanitize(Tr, item)
11 End

Figure 3.9: The Distortion Based Sanitization Algorithm
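Putting the pieces together, a compact sketch of the distortion-based sanitization of Figure 3.9 (our own illustrative Python; the zero-impact search is the set difference sketched above, and the DQ ranking is passed in as a stub):

# Illustrative end-to-end sketch of the DQDB sanitization (Figure 3.9).
# rank_by_dq_impact is a stub standing in for the Item Impact Rank algorithm.
def confidence(D, lr, rr):
    n_ant = sum(1 for t in D if all(t.get(i, 0) == 1 for i in lr))
    n_rule = sum(1 for t in D if all(t.get(i, 0) == 1 for i in lr | rr))
    return n_rule / n_ant if n_ant else 0.0

def sanitize_distortion(D, lr, rr, asset_items, threshold, rank_by_dq_impact):
    rule_items = lr | rr
    zitems = set(rule_items) - set(asset_items)      # zero-impact search
    item = next(iter(zitems)) if zitems else rank_by_dq_impact(rule_items)[0]
    supporters = [t for t in D if all(t.get(i, 0) == 1 for i in rule_items)]
    for t in supporters:
        if confidence(D, lr, rr) < threshold:
            break
        t[item] = 0                                  # distortion: 1 -> 0
    return D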

3.8 Data Quality Based Blocking Algorithm

In the blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way, we introduce an uncertainty related to the question "How to interpret this value?". The final effect is then the same as the distortion based algorithm. The strategy we propose here, as showed in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the data quality. For this reason, the Zero Impact algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the number of steps evaluation (which is of course different) and the data quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

Figure 3.10: Flow chart of the BBDB algorithm. (The flow is the same as in Figure 3.6, except that the DQ impact is evaluated on the pair Completeness-Consistency and the value of the selected item is converted to "?" instead of 0.)

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
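In code terms, the only change to the sanitization loop sketched above is the written value and the stopping test; a minimal illustration (ours), reusing the fuzzified interval definitions of Section 3.4:

# Illustrative blocking step (ours): write "?" instead of 0, and stop when
# even the minimum confidence of the rule is below the threshold.
def min_confidence(D, lr, rr):
    # min_conf(r) = min_sup(r) * 100 / max_sup(lr), on "?"-fuzzified data
    max_ant = sum(1 for t in D if all(t.get(i, 0) in (1, "?") for i in lr))
    min_rule = sum(1 for t in D if all(t.get(i, 0) == 1 for i in lr | rr))
    return 100.0 * min_rule / max_ant if max_ant else 0.0

def block(t, item):
    t[item] = "?"   # blocking: meaningless symbol instead of a distorted value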

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules;

2. We apply our algorithm (DQDB or BBDB);

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh);


INPUT: the Asset schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1  Begin
2    Qitem = ∅
3    Conf_a = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Conf_a, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Conf_a, Thresholds(Min, Max))
6    For each item in Rh do {
7      if (item ∈ ant(Rh)) then N = Nstep_ant
8      else N = Nstep_post
9      For each AIS ∈ Asset do {
10       node = recover_item(AIS, item)
11       Complet_Cost = node.Complet_Weight * N
12       Constr_Cost = Constr_surf(N, node)
13       item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
14     }
15   }
16   sort_by_impact(Items)
17   Return(Items)
18 End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (with every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement of our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count


the occurrences of the items in the different rules);

2. We identify, given the sensitive set, the Global Set of Zero Impact Items;

3. We order the Zero Impact Set by considering the previously counted occurrences;

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules;

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they turn out to be hidden;

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed;

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen;

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase;

• Multi Zero Impact Hiding;

• Multi Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do {
4      select r from Rsup
5      for (i = 0; i ≤ transaction_max_size; i++) do
6        if (item[i] ∈ r) then {
7          item[i].count++
8          item[i].list = item[i].list ∪ {r}
9        }
10     Rsup = Rsup - r
11   }
12   sort(item)
13   Return(item)
14 End

Figure 3.12: The Item Occurrences Classification Algorithm
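In Python terms, this classification is a simple occurrence count (our own illustration; rules are modeled as frozensets of items):

from collections import defaultdict

# Illustrative sketch of the pre-hiding classification: for every item,
# count the sensitive rules that contain it and remember which ones.
def classify_items(sensitive_rules):
    count = defaultdict(int)
    supported = defaultdict(list)
    for r in sensitive_rules:
        for item in r:
            count[item] += 1
            supported[item].append(r)
    order = sorted(count, key=count.get, reverse=True)  # most shared items first
    return order, supported

rules = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"b", "d"})]
order, supported = classify_items(rules)
print(order[0], len(supported[order[0]]))   # b 3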

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6      select r from Rsup
7      for each (item ∈ r) do {
8        if (item ∈ Isup) then {
9          if (item ∉ Asset) then Zitem = Zitem + item
10         Isup = Isup - item
11       }
12     }
13     Rsup = Rsup - r
14   }
15   return(Zitem)
16 End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst for the Zero Impact Items of all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• The items supporting a rule of the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set);

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion);


4. We recompute the support and the confidence values of the rules contained in the Rss set and, if one of the rules turns out to be hidden, it is removed from the Rss set and from the Rs set;

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present in what follows the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ item_list | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do {
4      Rss = {r ∈ Rs | Izm[i] ∈ r}
5      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
6      while (Tss ≠ ∅) and (Rss ≠ ∅) do {
7        select a transaction t from Tss
8        Tss = Tss - t
9        Sanitize(t, Izm[i])
10       Recompute_Sup_Conf(Rss)
11       for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
12         Rss = Rss - r
13         Rs = Rs - r
14       }
15     }
16   }
17 End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
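A condensed Python rendering of this burst sanitization (our illustration; sanitize writes either 0 or "?" depending on the chosen flavor, and is_hidden recomputes support and confidence against the thresholds):

# Illustrative sketch of the Multi Rules Zero Impact Hiding (Figure 3.14):
# each zero-impact item sanitizes, in a burst, all the rules it supports.
def multi_rule_zero_impact_hiding(D, Rs, zitems, is_hidden, sanitize):
    for item in zitems:
        Rss = [r for r in Rs if item in r]
        Tss = [t for t in D
               if any(all(t.get(i, 0) == 1 for i in r) for r in Rss)]
        for t in Tss:
            if not Rss:
                break
            sanitize(t, item)            # write 0 (distortion) or "?" (blocking)
            for r in [r for r in Rss if is_hidden(r, D)]:
                Rss.remove(r)
                Rs.remove(r)
    return D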

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules;

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item;

3. The items are sorted by maximum impact and number of supported rules;

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss);


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion);

6. We recompute the support and the confidence values of the rules contained in the Rss set and, if one of the rules turns out to be hidden, it is removed from the Rss set and from the Rs set;

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.

INPUT: the Asset schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ item_list | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do {
4      for each rule ∈ item.list do {
5        N = compute_step(item, thresholds)
6        For each AIS ∈ Asset do {
7          node = recover_item(AIS, item)
8          item.list.rule_cost = item.list.rule_cost + Compute_Cost(node, N)
9        }
10     }
11     item.max_cost = max_cost(item)
12   }
13   Sort Iml by max cost and number of rules supported
14   While (Rs ≠ ∅) do {
15     Rss = {r ∈ Rs | Iml[1] ∈ r}
16     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
17     while (Tss ≠ ∅) and (Rss ≠ ∅) do {
18       select a transaction t from Tss
19       Tss = Tss - t
20       Sanitize(t, Iml[1])
21       Recompute_Sup_Conf(Rss)
22       for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
23         Rss = Rss - r
24         Rs = Rs - r
25       }
26     }
27     Iml = Iml - Iml[1]
28   }
29 End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, item_list)
3    Zitem = Multi_Rule_Zero_Imp(Rs, item_list)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
6  End

Figure 3.16: The General Algorithm
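Read as a pipeline, the general algorithm composes the routines sketched in the previous sections; a hypothetical Python rendering (ours; the low impact pass is passed in as an assumed stub, since it was not sketched above):

# Illustrative composition of the whole strategy (Figure 3.16). classify_items,
# zero_impact_items and multi_rule_zero_impact_hiding are the earlier sketches;
# multi_rule_low_impact_hiding is an assumed stub for the Figure 3.15 pass.
def sanitize_database(D, Rs, asset_items, is_hidden, sanitize,
                      multi_rule_low_impact_hiding):
    order, supported = classify_items(Rs)                      # pre-hiding phase
    zitems = zero_impact_items(set().union(*Rs), asset_items)  # burst zero-impact search
    if zitems:
        multi_rule_zero_impact_hiding(D, Rs, zitems, is_hidden, sanitize)
    if Rs:  # rules still to hide: fall back to the low impact pass
        multi_rule_low_impact_hiding(D, Rs, order, is_hidden, sanitize)
    return D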

Chapter 4

Conclusions

The problem of the impact of the sanitization on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, year 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, year 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, year 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, year 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162 (1985).

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, year 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, year 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, year 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, year 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, year 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, year 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, year 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, year 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, year 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, year 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In proceedings of the 4th Information Hiding Workshop, pp. 369-383, year 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, year 1983 (July). IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, year 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), year 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz ed., North-Holland, year 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, year 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, year 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, year 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, year 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, year 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, year 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, year 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, year 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, year 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, year 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, year 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, year 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, year 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, year 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, Jan. 2001, pp. 110-124. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, year 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, year 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, year 1990.

[56] W. Kent. Data and reality. North Holland, New York, year 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, year 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, year 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, year 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, year 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", year 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, year 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, year 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, year 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, year 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, year 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, year 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, year 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, year 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, year 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, year 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, year 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, year 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, year 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, year 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, year 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, year 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, year 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October), 1948, pp. 379-423 and pp. 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, year 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, year 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, year 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, year 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, year 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University, year 2003. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, year 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, year 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, year 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, World Scientific Publishing Co., Armidale, Australia, 1994.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, year 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, year 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, year 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155. Springer Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, year 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm. EUR - Scientific and Technical Research series - ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.





More in detail, the two tables we consider have the following attributes.

Localization Table:

• Energy Meter ID: it contains the ID of the energy meter;

• Local Code: it is the code of the city/region in which the meter is;

• City: it contains the name of the city in which the home hosting the meter is;

• Network: it is the code of the electric network;

• Subnetwork: it is the code of the electric subnetwork.

Per Day Data Consumption Table:

• Energy Meter ID: it contains the ID of the energy meter;

• Average Energy Consumption: it contains the average energy used by the target home during a day;

• Date: it contains the date of registration;

• Energy consumption per hour: it is a vector of attributes containing the average energy consumption per hour;

• Max energy consumption per hour: it is a vector of attributes containing the maximum energy consumption per hour;

• Max Energy Consumption: it contains the maximum energy consumption per day measured for a certain house.

As we have already underlined, from this information it is possible to guess several sensitive pieces of information related to the behavior of the people living in a target house controlled by an electric meter. For this reason, it is necessary to perform a sanitization process in order to guarantee a minimum level of privacy.

However, a sanitization without knowledge about the relevance and the meaning of the data can be extremely dangerous. In figure 2.3, two DMG schemas associated with the data contained in the previously described tables are showed. In what follows we give a brief description of the two schemas.

AIS Loc

The AIS Loc is mainly related to the data contained in the localization table. As shown in Figure 2.3, there exist some constraints associated with the different items contained in this table. For example, we require the local code to be in a range of values between 1 and 1000 (every different code is meaningless and cannot be interpreted; moreover, a value out of this range reveals to a malicious user that the target tuple is sanitized). The weight of this constraint is then high (near to 1).

[Figure 2.3 here. It depicts the two DMGs of the Asset: AIS Loc, linking Local Code (constrained to the range 1-1000), City, Network Number and Subnetwork Code (range 0-50), with the constraints "they correspond to the same city" and "the subnet is contained in the correct network"; and AIS Cons, linking Energy Counter Id, Date, the hourly consumptions (En 1_am ... En 12_pm), the hourly maxima (Max 1_am ... Max 12_pm), the Average Energy Consumption ("average energy used equal to") and the Max Energy Consumption ("max equal to").]
Figure 2.3: Example of two DMG schemas associated with a National Electric Power Grid database

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to adopt a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence while avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the Energy consumption per hour vector and the average energy consumption per day. To do this, a complex constraint is used: it describes the possibility to permute or to modify the values of energy consumption per hour (in the figure, En 1 am, En 2 am, etc.) while keeping the average energy consumption consistent.

In the same way it is possible to use another complex constraint to express the same concept for the maximum consumption per day in relation to the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: in fact, the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.

By the use of these two schemas we described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to use these additional schemas in order to perform a softer (from the DQ point of view) sanitization.
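To fix ideas, the following minimal sketch (in Python; every class, field and weight value here is an illustrative assumption of ours, not part of the DMG formalism itself) shows one possible encoding of an AIS with weighted attributes and constraints such as those of AIS Loc:

from dataclasses import dataclass, field

@dataclass
class AttributeNode:
    name: str
    accuracy_weight: float       # relevance of this attribute's accuracy (0..1)
    completeness_weight: float   # relevance of this attribute's completeness (0..1)

@dataclass
class Constraint:
    description: str
    attributes: list             # names of the attributes involved
    violation_weight: float      # relevance of a violation of this constraint (0..1)

@dataclass
class AIS:
    name: str
    nodes: dict = field(default_factory=dict)        # attribute name -> AttributeNode
    constraints: list = field(default_factory=list)
    accuracy_weight: float = 1.0                     # AIS-level DQ weights
    completeness_weight: float = 1.0
    consistency_weight: float = 1.0

# A possible encoding of AIS Loc: decreasing accuracy/completeness relevance
# from local code to subnetwork, as discussed above (all weights are assumptions).
ais_loc = AIS("AIS Loc", accuracy_weight=0.9, completeness_weight=0.9,
              consistency_weight=0.8)
ais_loc.nodes["Local Code"] = AttributeNode("Local Code", 0.9, 0.9)
ais_loc.nodes["Network"] = AttributeNode("Network", 0.7, 0.7)
ais_loc.nodes["Subnetwork"] = AttributeNode("Subnetwork", 0.5, 0.5)
ais_loc.constraints.append(Constraint("Local code in range 1-1000",
                                      ["Local Code"], violation_weight=0.95))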

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Therefore, algorithms exist designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We choose to study a new solution for this family of algorithms for the following reasons:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a wide range of databases with different characteristics.

• Association Rules seem to be the most appropriate technique to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained with the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution to the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development). We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items representing the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, one for each item in I, associated with transaction t. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value in values_of_items is 0. Size is the number of 1 values appearing in values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned to rules that are both frequent and strong: if a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule is reduced below min_supp and min_conf, correspondingly.
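As a concrete illustration of these definitions, the following short Python sketch (data and helper names are ours) computes support and confidence over a small transactional database:

# Each transaction is modelled as the set of items it supports
# (the bitmap positions whose value is 1).
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def support(itemset, D):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y, D):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(X | Y, D) / support(X, D)

print(support({"a", "b"}, D))        # 0.6
print(confidence({"a"}, {"b"}, D))   # 0.6 / 0.8 = 0.75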


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules mined from D, and a subset Rh of R, we want to transform D into a database D' in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence threshold (MST or MCT), which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is obtained by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence

Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X)

of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of the large itemsets from which sensitive rules can be mined.
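Reusing the helpers of the previous sketch, a minimal illustration of the second confidence-decrease option (increasing the antecedent support through transactions that only partially support X) might be as follows; the function name and selection heuristic are ours:

def hide_by_confidence_decrease(D, X, Y, min_conf):
    """Increase supp(X) via transactions that partially (but not fully)
    support X, until conf(X => Y) drops below min_conf."""
    candidates = [t for t in D if (t & X) and not (X <= t)]
    # Prefer transactions already containing most of X (fewest changes needed).
    candidates.sort(key=lambda t: len(X - t))
    for t in candidates:
        if confidence(X, Y, D) < min_conf:
            break
        t |= X   # set all antecedent items to 1 in this transaction
    return D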

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the i-th value in the list of values is 1 if the transaction supports the i-th item, and 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, meaning that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D; the number |D| of transactions in D; a set Rh of rules to hide; the min_conf threshold.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
  Foreach rule r in Rh do {
    1. T'lr = {t ∈ D | t does not fully support rr and partially supports lr}
       foreach transaction t in T'lr do {
    2.   t.num_items = |I| - |lr ∩ t.values_of_items|
       }
    3. Sort(T'lr) in descending order of number of items of lr contained
       Repeat until (conf(r) < min_conf) {
    4.   t = T'lr[1]
    5.   set_all_ones(t.values_of_items, lr)
    6.   Nlr = Nlr + 1
    7.   conf(r) = Nr / Nlr
    8.   Remove t from T'lr
       }
    9. Remove r from Rh
  }
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease (lr and rr denote the antecedent and the consequent of r; Nr and Nlr the number of transactions supporting r and lr)


Algorithm 2
INPUT: the source database D; the size of the database |D|; a set Rh of rules to hide; the min_supp threshold; the min_conf threshold.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
  Foreach rule r in Rh do {
    1. Tr = {t ∈ D | t fully supports r}
       foreach transaction t in Tr do {
    2.   t.num_items = count(t)
       }
    3. Sort(Tr) in ascending order of transaction size
       Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
    4.   t = Tr[1]
    5.   j = choose_item(rr)
    6.   set_to_zero(j, t.values_of_items)
    7.   Nr = Nr - 1
    8.   supp(r) = Nr / |D|
    9.   conf(r) = Nr / Nlr
    10.  Remove t from Tr
       }
    11. Remove r from Rh
  }
End

Algorithm 3
INPUT and OUTPUT as in Algorithm 2.

Begin
  Foreach rule r in Rh do {
    1. Tr = {t ∈ D | t fully supports r}
       foreach transaction t in Tr do {
    2.   t.num_items = count(t)
       }
    3. Sort(Tr) in decreasing order of transaction size
       Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
    4.   t = Tr[1]
    5.   j = choose_item(r)
    6.   set_to_zero(j, t.values_of_items)
    7.   Nr = Nr - 1
    8.   Recompute Nlr
    9.   supp(r) = Nr / |D|
    10.  conf(r) = Nr / Nlr
    11.  Remove t from Tr
       }
    12. Remove r from Rh
  }
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is, obviously, the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r,

minconf(r) = (minsup(r) * 100) / maxsup(lr)    and    maxconf(r) = (maxsup(r) * 100) / minsup(lr),

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
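The following minimal Python sketch (our own illustration, using fractions instead of the percentages of the formulas above) computes these bounds on transactions containing unknown values:

# Transactions map each item to 1 (supported), 0 (not supported) or "?" (unknown).
def minsup(itemset, D):
    """Fraction of transactions that certainly support every item in `itemset`."""
    return sum(1 for t in D if all(t.get(i) == 1 for i in itemset)) / len(D)

def maxsup(itemset, D):
    """Fraction of transactions that support or could support the itemset."""
    return sum(1 for t in D if all(t.get(i) in (1, "?") for i in itemset)) / len(D)

def minconf(X, Y, D):
    # assumes maxsup(X) > 0
    return minsup(X | Y, D) / maxsup(X, D)

def maxconf(X, Y, D):
    # assumes minsup(X) > 0
    return maxsup(X | Y, D) / minsup(X, D)

D = [
    {"a": 1, "b": 1},
    {"a": 1, "b": "?"},
    {"a": "?", "b": 0},
    {"a": 1, "b": 0},
]
print(minsup({"a", "b"}, D), maxsup({"a", "b"}, D))   # 0.25 0.5
print(minconf({"a"}, {"b"}, D))                        # 0.25 / 1.0 = 0.25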


Algorithm 4
INPUT: the source database D; the size of the database |D|; a set L of large itemsets; the set Lh of large itemsets to hide; the min_supp threshold.
OUTPUT: the database D modified by the hiding of the large itemsets in Lh.

Begin
  1. Sort(Lh) in descending order of size and support
  2. Th = {TZ | Z ∈ Lh and ∀t ∈ D: t ∈ TZ ⇒ t supports Z}
     Foreach Z in Lh {
  3.   Sort(TZ) in ascending order of transaction size
       Repeat until (supp(Z) < min_supp) {
  4.     t = pop_from(TZ)
  5.     a = maximal support item in Z
  6.     aS = {X ∈ Lh | a ∈ X}
         Foreach X in aS {
  7.       if (t ∈ TX) delete(t, TX)
         }
  8.     delete(a, t, D)
       }
     }
End

Algorithm 5
INPUT and OUTPUT as in Algorithm 4.

Begin
  1. Sort(Lh) in descending order of support and size
     Foreach Z in Lh do {
  2.   i = 0
  3.   j = 0
  4.   <TZ> = sequence of transactions supporting Z
  5.   <Z> = sequence of items in Z
       Repeat until (supp(Z) < min_supp) {
  6.     a = <Z>[i]
  7.     t = <TZ>[j]
  8.     aS = {X ∈ Lh | a ∈ X}
         Foreach X in aS {
  9.       if (t ∈ TX) then delete(t, TX)
         }
  10.    delete(a, t, D)
  11.    i = (i+1) modulo size(Z)
  12.    j = j+1
       }
     }
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context, the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent with unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of the items in the antecedent.


INPUT: the database D; a set L of large itemsets; the set Lh of large itemsets to hide; MST and SM.
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh.

Begin
  1. Sort Lh in descending order of size and minimum support of the large itemsets
     Foreach Z in Lh {
  2.   TZ = {t ∈ D | t supports Z}
  3.   Sort the transactions in TZ in ascending order of transaction size
       Repeat until (minsup(Z) < MST - SM) {
  4.     Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
  5.     Update the supports of the affected itemsets
  6.     Update the database D
       }
     }
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able in any case (with the exclusion of the first algorithm) to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem here, during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider a health database: the introduction of artifactual rules may deceive a doctor who, on the basis of these rules, could choose a wrong therapy for a patient.


Algorithm 7
INPUT: the source database D; a set Rh of rules to hide; MST, MCT and SM.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
  Foreach r in Rh do {
    1. Tr = {t ∈ D | t fully supports r}
    2. for each t in Tr count the number of items in t
    3. sort the transactions in Tr in ascending order of the number of items supported
       Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM) {
    4.   Choose the first transaction t ∈ Tr
    5.   Choose the item j in rr with the highest minsup
    6.   Place a "?" mark in the place of j in t
    7.   Recompute minsup(r)
    8.   Recompute minconf(r)
    9.   Recompute the minconf of the other affected rules
    10.  Remove t from Tr
       }
    11. Remove r from Rh
  }
End

Algorithm 8
INPUT: the source database D; a set Rh of rules to hide; MCT and SM.
OUTPUT: the database D transformed so that the rules in Rh cannot be mined.

Begin
  Foreach r in Rh do {
    1. T'lr = {t ∈ D | t partially supports lr and does not fully support rr}
    2. for each t in T'lr count the number of items of lr in t
    3. sort the transactions in T'lr in descending order of the number of items of lr supported
       Repeat until (minconf(r) < MCT - SM) {
    4.   Choose the first transaction t ∈ T'lr
    5.   Place a "?" mark in t for the items in lr that are not supported by t
    6.   Recompute maxsup(lr)
    7.   Recompute minconf(r)
    8.   Recompute the minconf of the other affected rules
    9.   Remove t from T'lr
       }
    10. Remove r from Rh
  }
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way with any type of categorical database, knowing nothing about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is in the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by DQ assurance. In the context of Association Rule PPDM algorithms, and more specifically in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. A target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness, and every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that the Asset schema related to a target database is available, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the Asset schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications in the case of very complex information could be onerous in terms of performance. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database Asset. We call this type of item Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because they were judged, in the Asset description phase, not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection and we use one of these Zero Impact Items in order to hide the target rule.
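A compact sketch of the resulting selection step (our own illustrative Python; `asset_items` is the set of items appearing in some DMG of the Asset, and `dq_impact` stands for the simulation of step 3 above) could be:

def select_item_to_modify(rule_items, asset_items, dq_impact):
    """Return the item of the rule whose modification least damages DQ.

    rule_items:  items involved in the sensitive rule
    asset_items: items contained in some DMG of the Asset
    dq_impact:   function item -> estimated DQ damage (Asset inspection)
    """
    # Improvement: zero-impact items bypass the (costly) Asset inspection.
    zero_impact = [i for i in rule_items if i not in asset_items]
    if zero_impact:
        return zero_impact[0]
    # Otherwise rank the items by simulated DQ damage and take the lowest.
    return min(rule_items, key=dq_impact)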


3.7 Data Quality Distortion Based Algorithm

In the distortion-based algorithms, the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming we work with categorical databases like the classical market basket database(1), the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As shown, the first operation required is the identification of the rule to hide. To do this it is possible to use the well-known APRIORI algorithm [5], which, given a database, extracts all the rules contained in it having a confidence over a threshold level(2). Details on this algorithm are described in Appendix B. The output of this phase is a set of rules; from this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm); then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as can be seen from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but only modified. The Item Impact Rank algorithm is reported in Figure 3.8. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

(1) A market basket database is a database in which each transaction represents the products acquired by a customer.

(2) This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Figure 3.6 here. Flow chart of the DQDB algorithm: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, select an item from the zero-impact set, otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (accuracy, consistency), rank the items by quality impact and select the best item; then select a transaction fully supporting the sensitive rule and convert its value from 1 to 0, repeating until the rule is hidden.]

Figure 3.6: Flow chart of the DQDB algorithm


3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness, we give an idea about its complexity:

• The search for all zero-impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden.
OUTPUT: the set (possibly empty) Zitem of Zero Impact Items discovered.

1  Begin
2    Zitem = ∅
3    Rsup = Rh
4    Isup = list of the items contained in Rsup
5    While (Isup ≠ ∅) do {
6      res = 0
7      Select item from Isup
8      res = Search(Asset, item)
9      if (res == 0) then Zitem = Zitem + item
10     Isup = Isup - item
11   }
12   return(Zitem)
13 End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by quality impact.

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Confp = Confa
5    Nstep_if_ant = 0
6    Nstep_if_post = 0
7    While (Confa > Th) do {
8      Nstep_if_ant++
9      Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
10   }
11   While (Confp > Th) do {
12     Nstep_if_post++
13     Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
14   }
15   For each item in Rh do {
16     if (item ∈ ant(Rh)) then N = Nstep_if_ant
17     else N = Nstep_if_post
18     For each AIS ∈ Asset do {
19       node = recover_item(AIS, item)
20       Accur_Cost = node.accuracy_Weight * N
21       Constr_Cost = Constr_surf(N, node)
22       item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
23     }
24   }
25   sort_by_impact(Items)
26   Return(Items)
   End

Figure 3.8: Item Impact Rank Algorithm for the Distortion-Based Algorithm (recall that ant(Rh) denotes the antecedent component of the rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; the target database D.
OUTPUT: the sanitized database.

1  Begin
2    Zitem = DQDB_Zero_Impact(Asset, Rh)
3    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4    else {
5      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6      item = best(Impact_set)
7    }
8    Tr = {t ∈ D | t fully supports Rh}
9    While (Rh.Conf > Threshold) do
10     sanitize(Tr, item)
11 End

Figure 3.9: The Distortion-Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion-based algorithm. The strategy we propose here, shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how the impact on data quality is computed. For this reason the Zero Impact algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the number-of-steps evaluation (which is of course different) and to the data quality evaluation.


In this case, in fact, it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason in this case we consider the pair (completeness, consistency) and not the pair (accuracy, consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

[Figure 3.10 here. Flow chart of the BBDB algorithm: identical to the flow of Figure 3.6, except that the DQ impact is evaluated in terms of (completeness, consistency) and the value in the selected transaction is converted to "?" instead of 0.]

Figure 3.10: Flow chart of the BBDB algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
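Under the same conventions as the fuzzification sketch (dict-based transactions with 0, 1 and "?" values, and the minconf helper), the blocking step can be sketched as follows; the function name is ours:

def bbdb_sanitize(D, X, Y, threshold, item):
    """Blocking step (cf. Figure 3.10): place "?" for `item` in transactions
    fully supporting X => Y until minconf(X => Y) drops below `threshold`."""
    Tr = [t for t in D if all(t.get(i) == 1 for i in X | Y)]
    for t in Tr:
        if minconf(X, Y, D) < threshold:
            break
        t[item] = "?"   # blocking: 1 -> "?" for the chosen item
    return D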

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by quality impact.

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6    For each item in Rh do {
7      if (item ∈ ant(Rh)) then N = Nstep_ant
8      else N = Nstep_post
9      For each AIS ∈ Asset do {
10       node = recover_item(AIS, item)
11       Complet_Cost = node.Complet_Weight * N
12       Constr_Cost = Constr_surf(N, node)
13       item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
14     }
15   }
16   sort_by_impact(Items)
17   Return(Items)
   End

Figure 3.11: Item Impact Rank Algorithm for the Blocking-Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that even if, after every sanitization, the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms and is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. Given the sensitive set, we identify the global set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. All the rules supported by the chosen item are sanitized simultaneously and removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms for each block, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items.
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences.

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do {
4      select r from Rsup
5      for (i = 0; i <= transaction_max_size; i++) do
6        if (item[i] ∈ r) then {
7          item[i].count++
8          item[i].list = item[i].list ∪ r
9        }
10     Rsup = Rsup - r
11   }
12   sort(item)
13   Return(item)
   End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items.
OUTPUT: the list of the Zero Impact Items.

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6      select r from Rsup
7      for each (item ∈ r) do
8        if (item ∈ Isup) then {
9          if (item ∉ Asset) then Zitem = Zitem + item
10         Isup = Isup - item
11       }
12     Rsup = Rsup - r
13   }
14   return(Zitem)
   End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules, and the Zero Impact Items are identified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is simply that it searches in a burst for the Zero Impact Items of all the sensitive rules. A compact sketch of the occurrence classification is given below.
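As an illustration of the occurrence classification of Figure 3.12, consider the following minimal Python sketch (rules are modelled as plain itemsets; all names are ours):

from collections import Counter

def item_occurrences(sensitive_rules):
    """Count, for every item, how many sensitive rules contain it, keeping
    the list of supported rules per item (cf. Figure 3.12)."""
    count = Counter()
    supported = {}
    for ridx, rule in enumerate(sensitive_rules):
        for item in rule:
            count[item] += 1
            supported.setdefault(item, []).append(ridx)
    # items sorted by number of supported rules, most-shared first
    ranked = sorted(count, key=count.get, reverse=True)
    return ranked, supported

rules = [{"a", "b"}, {"b", "c"}, {"b", "d"}, {"c", "d"}]
ranked, supported = item_occurrences(rules)
print(ranked[0], supported[ranked[0]])   # b [0, 1, 2]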

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented above returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by each of these items.

• The items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence value for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason we present below the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D.
OUTPUT: a partially sanitized database PS.

1  Begin
2    Izm = {item ∈ ItemList | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do {
4      Rss = {r ∈ Rs | Izm[i] ∈ r}
5      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
6      while (Tss ≠ ∅) and (Rss ≠ ∅) do {
7        select a transaction t from Tss
8        Tss = Tss - t
9        Sanitize(t, Izm[i])
10       Recompute_Sup_Conf(Rss)
11       for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
12         Rss = Rss - r
13         Rs = Rs - r
14       }
15     }
16   }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
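To make the control flow of Figure 3.14 concrete, the following Python sketch (our own illustration: rules are modelled as itemsets, while `sanitize` and `hidden` are assumed callbacks standing for the blocking/distortion step and the support/confidence threshold check) mirrors the burst sanitization:

def multi_rule_zero_impact_hiding(D, rules, zero_items, sanitize, hidden):
    """Burst sanitization over zero-impact items (sketch of Figure 3.14).

    D:          list of transactions (sets of items)
    rules:      sensitive rules, each modelled as the set of its items
    zero_items: zero-impact items, most-shared first
    sanitize:   callback (transaction, item) -> None (blocking or distortion)
    hidden:     callback (rule, D) -> bool (threshold check)
    """
    remaining = list(rules)
    for item in zero_items:
        rss = [r for r in remaining if item in r]          # rules this item supports
        tss = [t for t in D if any(r <= t for r in rss)]   # transactions to touch
        while tss and rss:
            t = tss.pop()
            sanitize(t, item)
            rss = [r for r in rss if not hidden(r, D)]
        remaining = [r for r in remaining if not hidden(r, D)]
    return remaining   # rules left for the Low Impact phase (possibly empty)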

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered list, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, the algorithm does not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank; then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.

INPUT: the Asset schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items.
OUTPUT: the sanitized database SDB.

1  Begin
2    Iml = {item ∈ ItemList | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do {
4      for each rule ∈ item.list do {
5        N = compute_step(item, thresholds)
6        For each AIS ∈ Asset do {
7          node = recover_item(AIS, item)
8          item.rule_cost = item.rule_cost + ComputeCost(node, N)
9        }
10     }
11     item.max_cost = max_cost(item)
12   }
13   Sort Iml by max cost and number of rules supported
14   While (Rs ≠ ∅) do {
15     Rss = {r ∈ Rs | Iml[1] ∈ r}
16     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
17     while (Tss ≠ ∅) and (Rss ≠ ∅) do {
18       select a transaction t from Tss
19       Tss = Tss - t
20       Sanitize(t, Iml[1])
21       Recompute_Sup_Conf(Rss)
22       for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
23         Rss = Rss - r
24         Rs = Rs - r
25       }
26     }
27     Iml = Iml - Iml[1]
28   }
   End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds.
OUTPUT: a sanitized database PS.

1  Begin
2    Item = Item_Occ_Class(Rs, ItemList)
3    Zitem = Multi_Rule_Zero_Imp(Rs, ItemList)
4    if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database; this lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion-based and blocking-based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide a set of sensitive rules simultaneously. On the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated, Data Quality Based, PPDM algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955



[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988


[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995, AAAI Press.

[65] G. L. Miller. Resonance, Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000, ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004, ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002, Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998, ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995, ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238, AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986, Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003, Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995, Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362, Cambridge, MA, MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995, Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983, British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and pp. 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982, Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions on Knowledge and Data Engineering, vol. 3, n. 4, August 1992, pp. 301-316, IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002, World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995, Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998, ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002, ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004, ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003, IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, World Scientific Publishing Co., 1994.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996, ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155, Springer Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001, IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997, AAAI Press.


[Figure 2.3 (diagram): two DMG schemas of the Asset, AIS Loc and AIS Cons. The Loc schema relates the local code (value 1-1000), the city, the subnetwork code (value 0-50) and the network number, with the constraints "they correspond to the same city" and "the subnet is contained in the correct network". The Cons schema relates the energy counter Id, the hourly energy consumptions (En 1_am, En 2_am, En 3_am, ..., En 10_pm, En 11_pm, En 12_pm) with "average energy used equal to" the average energy consumption, the hourly maxima (Max 1_am, ..., Max 12_pm) with "max equal to" the max energy consumption, and the date.]

Figure 2.3: Example of two DMG schemas associated to a National Electric Power Grid database.

Moreover, due to the relevance of the information about the regional localization of the electric meter, even the weights associated with the completeness and the accuracy of the local code have a high value.

A similar constraint exists over the subnetwork code. With regard to the completeness and the accuracy of this item, a too high weight may compromise the ability to hide sensitive information. In other words, considering the three main localization items (local code, network number, subnetwork number), it is reasonable to assign a decreasing level of accuracy and completeness. In this way we try to guarantee the preservation of information about the area of pertinence, avoiding the exact localization of the electric meter (i.e. "I know that it is in a certain city, but I do not know the quarter in which it is").


AIS Cons

Figure 2.3 also shows the DMG (AIS Cons) related to the energy consumption table. In this case the value of the information stored in this table is mainly related to the maximum energy consumption and the average energy consumption. For this reason their accuracy and completeness weights must be high (to magnify their relevance in the sanitization phase). However, it is also necessary to preserve the coherence between the values stored in the energy consumption per hour vector and the average energy consumption per day. In order to do this a complex constraint is used. In this way we describe the possibility to permute or to modify the values of energy consumption per hour (in the figure En 1_am, En 2_am, etc.), maintaining however the average energy consumption consistent.

In the same way, it is possible to use another complex constraint in order to express the same concept for the maximum consumption per day in relation with the maximum consumption per hour. Finally, a relation exists between the maximum energy consumption per hour and the corresponding average consumption per hour: the max consumption per hour cannot be smaller than the average consumption per hour. This concept is represented by a set of simple constraints. Finally, the weights of accuracy and completeness of the Energy Meter ID are equal to 1, because a condition needed to keep the database consistent is to maintain at least the individuality of the electric meters.
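To make the two kinds of constraint concrete, here is a minimal sketch in Python (with hypothetical field names, not part of the original schema definition) of how the complex constraint on the daily average and the simple max-vs-average constraints could be checked:

def check_cons_constraints(hourly_avg, hourly_max, stored_daily_avg, eps=1e-6):
    # Complex constraint: hourly values may be permuted or modified only
    # if the declared average daily consumption is preserved.
    avg_ok = abs(sum(hourly_avg) / len(hourly_avg) - stored_daily_avg) < eps
    # Simple constraints: the max consumption per hour cannot be smaller
    # than the average consumption for the same hour.
    max_ok = all(mx >= av for mx, av in zip(hourly_max, hourly_avg))
    return avg_ok and max_ok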

Using these two schemas we have described some characteristics of the information stored in the database, related to their meaning and their usage. In the following chapters we show some PPDM algorithms designed to exploit these additional schemas in order to perform a softer (from the DQ point of view) sanitization.

Chapter 3

Data Quality Based Algorithms

A large number of Privacy Preserving Data Mining algorithms have been proposed. The goal of these algorithms is to prevent privacy breaches arising from the use of a particular Data Mining technique. Algorithms therefore exist that are designed to limit the malicious use of Clustering Algorithms, of Classification Algorithms, and of Association Rule algorithms.

In this report we focus on Association Rule algorithms. We chose to study a new solution for this family of algorithms for the following motivations:

• Association Rule Data Mining techniques have been investigated for about 20 years. For this reason a large number of DM algorithms and application examples exist that we can take as a starting point to study how to contrast the malicious use of this technique.

• Association Rule Data Mining can be applied in a very intuitive way to categorical databases. A categorical database can be easily synthesized by the use of automatic tools, allowing us to quickly obtain for our tests a big range of different types of databases with different characteristics.

• Association Rules seem to be the most appropriate to be used in combination with the IQM schema.

• Results of classification data mining algorithms can be easily converted into association rules. This is, in perspective, very important because it allows us to transfer the experience gained on the association rule PPDM family to the classification PPDM family without much effort.

In what follows we first present some algorithms developed in the context of the Codmine project [97] (we recall here that our contribution in the Codmine project was related to the algorithm evaluation and to the testing framework, and not to the algorithm development).



We then present some considerations about these algorithms, and finally we present our new approach to association rule PPDM algorithms.

3.1 ConMine Algorithms

We present here a short description of the privacy preserving data mining algorithms which have been developed in the context of the Codmine project. In the association rule hiding context, two different heuristic-based approaches have been proposed: one is based on data deletion, according to which the original database is modified by deleting the data values; the other is based on data fuzzification, thus hiding the sensitive information by inserting unknown values into the database. Before giving more details concerning the proposed approaches, we briefly recall the problem formulation of association rule hiding.

3.2 Problem Formulation

Let D be a transactional database and I = {i1, ..., in} be a set of items, which represents the set of products that can be sold by the database owner. Each transaction can be considered as a subset of the items in I. We assume that the items in a transaction or an itemset are sorted in lexicographic order. According to a bitmap notation, each transaction t in the database D is represented as a triple <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, one value for each item in I, associated with transaction t. An item is supported by a transaction t if its value in the values_of_items is 1, and it is not supported by t if its value in the values_of_items is 0. Size is the number of 1 values which appear in the values_of_items, that is, the number of items supported by the transaction. An association rule is an implication of the form X ⇒ Y between disjoint itemsets X and Y in I. Each rule is assigned both a support and a confidence value. The first one represents the probability of finding in the database transactions containing all the items in X ∪ Y, whereas the confidence is the probability of finding transactions containing all the items in X ∪ Y once we know that they contain X. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between the antecedent X and the consequent Y of the rule. The association rule mining process consists of two steps: 1) the identification of all the frequent itemsets, that is, all the itemsets whose supports are bigger than a pre-determined minimum support threshold min_supp; 2) the generation of strong association rules from the frequent itemsets, that is, those frequent rules whose confidence values are bigger than a minimum confidence threshold min_conf. Along with the confidence and the support, a sensitivity level is assigned only to rules that are both frequent and strong. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the strength of the rule will be reduced below min_supp and min_conf correspondingly.
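These definitions are straightforward to compute from the bitmap representation. The following minimal Python sketch, built on a hypothetical toy database, illustrates them; the support and confidence helpers are reused in the sketches that follow.

DB = [  # toy transactions in bitmap notation (item -> 0/1)
    {"bread": 1, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 1, "milk": 0},
    {"bread": 0, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 0, "milk": 1},
]

def support(itemset, db):
    # fraction of transactions supporting every item in the itemset
    return sum(all(t.get(i, 0) == 1 for i in itemset) for t in db) / len(db)

def confidence(X, Y, db):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(X | Y, db) / support(X, db)

X, Y = {"bread"}, {"butter"}
print(support(X | Y, DB))    # 0.5
print(confidence(X, Y, DB))  # 0.666...: 2 of the 3 bread transactions have butter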


The problem of association rule hiding can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset Rh of R, we want to transform D into a database D′ in such a way that the rules in R can still be mined, except for the rules in Rh.

3.3 Rule hiding algorithms based on data deletion

The rule hiding approach based on data deletion operates by deleting data values in order to reduce either the support or the confidence of the sensitive rule below the corresponding minimum support or confidence thresholds, MST or MCT, which are fixed by the user, in such a way that the changes introduced in the database are minimal. The decrease in the support of a rule X ⇒ Y is done by decreasing the support of its generating itemset X ∪ Y, that is, the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule. The decrease in the confidence Conf(X ⇒ Y) = Supp(X ⇒ Y)/Supp(X) of the rule X ⇒ Y, instead, can be accomplished: 1) by decreasing the support of the rule consequent Y through transactions that support both X and Y; 2) by increasing the support of the rule antecedent through transactions that do not fully support Y and partially support X. Five data deletion algorithms have been developed for privacy preserving data mining, whose pseudocodes are presented in Figures 3.1, 3.2 and 3.3. Algorithm 1 decreases the confidence of a sensitive rule below the min_conf threshold by increasing the support of the rule antecedent while keeping the rule support fixed. Algorithm 2 reduces the rule support by decreasing the frequency of the consequent through transactions that support the rule. Algorithm 3 operates as the previous one, with the only exception that the choice of the item to remove is done according to a minimum impact criterion. Finally, Algorithms 4 and 5 decrease the frequency of large itemsets from which sensitive rules can be mined.

3.4 Rule hiding algorithms based on data fuzzification

According to the data fuzzification approach, a sequence of symbols in the new alphabet of an item {0, 1, ?} is associated with each transaction, where one symbol is associated with each item in the set of items I. As before, the ith value in the list of values is 1 if the transaction supports the ith item, and the value is 0 otherwise. The novelty of this approach is the insertion of an uncertainty symbol, a question mark, in a given position of the list of values, which means that there is no information on whether the transaction supports the corresponding item. In this case the confidence and the support of an association rule may not be unequivocally determined, but they can range between a minimum and a maximum level.


INPUT: the source database D, the number |D| of transactions in D, a set Rh of rules to hide, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  T'lr = {t ∈ D | t does not fully support rr and partially supports lr}
    foreach transaction t in T'lr do {
2.    t.num_items = |I| - |lr ∩ t.values_of_items|
    }
3.  Sort(T'lr) in descending order of number of items of lr contained
    Repeat until (conf(r) < min_conf) {
4.    t = T'lr[1]
5.    set_all_ones(t.values_of_items, lr)
6.    Nlr = Nlr + 1
7.    conf(r) = Nr / Nlr
8.    Remove t from T'lr
    }
9.  Remove r from Rh
}
End

Figure 3.1: Algorithm 1 for Rule Hiding by Confidence Decrease.


Algorithm 2

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in ascending order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(rr)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr - 1
8.    supp(r) = Nr / |D|
9.    conf(r) = Nr / Nlr
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 3

INPUT: the source database D, the size of the database |D|, a set Rh of rules to hide, the min_supp threshold, the min_conf threshold
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach rule r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
    foreach transaction t in Tr do {
2.    t.num_items = count(t)
    }
3.  Sort(Tr) in decreasing order of transaction size
    Repeat until (supp(r) < min_supp) or (conf(r) < min_conf) {
4.    t = Tr[1]
5.    j = choose_item(r)
6.    set_to_zero(j, t.values_of_items)
7.    Nr = Nr - 1
8.    Recompute Nlr
9.    supp(r) = Nr / |D|
10.   conf(r) = Nr / Nlr
11.   Remove t from Tr
    }
12. Remove r from Rh
}
End

Figure 3.2: Algorithms 2 and 3 for Rule Hiding by Support Decrease.
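As a rough illustration of the support-decrease idea behind Algorithms 2 and 3 (reusing the support and confidence helpers of Section 3.2), the following sketch zeroes a consequent item in the smallest transactions fully supporting the rule until the rule is hidden. The item choice here is arbitrary; Algorithm 3, and the DQ-based algorithms presented later, choose it according to an impact criterion.

def hide_by_support_decrease(db, X, Y, mst, mct):
    # distort transactions until the rule X => Y is no longer frequent/strong
    while support(X | Y, db) >= mst and confidence(X, Y, db) >= mct:
        fully = [t for t in db if all(t.get(i, 0) == 1 for i in X | Y)]
        if not fully:
            break
        fully.sort(key=lambda t: sum(t.values()))  # smallest transactions first
        item = next(iter(Y))  # an arbitrary consequent item (no impact criterion)
        fully[0][item] = 0    # delete the value in the victim transaction
    return db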

The minimum support of an itemset is defined as the percentage of transactions that certainly support the itemset, while the maximum support represents the percentage of transactions that support or could support the itemset. The minimum confidence of a rule is obviously the minimum level of confidence that the rule can assume based on the support values, and similarly for the maximum confidence. Given a rule r:

minconf(r) = (minsup(r) * 100) / maxsup(lr)    and    maxconf(r) = (maxsup(r) * 100) / minsup(lr)

where lr denotes the rule antecedent.

Considering the support interval and the minimum support threshold MST, we have the following cases for an itemset A:

• A is hidden when maxsup(A) is smaller than MST;

• A is visible with an uncertainty level when minsup(A) ≤ MST ≤ maxsup(A);

• A is visible if minsup(A) is greater than or equal to MST.

The same reasoning applies to the confidence interval and the minimum confidence threshold (MCT).
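These interval computations extend the bitmap helpers of Section 3.2 to the three-valued alphabet. A minimal sketch, assuming the string '?' marks the uncertainty symbol (the factor 100 of the formulas above is omitted, so values are ratios rather than percentages):

def min_max_support(itemset, db):
    # a transaction certainly supports the itemset if every item is 1;
    # it could support it if every item is either 1 or '?'
    n = len(db)
    certain = sum(all(t.get(i, 0) == 1 for i in itemset) for t in db)
    possible = sum(all(t.get(i, 0) in (1, "?") for i in itemset) for t in db)
    return certain / n, possible / n

def min_max_confidence(X, Y, db):
    minsup, maxsup = min_max_support(X | Y, db)
    lmin, lmax = min_max_support(X, db)
    return minsup / lmax, maxsup / lmin  # (minconf, maxconf)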


Algorithm 4

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of size and support
2.  Th = {TZ | Z ∈ Lh and ∀t ∈ D, t ∈ TZ ⇒ t supports Z}
Foreach Z in Lh {
3.  Sort(TZ) in ascending order of transaction size
    Repeat until (supp(Z) < min_supp) {
4.    t = pop_from(TZ)
5.    a = maximal support item in Z
6.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
7.      if (t ∈ TX) delete(t, TX)
      }
8.    delete(a, t, D)
    }
}
End

Algorithm 5

INPUT: the source database D, the size of the database |D|, a set L of large itemsets, the set Lh of large itemsets to hide, the min_supp threshold
OUTPUT: the database D modified by the hiding of the large itemsets in Lh

Begin
1.  Sort(Lh) in descending order of support and size
Foreach Z in Lh do {
2.  i = 0
3.  j = 0
4.  <TZ> = sequence of transactions supporting Z
5.  <Z> = sequence of items in Z
    Repeat until (supp(Z) < min_supp) {
6.    a = <Z>[i]
7.    t = <TZ>[j]
8.    aS = {X ∈ Lh | a ∈ X}
      Foreach X in aS {
9.      if (t ∈ TX) then delete(t, TX)
      }
10.   delete(a, t, D)
11.   i = (i + 1) modulo size(Z)
12.   j = j + 1
    }
}
End

Figure 3.3: Algorithms 4 and 5 for Rule Hiding by Support Decrease of large itemsets.

Traditionally, a rule hiding process takes place according to two different strategies: decreasing its support or its confidence. In this new context the adopted alternative strategies aim to introduce uncertainty in the frequency or the importance of the rules to hide. The two strategies reduce the minimum support and the minimum confidence of the itemsets generating these rules below the minimum support threshold (MST) and minimum confidence threshold (MCT), correspondingly, by a certain safety margin (SM) fixed by the user.

Algorithm 6, for reducing the support of the generating large itemset of a sensitive rule, replaces 1's by "?" for the items in transactions supporting the itemset, until its minimum support goes below the minimum support threshold MST by the fixed safety margin SM. Algorithms 7 and 8 operate by reducing the minimum confidence value of sensitive rules. The first one decreases the minimum support of the generating itemset of a sensitive rule by replacing items of the rule consequent by unknown values. The second one, instead, increases the maximum support value of the antecedent of the rule to hide by placing question marks in the place of the zero values of items in the antecedent.


INPUT: the database D, a set L of large itemsets, the set Lh of large itemsets to hide, MST and SM
OUTPUT: the database D modified by the fuzzification of the large itemsets in Lh

Begin
1.  Sort Lh in descending order of size and minimum support of the large itemsets
Foreach Z in Lh {
2.  TZ = {t ∈ D | t supports Z}
3.  Sort the transactions in TZ in ascending order of transaction size
    Repeat until (minsup(Z) < MST - SM) {
4.    Place a "?" mark for the item with the largest minimum support of Z in the next transaction in TZ
5.    Update the supports of the affected itemsets
6.    Update the database D
    }
}
End

Figure 3.4: Algorithm 6 for Rule Hiding by Support Reduction.

3.5 Considerations on Codmine Association Rule Algorithms

Based on the results obtained in the performed tests (see Appendix A for more details), it is possible to make some considerations on the presented algorithms. All these algorithms are able, in any case (with the exclusion of the first algorithm), to hide the target sensitive information. Moreover, the performance results are acceptable. The question is then: "Why is there the necessity to improve these algorithms if they all work well?" There exist some coarse parameters allowing us to measure the impact of the algorithms on the DQ of the released database. Just to illustrate the problem here, we note that during the tests we discovered in the sanitized database a considerable number of new rules (artifactual rules) not present in the original database. Moreover, we discovered that some of the non-sensitive rules contained in the original database were erased after the sanitization (ghost rules). In a critical database this behavior could have a relevant impact. For example, as usual, let us consider the Health Database. The introduction of artifactual rules may deceive a doctor who, on the basis of these rules, can choose a wrong therapy for a patient.


Algorithm 7

INPUT: the source database D, a set Rh of rules to hide, MST, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  Tr = {t ∈ D | t fully supports r}
2.  for each t in Tr count the number of items in t
3.  sort the transactions in Tr in ascending order of the number of items supported
    Repeat until (minsup(r) < MST - SM) or (minconf(r) < MCT - SM) {
4.    Choose the first transaction t ∈ Tr
5.    Choose the item j in rr with the highest minsup
6.    Place a "?" mark for the place of j in t
7.    Recompute the minsup(r)
8.    Recompute the minconf(r)
9.    Recompute the minconf of other affected rules
10.   Remove t from Tr
    }
11. Remove r from Rh
}
End

Algorithm 8

INPUT: the source database D, a set Rh of rules to hide, MCT and SM
OUTPUT: the database D transformed so that the rules in Rh cannot be mined

Begin
Foreach r in Rh do {
1.  T'lr = {t ∈ D | t partially supports lr and t does not fully support rr}
2.  for each t in T'lr count the number of items of lr in t
3.  sort the transactions in T'lr in descending order of the number of items of lr supported
    Repeat until (minconf(r) < MCT - SM) {
4.    Choose the first transaction t ∈ T'lr
5.    Place a "?" mark in t for the items in lr that are not supported by t
6.    Recompute the maxsup(lr)
7.    Recompute the minconf(r)
8.    Recompute the minconf of other affected rules
9.    Remove t from T'lr
    }
10. Remove r from Rh
}
End

Figure 3.5: Algorithms 7 and 8 for Rule Hiding by Confidence Decrease.

In the same way, if some other rules are erased, a doctor could have an incomplete picture of the patient's health, generating then some unpleasant situations. This behavior is due to the fact that the algorithms presented act in the same way on any type of categorical database, without knowing anything about the use of the database, the relationships between data, etc. More specifically, we note that the key point of this incorrect behavior is the selection of the item to be modified. In all these algorithms, in fact, the items are selected randomly, without considering their relevance and their impact on the database. We then introduce an approach to select the proper set of items, minimizing the impact on the database.

3.6 The Data Quality Strategy (First Part)

As explained before, we want to develop a family of PPDM algorithms which take into account, during the sanitization phase, aspects related to the particular database we want to modify. Moreover, we want such a PPDM family to be driven by the DQ assurance.


In the context of Association Rule PPDM algorithms, and more in detail in the family of perturbation algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the Asset schema, the DQ damage is computed, simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications could, in the case of very complex information, be onerous in terms of performance. We note that in a target association rule a particular type of item may exist that is not contained in any DMG of the database Asset. We call this type of item a Zero Impact Item. The modification of these particular items has no effect on the DQ of the database, because in the Asset description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for the existence of Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection and we use one of these Zero Impact Items in order to hide the target rule.
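The selection logic just described amounts to the following minimal sketch, with hypothetical names: asset_items stands for the set of items appearing in some DMG of the Asset, and impact for the Asset inspection step detailed later in Figure 3.8.

def select_item(rule_items, asset_items, impact):
    # Zero Impact Items: items of the rule absent from every DMG of the Asset
    zero_impact = [i for i in rule_items if i not in asset_items]
    if zero_impact:
        return zero_impact[0]  # no Asset inspection needed
    # otherwise rank the items by simulated DQ damage and pick the lowest
    return min(rule_items, key=impact)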


3.7 Data Quality Distortion Based Algorithm

In distortion based algorithms the database sanitization (i.e. the operation by which a sensitive rule is hidden) is obtained by distorting some values contained in the transactions supporting the rule (partially or fully, depending on the strategy adopted) to be hidden. In what we propose, assuming to work with categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms. Once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule is under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this it is possible to use the well known APRIORI algorithm [5], which, given a database, extracts all the rules contained in it that have a confidence over a threshold level². Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item exists in the rule that can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is a little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but only modified. The Item Impact Rank Algorithm is reported in Figure 3.8. The Sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set and, if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

¹A Market Basket database is a database in which each transaction represents the products acquired by a customer.

²This threshold level represents the confidence over which a rule can be identified as relevant and sure.


[Figure 3.6 (flow chart): identification of a sensitive rule → identification of the set of zero-impact items → if the set is not empty, select an item from the zero-impact set; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (accuracy, consistency), rank the items by quality impact and select the best item → select a transaction fully supporting the sensitive rule → convert the item value from 1 to 0 → repeat until the rule is hidden → END.]

Figure 3.6: Flow chart of the DQDB Algorithm.


3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. For the sake of completeness, we try to give an idea of its complexity:

• The search of all zero impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden
OUTPUT: the (possibly empty) set Zitem of zero impact items discovered

1.  Begin
2.  Zitem = ∅
3.  Rsup = Rh
4.  Isup = list of the items contained in Rsup
5.  While (Isup ≠ ∅) {
6.    res = 0
7.    Select item from Isup
8.    res = Search(Asset, item)
9.    if (res == 0) then Zitem = Zitem + item
10.   Isup = Isup - item
    }
11. return(Zitem)
12. End

Figure 3.7: Zero Impact Algorithm.

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1.  Begin
2.  Qitem = ∅
3.  Confa = Sup(Rh) / Sup(ant(Rh))
4.  Confp = Confa
5.  Nstep_if_ant = 0
6.  Nstep_if_post = 0
7.  While (Confa > Th) do {
8.    Nstep_if_ant++
9.    Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
    }
10. While (Confp > Th) do {
11.   Nstep_if_post++
12.   Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
    }
13. For each item in Rh do {
14.   if (item ∈ ant(Rh)) then N = Nstep_if_ant
15.   else N = Nstep_if_post
16.   For each AIS ∈ Asset do {
17.     node = recover_item(AIS, item)
18.     Accur_Cost = node.accuracy_Weight * N
19.     Constr_Cost = Constr_surf(N, node)
20.     item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
      }
    }
21. sort_by_impact(Items)
22. Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (we recall here that ant(Rh) indicates the antecedent component of a rule).


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity of O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.
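The two step counters of Figure 3.8 can be read as follows: if the chosen item belongs to the consequent, each sanitized transaction lowers only the rule support, while if it belongs to the antecedent both counts drop, since the modified transactions fully support the rule. A worked sketch, assuming integer support counts:

def steps_if_consequent(sup_r, sup_ant, th):
    # smallest N with (sup_r - N) / sup_ant < th
    n = 0
    while (sup_r - n) / sup_ant >= th:
        n += 1
    return n

def steps_if_antecedent(sup_r, sup_ant, th):
    # smallest N with (sup_r - N) / (sup_ant - N) < th (cf. Figure 3.8)
    n = 0
    while sup_ant - n > 0 and (sup_r - n) / (sup_ant - n) >= th:
        n += 1
    return n

# e.g. sup(Rh) = 40 transactions, sup(ant(Rh)) = 50, threshold 0.6:
print(steps_if_consequent(40, 50, 0.6))  # 11: (40-11)/50 = 0.58
print(steps_if_antecedent(40, 50, 0.6))  # 26: (40-26)/(50-26) = 0.583...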

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the sanitized database

1.  Begin
2.  Zitem = DQDB_Zero_Impact(Asset, Rh)
3.  if (Zitem ≠ ∅) then item = random_sel(Zitem)
4.  else {
5.    Impact_set = Items_Impact_Rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
6.    item = best(Impact_set)
    }
7.  Tr = {t ∈ D | t fully supports Rh}
8.  While (Rh.Conf > Threshold) do
9.    sanitize(Tr, item)
10. End

Figure 3.9: The Distortion Based Sanitization Algorithm.

3.8 Data Quality Based Blocking Algorithm

In blocking based algorithms the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the data quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the number of steps evaluation (which is of course different) and the data quality evaluation.


In fact, in this case it does not make sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason in this case we consider the pair (completeness, consistency) and not the pair (accuracy, consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Figure 3.10 (flow chart): identical in structure to Figure 3.6, except that the DQ impact of each item is evaluated in terms of completeness and consistency, and the value of the selected item in the chosen transactions is converted to "?" instead of 0.]

Figure 3.10: Flow chart of the BBDB Algorithm.

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDQ).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem ordered by quality impact

1.  Begin
2.  Qitem = ∅
3.  Confa = Sup(Rh) / Sup(ant(Rh))
4.  Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5.  Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6.  For each item in Rh do {
7.    if (item ∈ ant(Rh)) then N = Nstep_ant
8.    else N = Nstep_post
9.    For each AIS ∈ Asset do {
10.     node = recover_item(AIS, item)
11.     Complet_Cost = node.Complet_Weight * N
12.     Constr_Cost = Constr_surf(N, node)
13.     item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
      }
    }
14. sort_by_impact(Items)
15. Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm.

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common (e.g. if Ra = {a, b} ⇒ {c} and Rp = {a, d} ⇒ {e}, modifying the shared item a while hiding Ra may alter transactions on which the sanitization of Rp relied).

For this reason we propose here an improvement of our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we try to limit the number of different items modified during the whole process (i.e. during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e. we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized and removed from the sensitive set once they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1.  Begin
2.  Rsup = Rs
3.  while (Rsup ≠ ∅) do {
4.    select r from Rsup
5.    for (i = 0; i ≤ transaction_max_size; i++) do
6.      if (item[i] ∈ r) then {
7.        item[i].count++
8.        item[i].list = item[i].list ∪ r
        }
9.    Rsup = Rsup - r
    }
10. sort(item)
11. Return(item)
    End

Figure 3.12: The Item Occurrences Classification Algorithm.

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1.  Begin
2.  Rsup = Rs
3.  Isup = list of all the items
4.  Zitem = ∅
5.  while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
6.    select r from Rsup
7.    for each (item ∈ r) do {
8.      if (item ∈ Isup) then {
9.        if (item ∉ Asset) then Zitem = Zitem + item
10.       Isup = Isup - item
        }
      }
11.   Rsup = Rsup - r
    }
12. return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm.


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| * |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
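The counting step of the pre-hiding phase can be sketched compactly in Python; in this illustration (hypothetical, using the standard library Counter) the items are returned sorted by the number of sensitive rules they support, together with the per-item rule lists used in the subsequent steps:

from collections import Counter

def classify_items(sensitive_rules):
    occurrences = Counter()
    supported = {}
    for rid, rule_items in enumerate(sensitive_rules):
        for item in rule_items:
            occurrences[item] += 1                      # item[i].count++
            supported.setdefault(item, []).append(rid)  # item[i].list
    return occurrences.most_common(), supported

rules = [{"a", "b", "c"}, {"a", "d"}, {"a", "b"}]
print(classify_items(rules)[0])  # [('a', 3), ('b', 2), ('c', 1), ('d', 1)]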

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items, supporting a rule in the sensitive rules set, that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason we present in what follows the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1.  Begin
2.  Izm = {item ∈ Itemlist | item ∈ Zitem}
3.  for (i = 0; i ≤ |Izm|; i++) do {
4.    Rss = {r ∈ Rs | Izm[i] ∈ r}
5.    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
6.    while (Tss ≠ ∅) and (Rss ≠ ∅) do {
7.      select a transaction t from Tss
8.      Tss = Tss - t
9.      Sanitize(t, Izm[i])
10.     Recompute_Sup_Conf(Rss)
11.     for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
12.       Rss = Rss - r
13.       Rs = Rs - r
        }
      }
    }
    End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm.
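A compact sketch of this burst-style hiding, reusing the support and confidence helpers of Section 3.2: rules are represented as (antecedent, consequent) pairs of sets, and the sanitization primitive shown is the distortion one (replacing 0 with "?" would give the blocking variant).

def multi_rule_zero_impact_hiding(db, sensitive_rules, zero_items, mst, mct):
    for item in zero_items:
        # the Rss set: sensitive rules supported by this zero-impact item
        rss = [(x, y) for (x, y) in sensitive_rules if item in x | y]
        for x, y in rss:
            for t in db:
                if support(x | y, db) < mst or confidence(x, y, db) < mct:
                    break  # the rule is already hidden
                if all(t.get(i, 0) == 1 for i in x | y):
                    t[item] = 0  # sanitize a fully supporting transaction
            if support(x | y, db) < mst or confidence(x, y, db) < mct:
                sensitive_rules.remove((x, y))  # the rule results hidden
    return db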

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item contains the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1.  Begin
2.  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3.  for each item ∈ Iml do {
4.    for each rule ∈ item.list do {
5.      N = compute_step(item, thresholds)
6.      For each AIS ∈ Asset do {
7.        node = recover_item(AIS, item)
8.        itemlist.rule.cost = itemlist.rule.cost + ComputeCost(node, N)
        }
      }
9.    item.maxcost = maxcost(item)
    }
10. Sort Iml by max cost and rules supported
11. While (Rs ≠ ∅) do {
12.   Rss = {r ∈ Rs | Iml[1] ∈ r}
13.   Tss = {t ∈ D | t fully supports a rule ∈ Rss}
14.   while (Tss ≠ ∅) and (Rss ≠ ∅) do {
15.     select a transaction t from Tss
16.     Tss = Tss - t
17.     Sanitize(t, Iml[1])
18.     Recompute_Sup_Conf(Rss)
19.     for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
20.       Rss = Rss - r
21.       Rs = Rs - r
        }
      }
22.   Iml = Iml - Iml[1]
    }
    End

Figure 3.15: Low Data Quality Impact Algorithm.


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1.  Begin
2.  Item = Item_Occ_Class(Rs, Itemlist)
3.  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4.  if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
5.  if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
    End

Figure 3.16: The General Algorithm.

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989, ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001, ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993, ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000, ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485, Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999, IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input, Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A framework for evaluating privacy preserving data mining algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information driven evaluation of data hiding algorithms. In 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Transactions on Information Theory, IT-14:27-31, January 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Transactions on Software Engineering, SE-8(6):574-582, April 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Transactions on Database Systems, 6(1):113-139, March 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. Journal of the American Statistical Association, 75(370):377-385, June 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding association rules by using confidence and support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20:364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7):69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5(3):291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, 23(7):423-437, 1998.

[27] J. Domingo-Ferrer and V. Torra. A quantitative comparison of disclosure control methods for microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes and L. Zayatz (eds.), pp. 113-134. North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the information revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: the R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, C-38(3):408-423, March 1989. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in privacy preserving data mining. SIGKDD Explorations Newsletter, 4(2):43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, Vol. I. Addison-Wesley, Reading, Massachusetts, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the Tenth ACM Symposium on Theory of Computing, pp. 114-118, 1978. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge discovery in databases: an overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Transactions on Software Engineering, SE-11(7):591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: an efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: a robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (Jim Gray, series editor). Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system: combining neural networks and rule-based programming in an operational application. In Proceedings of the Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: a k-means clustering algorithm. Applied Statistics, 28:100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A logical model for privacy protection. Lecture Notes in Computer Science, Vol. 2200, pp. 110-124, January 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy preserving distributed mining of association rules on horizontally partitioned data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: a hierarchical clustering algorithm using dynamic modeling. Computer, 32:68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2):191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data mining approaches for intrusion detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, 40(1):89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15:177-206, 2002. Springer Verlag.

[61] R. M. Losee. A discipline independent definition of information. Journal of the American Society for Information Science, 48(3):254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A framework for the security assessment of remote control applications of critical infrastructure. In 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proceedings of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward standardization in privacy preserving data mining. In ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy preserving frequent itemset mining. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy preserving clustering by data transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data quality and system theory. Communications of the ACM, 41(2):66-71, February 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for the compromise of confidential information in statistical databases. ACM Transactions on Database Systems, 12(4):593-608, December 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An effective hash based algorithm for mining association rules. In Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: a decision tree classifier that integrates building and pruning. Data Mining and Knowledge Discovery, 4(4):315-344, 2000.

[79] R. Rastogi and K. Shim. Mining optimized association rules with categorical and numeric attributes. In Proceedings of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. The Computer Journal, 26(3):218-223, 1983. British Computer Society.

[87] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423 and 623-656, July and October 1948.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. In Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: an optimally efficient algorithm for the single link cluster method. The Computer Journal, 16:30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 3(4):301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining data quality. Communications of the ACM, 41(2):54-58, 1998. ACM Press.

[95] M. Trottini. A decision-theoretic approach to data disclosure problems. Research in Official Statistics, 4:7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan, Computer Technology Institute and Sabanci University. CODMINE. IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1):50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11):86-95, November 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 12(4):5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer Verlag, New York, 2001.

[107] N. Ye and X. Li. A scalable clustering technique for intrusion signature recognition. In 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

18 CHAPTER 2 THE DATA QUALITY SCHEME

AIS Cons

In figure 23 is showed a DMG (AIS Cons) related to the energy consump-tion table In this case the value of the information stored in this table ismainly related to the maximum energy consumption and the average energyconsumption For this reason their accuracy and completeness weights mustbe high (to magnify their relevance in the sanitization phase) However itis even necessary to preserve the coherence between the values stored in theEnergy consumption per hour vector and the average energy consumption perday In order to do this a complex constraint is used In this way we describethe possibility to permute or to modify the values of energy consumption perhour (in the figure En 1 am En 2 am etc) maintaining however the averageenergy consumption consistent

At the same way it is possible to use another complex constraint in orderto express the same concept for the maximum consumption per day in relationwith the maximum consumption per hour Exist finally a relation between themaximum energy consumption per hour and the correspondent average con-sumption per hour In fact the max consumption per hour cannot be smallerthan the average consumption per hour This concept is represented by a setof simple constraints Finally the weight of accuracy and completeness of theEnergy Meter ID are equal to 1 because a condition needed to make consistentthe database is to maintain at least the individuality of the electric meters

By the use of these two schemas we described some characteristics of theinformation stored in the database related with their meaning and their usageIn the following chapters we show some PPDM algorithms realized in order touse these additional schema in order to perform a softer (from the DQ point ofview) sanitization

Chapter 3

Data Quality BasedAlgorithms

A large number of Privacy Preserving Data Mining algorithms have been pro-posed The goal of these algorithms is to prevent privacy breaches arising fromthe use of a particular Data Mining Technique Therefore algorithms existdesigned to limit the malicious use of Clustering Algorithms of ClassificationAlgorithms of Association Rule algorithms

In this report we focus on Association Rule Algorithms We choose to studya new solution for this family of algorithms for the following motivations

bull Association Rule Data Mining techniques has been investigated for about20 years For this reason exists a large number of DM algorithms andapplication examples exist that we take as a starting point to study howto contrast the malicious use of this technique

bull Association Rule Data Mining can be applied in a very intuitive way tocategoric databases A categoric database can be easily synthesized bythe use of automatic tools allowing us to quickly obtain for our tests a bigrange of different type of databases with different characteristics

bull Association Rules seem to be the most appropriate to be used in combi-nation with the IQM schema

bull Results of classification data mining algorithms can be easily convertedinto association rules This it is in prospective very important because itallows us to transfer the experience in association rule PPDM family tothe classification PPDM family without much effort

In what follows we present before some algorithms developed in the context ofCodmine project [97] (we remember here that our contribution in the CodmineProject was related to the algorithm evaluation and to the testing frameworkand not with the Algorithm development) We present then some considerations

19

20 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

about these algorithms and finally we present our new approach to associationrule PPDM algorithms

31 ConMine Algorithms

We present here a short description of the privacy preserving data mining algo-rithms which have been developed in the context of Codmine project In theassociation rule hiding context two different heuristic-based approaches havebeen proposed one is based on data deletion according to which the originaldatabase is modified by deleting the data values the other is based on datafuzzification thus hiding the sensitive information by inserting unknown valuesto the database Before giving more details concerning the proposed approacheswe briefly recall the problem formulation of association rule hiding

32 Problem Formulation

Let D be a transactional database and I = i1 in be a set of items whichrepresents the set of products that can be sold by the database owner Eachtransaction can be considered as subset of items in I We assume that the itemsin a transaction or an itemset are sorted in lexicographic order Accordingto a bitmap notation each transaction t in the database D is represented asa triple lt TID values of items size gt where TID is the identifier of thetransaction t values of items is a list of values one value for each item in Iassociated with transaction t An item is supported by a transaction t if itsvalue in the values of items is 1 and it is not supported by t if its value inthe values of items is 0 Size is the number of 1 values which appear in thevalues of items that is the number of items supported by the transaction Anassociation rule is an implication of the form X rArr Y between disjoint itemsets X

and Y in I Each rule is assigned both to a support and to a confidence valueThe first one represents the probability to find in the database transactionscontaining all the items in X cup Y whereas the confidence is the probabilityto find transactions containing all the items in X cup Y once we know that theycontain X Note that while the support is a measure of a frequency of a rule theconfidence is a measure of the strength of the relation between the antecedentX and the consequent Y of the rule Association rule mining process consistsof two steps 1) the identification of all the frequent itemsets that is all theitemsets whose supports are bigger than a pre-determined minimum supportthreshold min supp 2) the generation of strong association rules from thefrequent itemsets that is those frequent rules whose confidence values are biggerthan a minimum confidence threshold min conf Along with the confidenceand the support a sensitivity level is assigned only to both frequent and strongrules If a strong and frequent rule is above a certain sensitivity level the hidingprocess should be applied in such a way that either the frequency or the strengthof the rule will be reduced below the min supp and the min conf correspondingly

33 RULE HIDING ALGORITHMS BASED ON DATA DELETION 21

The problem of association rule hiding can be stated as follows given a databaseD a set R of relevant rules that are mined from D and a subset Rh of R wewant to transform D into a database Dprime in such a way that the rules in R canstill be mined except for the rules in Rh

33 Rule hiding algorithms based on data dele-tion

Rule hiding approach based on data deletion operates by deleting the datavalues in order to reduce either the support or the confidence of the sensi-tive rule below the corresponding minimum support or confidence thresholdsMST or MCT which are fixed by the user in such a way that the changesintroduced in the database are minimal The decrease in the support of a ruleX rArr Y is done by decreasing the support of its generating itemset X cup Y that is the support of either the rule antecedent X or the rule consequent Y through transactions that fully support the rule The decrease in the confidence

Conf(X rArr Y ) = Supp(XrArrY )Supp(X) of the rule X rArr Y instead can be accomplished

1) by decreasing the support of the rule consequent Y through transactionsthat support both X and Y 2) by increasing the support of the rule antecedentthrough transactions that do not fully support Y and partially support X Fivedata deletion algorithms have been developed for privacy preserving data min-ing whose pseudocodes are presented in Figures 31 32 and 33 Algorithm1 decreases the confidence of a sensitive rule below the min conf threshold byincreasing the support of the rule antecedent while keeping the rule supportfixed Algorithms 2 reduces the rule support by decreasing the frequency of theconsequent through transactions that support the rule Algorithm 3 operates asthe previous one with the only exception that the choice of the item to removeis done according to a minimum impact criterion Finally Algorithms 4 and 5decrease the frequency of large itemsets from which sensitive rules can be mined

34 Rule hiding algorithms based on data fuzzi-fication

According to the data fuzzification approach a sequence of symbols in the newalphabet of an item 01 is associated with each transaction where one sym-bol is associated with each item in the set of items I As before the ith valuein the list of values is 1 if the transaction supports the ith item the value is 0otherwise The novelty of this approach is the insertion of an uncertainty sym-bol a question mark in a given position of the list of values which means thatthere is no information on whether the transaction supports the correspondingitem In this case the confidence and the support of an association rule maybe not unequivocally determined but they can range between a minimum anda maximum level The minimum support of an itemset is defined as the per-

22 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the number |D| of transactions in D a set Rh of rules to hide the min conf thresholdOUTPUT the database D transformed so that the rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 T prime

lr= t isin D | t does not fully supports rr and partially supports lr

foreach transaction t in T prime

lrdo

2 tnum items = |I| minus |lr cap tvalues of items|

3 Sort(T prime

lr) in descending order of number of items of lr contained

Repeat until(conf(r) lt min conf)

4 t = T prime

lr[1]

5 set all ones(tvalues of items lr)6 Nlr = Nlr + 17 conf(r) = NrNlr

8 Remove t from T prime

lr9 Remove r from Rh

End

Figure 31 Algorithm 1 for Rule Hiding by Confidence Decrease

34 RULE HIDING ALGORITHMS BASED ON DATA FUZZIFICATION23

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(rr)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 supp(r) = Nr|D|9 conf(r) = NrNlr

10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so that therules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in decreasing order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(r)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 Recompute Nlr

9 supp(r) = Nr|D|10 conf(r) = NrNlr

11 Remove t from Tr

12 Remove r from Rh

End

Figure 32 Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset while the maximumsupport represents the percentage of transactions that support or could supportthe itemset The minimum confidence of a rule is obviously the minimum levelof confidence that the rule can assume based on the support value and similarly

for the maximum confidence Given a rule r minconf(r) = minsup(r)lowast100maxsup(lr) and

maxconf(r) = maxsup(r)lowast100minsup(lr) where lr denotes the rule antecedent

Considering the support interval and the minimum support threshold MST we have the following cases for an itemset A

bull A is hidden when maxsup(A) is smaller than MST

bull A is visible with an uncertainty level when minsup(A) le MST lemaxsup(A)

bull A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confi-dence threshold (MCT )

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

bull The Cost of the Impact Item Set computation is less immediate to formu-late However the computation of the sanitization step is linear (O(bSup(Rh)c))Moreover the constraint exploration in the worst case has a complex-ity in O(NC) where NC is the total number of constraints containedin the Asset This operation is repeated for each item in the rule Rh

and then it takes |Items|O(NC) Finally in order to sort the Item-set we can use the MergeSort algorithm and then the complexity is|itemset|O(lg(|itemset|))

bull The sanitization has a complexity in O(N) where N is the number oftransaction modified in order to hide the rule

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hidden the target database DOUTPUT the Sanitized Database

1Begin

2 Zitem=DQDB Zero Impact(AssetRh)3 if (Zitem 6= empty) then item=random sel(Zitem)4 else5 6 Impact set=Items Impact rank(AssetRhSup(Rh)Sup(ant(Rh)) Threshold)7 item=best(Impact set)8 9 Tr=t isin D|tfully support Rh10While (RhConf gtThreshold) do11sanitize(Tritem)12End

Figure 39 The Distortion Based Sanitization Algorithm

38 Data Quality Based Blocking Algorithm

In the Blocking based algorithms the idea is to substitute the value of an itemsupporting the rule we want to hide with a meaningless symbol In this way weintroduce an uncertainty related to the question ldquoHow to interpret this valuerdquoThe final effect is then the same of the Distortion Based Algorithm The strategywe propose here as showed in Figure 310 is very similar to the one presentedin the previous section the only difference is how to compute the Impact on theData Quality For this reason the Zero Impact Algorithm is exactly the samepresented before for the DQDB Algorithm However some adjustments arenecessary in order to implement the Impact Item Rank Algorithm The majorsmodifications are related to the Number of Step Evaluation (that is of coursedifferent) and the Data Quality evaluation In fact in this case it does not make

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact13(Completeness-13

Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value to rdquo13

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 310 Flow Chart of BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for theDBDQ Algorithm The only relevant difference is the insertion of the ldquordquo valueinstead of the 0 value From the complexity point of view the complexity is thesame of the complexity of the previous algorithm

39 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on theDQ of the database is reduced However we note an interesting phenomenonIn what follows we give a description of this phenomena an interpretation anda first solution

Consider a DataBase D to be sanitized D contains a set of four sensitiverules In order to protect these rules we normally execute the following steps

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)

39 THE DATA QUALITY STRATEGY (SECOND PART) 33

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Nstepant=countstep ant(confaThresholds (MinMax))5 Nsteppost=countstep post(confaThresholds (MinMax))6 For each item in Rh do7 8 if (item isin ant(Rh)) then N=Nstepant9 else N=Nsteppost10 For each AIS isin Asset do11 12 node=recover item(AISitem)

24 Complet Cost=nodeComplet Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

+(Complet Cost lowastAISComplet weight)++(Constr Cost lowastAISConstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 311 Item Impact Rank Algorithm for the Blocking Based Algorithm

4 We reapply the process for all the rules contained in the set of sensitiverules

What we observed in a large number of tests is that even if after everysanitization the DQ related to a single rule is preserved (or at least the damageis mitigated) at the end of the whole sanitization (every rule hidden) thedatabase had a lower level of DQ that the one we expected

Reflecting on the motivation of this behavior we argue that the sanitizationof a certain rule Ra may damage the DQ related to a previously sanitized ruleRp having for example some items in common

For this reason here we propose an improvement to our strategy to address(but not to completely solve) this problem It is important to notice that thisis a common problem for all the PPDM Algorithm and it is not related only tothe algorithm we proposed in the previous sections

The new strategy is based on the following intuition The Global DQ degra-dation is due to the fact that in some sanitization items are modified that wereinvolved in other already sanitized rules If we try to limit the number of dif-ferent items modified during the whole process (ie during the sanitization ofthe whole rules set) we obtain a better level of DQ An approach to limit thenumber of modified items is to identify the items supporting the biggest numberof sensitive rules as the most suitable to be modified and then performing in aburst the sanitization of all the rules The strategy we adopt can be summarizedas follows

1 We classify the items supporting the rules contained in the sensitive setanalyzing how many rules simultaneously contain the same item (we count

34 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

the occurrences of the items in the different rules)

2 We identify given the sensitive set the Global Set of Zero Impact Items

3 We order the Zero Impact Set by considering the previously counted oc-currences

4 The sanitization starts by considering the Zero Impact Items that supportthe maximum number of Rules

5 Simultaneously all the rules supported by the chosen item are sanitizedand removed from the sensitive set when they result hidden

6 If the Zero Impact Set is not sufficient to hide every sensitive rule for theremaining items their DQ Impact is computed

7 From the resulting classification if several items exist with the same im-pact the one supporting the highest number of rules is chosen

8 The process continues until all the rules are hidden

From a methodological point of view we can divide the complete strategy weproposed into three main blocks

bull Pre Hiding phase

bull Multi Zero Impact Hiding

bull Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before For thisreason we present in the remainder of this report the algorithms giving at theend a more general algorithm

310 PRE-HIDING PHASE 35

310 Pre-Hiding Phase

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the items supporting the rules sorted by occurrences number

1Begin

2 Rsup=Rs3 while (Rsup 6= empty) do4 5 select r from Rsup6 for (i = 0 i + + i le transactionmaxsize) do7 if (item[i] isin r) then8 9 item[i]count++10 item[i]list=item[i]listcupr11 12 Rsup=Rsupminus r1314sort(item)15Return(item) End

Figure 312 The Item Occurrences classification algorithm

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the Zero Impact Items

1Begin

2 Rsup=Rs3 Isup=list of all the items4 Zitem = empty5 while (Rsup 6= empty) and (Isup 6= empty) do6 7 select r from Rsup8 for each (item isin r) do9 10 if (item isin Isup) then11 if (item isin Asset) then Zitem = Zitem + item12 Isup = Isupminus item13 14 Rsup = Rsupminus r1516return(Zitem)End

Figure 313 The Multi Rule Zero Impact Algorithm

36 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

This phase executes all the operations not specifically related to the hidingoperations More in details in this phase the items are classified according tothe number of supported rules Moreover the Zero Impact Items are classifiedaccording to the same criteria Figure 312 reports the algorithm used to discoverthe occurrences of the items This algorithm takes as input the set of sensitiverules and the list of all the different items defined for the target database Forevery rule contained in the set we check if the rule is supported by every itemcontained in the Item list If this is the case a counter associated with the itemis incremented and an identifier associated with the rule is inserted into the listof the rules supported by a certain item This information is used in one of thesubsequent steps in order to optimize the computation Once a rule has beenchecked with every item in the list it is removed from the sensitive set Froma computational point of view it is relevant to note that the entire process isexecuted a number of times equal to

|sensitive set| lowast |Item List|

Figure 313 reports the algorithm used to identify the Zero Impact Items forthe sensitive Rules set It is quite similar to the one presented in the previoussection the difference is just that it searches in a burst the Zero-Impact itemsfor all the sensitive rules

311 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two importantinformation

bull The items supporting more than a rule and more in details the exactnumber of rules supported by these Items

bull The items supporting a rule isin sensitive rules set that have the impor-tant property of being without impact on the DQ of the database

Using these relevant information we are then able to start the sanitizationprocess for those rules supported by Zero Impact Items The strategy we adoptis thus as follows

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive Rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
        Rss = Rss − r
        Rs = Rs − r
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
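A minimal executable rendering of the algorithm in Figure 3.14, under the same assumptions (transactions modelled as Python sets, sanitize() standing for the blocking or distortion black box, and a small helper support_confidence that we introduce here for the recomputation step), could be the following:

    def support_confidence(db, rule):
        # support and confidence of rule = (antecedent, consequent)
        # over a list db of transactions, each modelled as a set of items
        antecedent, consequent = rule
        n_both = sum(1 for t in db if antecedent | consequent <= t)
        n_ant = sum(1 for t in db if antecedent <= t)
        return n_both / len(db), (n_both / n_ant if n_ant else 0.0)

    def multi_rule_zero_impact_hiding(db, rs, zero_impact, min_sup, min_conf, sanitize):
        for item in zero_impact:
            rss = [r for r in rs if item in (r[0] | r[1])]
            tss = [t for t in db if any((r[0] | r[1]) <= t for r in rss)]
            while tss and rss:
                t = tss.pop()        # select a transaction supporting some rule in Rss
                sanitize(t, item)    # black-box distortion or blocking step
                for r in list(rss):  # drop the rules that are now hidden
                    sup, conf = support_confidence(db, r)
                    if sup < min_sup or conf < min_conf:
                        rss.remove(r)
                        rs.remove(r)
        return db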

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive Rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive Rules set. Each item contains the list of the supported rules.

2. For each item and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by Max Impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero-Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if other rules remain to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive Rules set Rs to be hidden, the Thresholds Th, the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
    for each rule ∈ item.list do
      N = compute_step(item, thresholds)
      for each AIS ∈ Asset do
        node = recover_item(AIS, item)
        item.list[rule].cost = item.list[rule].cost + ComputeCost(node, N)
    item.maxcost = maxcost(item)
  Sort Iml by max cost and number of rules supported
  while (Rs ≠ ∅) do
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
        Rss = Rss − r
        Rs = Rs − r
    Iml = Iml − Iml[1]
End

Figure 3.15: The Low Data Quality Impact Algorithm
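The ranking and hiding loop of Figure 3.15 can be sketched in the same style. Here estimate_impact(item, rule) is a stand-in of our own for the Asset inspection (the steps from compute_step to ComputeCost), while support_confidence and the rule model are those of the previous sketch:

    from collections import defaultdict

    def multi_rule_low_impact_hiding(db, rs, estimate_impact,
                                     min_sup, min_conf, sanitize):
        # collect the items supporting the sensitive rules
        supported = defaultdict(list)
        for r in rs:
            for item in r[0] | r[1]:
                supported[item].append(r)
        # sort by maximum estimated DQ impact; ties go to the item
        # supporting more rules (step 7 of the strategy)
        order = sorted(supported, key=lambda it: (
            max(estimate_impact(it, r) for r in supported[it]),
            -len(supported[it])))
        for item in order:
            if not rs:
                break
            rss = [r for r in supported[item] if r in rs]
            tss = [t for t in db if any((r[0] | r[1]) <= t for r in rss)]
            while tss and rss:
                t = tss.pop()
                sanitize(t, item)
                for r in list(rss):
                    sup, conf = support_confidence(db, r)
                    if sup < min_sup or conf < min_conf:
                        rss.remove(r)
                        rs.remove(r)
        return db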


INPUT: the sensitive Rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
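Putting the pieces together, in our illustrative Python model the general algorithm of Figure 3.16 reduces to the composition of the routines sketched above; the toy call at the end (hypothetical data) uses item deletion as the sanitization black box:

    def dq_based_sanitization(db, rs, asset_items, estimate_impact,
                              min_sup, min_conf, sanitize):
        # pre-hiding statistics (kept for symmetry with Figure 3.16; the
        # low-impact routine recomputes the item classification internally)
        classify_item_occurrences(rs)
        zitems = find_zero_impact_items(rs, asset_items)
        if zitems:
            multi_rule_zero_impact_hiding(db, rs, zitems, min_sup, min_conf, sanitize)
        if rs:  # rules still to be hidden
            multi_rule_low_impact_hiding(db, rs, estimate_impact,
                                         min_sup, min_conf, sanitize)
        return db

    # toy usage: hide a => b by deleting items from supporting transactions
    db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
    rs = [(frozenset({"a"}), frozenset({"b"}))]
    dq_based_sanitization(db, rs, asset_items={"a", "c"},
                          estimate_impact=lambda item, rule: 1.0,
                          min_sup=0.3, min_conf=0.6,
                          sanitize=lambda t, item: t.discard(item))
    print(db, rs)  # the rule a => b is now hidden and rs is empty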

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Vol. 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input, Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, Vol. 22, pp. 86-105, 1955.


[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, Vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Vol. 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., Vol. 4, Number 2, pp. 43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, Jan. 2001, pp. 110-124. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, Vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, Vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, Vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 3, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, Vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission

EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations, while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 21: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

Chapter 3

Data Quality BasedAlgorithms

A large number of Privacy Preserving Data Mining algorithms have been pro-posed The goal of these algorithms is to prevent privacy breaches arising fromthe use of a particular Data Mining Technique Therefore algorithms existdesigned to limit the malicious use of Clustering Algorithms of ClassificationAlgorithms of Association Rule algorithms

In this report we focus on Association Rule Algorithms We choose to studya new solution for this family of algorithms for the following motivations

bull Association Rule Data Mining techniques has been investigated for about20 years For this reason exists a large number of DM algorithms andapplication examples exist that we take as a starting point to study howto contrast the malicious use of this technique

bull Association Rule Data Mining can be applied in a very intuitive way tocategoric databases A categoric database can be easily synthesized bythe use of automatic tools allowing us to quickly obtain for our tests a bigrange of different type of databases with different characteristics

bull Association Rules seem to be the most appropriate to be used in combi-nation with the IQM schema

bull Results of classification data mining algorithms can be easily convertedinto association rules This it is in prospective very important because itallows us to transfer the experience in association rule PPDM family tothe classification PPDM family without much effort

In what follows we present before some algorithms developed in the context ofCodmine project [97] (we remember here that our contribution in the CodmineProject was related to the algorithm evaluation and to the testing frameworkand not with the Algorithm development) We present then some considerations

19

20 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

about these algorithms and finally we present our new approach to associationrule PPDM algorithms

31 ConMine Algorithms

We present here a short description of the privacy preserving data mining algo-rithms which have been developed in the context of Codmine project In theassociation rule hiding context two different heuristic-based approaches havebeen proposed one is based on data deletion according to which the originaldatabase is modified by deleting the data values the other is based on datafuzzification thus hiding the sensitive information by inserting unknown valuesto the database Before giving more details concerning the proposed approacheswe briefly recall the problem formulation of association rule hiding

32 Problem Formulation

Let D be a transactional database and I = i1 in be a set of items whichrepresents the set of products that can be sold by the database owner Eachtransaction can be considered as subset of items in I We assume that the itemsin a transaction or an itemset are sorted in lexicographic order Accordingto a bitmap notation each transaction t in the database D is represented asa triple lt TID values of items size gt where TID is the identifier of thetransaction t values of items is a list of values one value for each item in Iassociated with transaction t An item is supported by a transaction t if itsvalue in the values of items is 1 and it is not supported by t if its value inthe values of items is 0 Size is the number of 1 values which appear in thevalues of items that is the number of items supported by the transaction Anassociation rule is an implication of the form X rArr Y between disjoint itemsets X

and Y in I Each rule is assigned both to a support and to a confidence valueThe first one represents the probability to find in the database transactionscontaining all the items in X cup Y whereas the confidence is the probabilityto find transactions containing all the items in X cup Y once we know that theycontain X Note that while the support is a measure of a frequency of a rule theconfidence is a measure of the strength of the relation between the antecedentX and the consequent Y of the rule Association rule mining process consistsof two steps 1) the identification of all the frequent itemsets that is all theitemsets whose supports are bigger than a pre-determined minimum supportthreshold min supp 2) the generation of strong association rules from thefrequent itemsets that is those frequent rules whose confidence values are biggerthan a minimum confidence threshold min conf Along with the confidenceand the support a sensitivity level is assigned only to both frequent and strongrules If a strong and frequent rule is above a certain sensitivity level the hidingprocess should be applied in such a way that either the frequency or the strengthof the rule will be reduced below the min supp and the min conf correspondingly

33 RULE HIDING ALGORITHMS BASED ON DATA DELETION 21

The problem of association rule hiding can be stated as follows given a databaseD a set R of relevant rules that are mined from D and a subset Rh of R wewant to transform D into a database Dprime in such a way that the rules in R canstill be mined except for the rules in Rh

33 Rule hiding algorithms based on data dele-tion

Rule hiding approach based on data deletion operates by deleting the datavalues in order to reduce either the support or the confidence of the sensi-tive rule below the corresponding minimum support or confidence thresholdsMST or MCT which are fixed by the user in such a way that the changesintroduced in the database are minimal The decrease in the support of a ruleX rArr Y is done by decreasing the support of its generating itemset X cup Y that is the support of either the rule antecedent X or the rule consequent Y through transactions that fully support the rule The decrease in the confidence

Conf(X rArr Y ) = Supp(XrArrY )Supp(X) of the rule X rArr Y instead can be accomplished

1) by decreasing the support of the rule consequent Y through transactionsthat support both X and Y 2) by increasing the support of the rule antecedentthrough transactions that do not fully support Y and partially support X Fivedata deletion algorithms have been developed for privacy preserving data min-ing whose pseudocodes are presented in Figures 31 32 and 33 Algorithm1 decreases the confidence of a sensitive rule below the min conf threshold byincreasing the support of the rule antecedent while keeping the rule supportfixed Algorithms 2 reduces the rule support by decreasing the frequency of theconsequent through transactions that support the rule Algorithm 3 operates asthe previous one with the only exception that the choice of the item to removeis done according to a minimum impact criterion Finally Algorithms 4 and 5decrease the frequency of large itemsets from which sensitive rules can be mined

34 Rule hiding algorithms based on data fuzzi-fication

According to the data fuzzification approach a sequence of symbols in the newalphabet of an item 01 is associated with each transaction where one sym-bol is associated with each item in the set of items I As before the ith valuein the list of values is 1 if the transaction supports the ith item the value is 0otherwise The novelty of this approach is the insertion of an uncertainty sym-bol a question mark in a given position of the list of values which means thatthere is no information on whether the transaction supports the correspondingitem In this case the confidence and the support of an association rule maybe not unequivocally determined but they can range between a minimum anda maximum level The minimum support of an itemset is defined as the per-

22 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the number |D| of transactions in D a set Rh of rules to hide the min conf thresholdOUTPUT the database D transformed so that the rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 T prime

lr= t isin D | t does not fully supports rr and partially supports lr

foreach transaction t in T prime

lrdo

2 tnum items = |I| minus |lr cap tvalues of items|

3 Sort(T prime

lr) in descending order of number of items of lr contained

Repeat until(conf(r) lt min conf)

4 t = T prime

lr[1]

5 set all ones(tvalues of items lr)6 Nlr = Nlr + 17 conf(r) = NrNlr

8 Remove t from T prime

lr9 Remove r from Rh

End

Figure 31 Algorithm 1 for Rule Hiding by Confidence Decrease

34 RULE HIDING ALGORITHMS BASED ON DATA FUZZIFICATION23

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(rr)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 supp(r) = Nr|D|9 conf(r) = NrNlr

10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so that therules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in decreasing order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(r)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 Recompute Nlr

9 supp(r) = Nr|D|10 conf(r) = NrNlr

11 Remove t from Tr

12 Remove r from Rh

End

Figure 32 Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset while the maximumsupport represents the percentage of transactions that support or could supportthe itemset The minimum confidence of a rule is obviously the minimum levelof confidence that the rule can assume based on the support value and similarly

for the maximum confidence Given a rule r minconf(r) = minsup(r)lowast100maxsup(lr) and

maxconf(r) = maxsup(r)lowast100minsup(lr) where lr denotes the rule antecedent

Considering the support interval and the minimum support threshold MST we have the following cases for an itemset A

bull A is hidden when maxsup(A) is smaller than MST

bull A is visible with an uncertainty level when minsup(A) le MST lemaxsup(A)

bull A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confi-dence threshold (MCT )

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

bull The Cost of the Impact Item Set computation is less immediate to formu-late However the computation of the sanitization step is linear (O(bSup(Rh)c))Moreover the constraint exploration in the worst case has a complex-ity in O(NC) where NC is the total number of constraints containedin the Asset This operation is repeated for each item in the rule Rh

and then it takes |Items|O(NC) Finally in order to sort the Item-set we can use the MergeSort algorithm and then the complexity is|itemset|O(lg(|itemset|))

bull The sanitization has a complexity in O(N) where N is the number oftransaction modified in order to hide the rule

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hidden the target database DOUTPUT the Sanitized Database

1Begin

2 Zitem=DQDB Zero Impact(AssetRh)3 if (Zitem 6= empty) then item=random sel(Zitem)4 else5 6 Impact set=Items Impact rank(AssetRhSup(Rh)Sup(ant(Rh)) Threshold)7 item=best(Impact set)8 9 Tr=t isin D|tfully support Rh10While (RhConf gtThreshold) do11sanitize(Tritem)12End

Figure 39 The Distortion Based Sanitization Algorithm

38 Data Quality Based Blocking Algorithm

In the Blocking based algorithms the idea is to substitute the value of an itemsupporting the rule we want to hide with a meaningless symbol In this way weintroduce an uncertainty related to the question ldquoHow to interpret this valuerdquoThe final effect is then the same of the Distortion Based Algorithm The strategywe propose here as showed in Figure 310 is very similar to the one presentedin the previous section the only difference is how to compute the Impact on theData Quality For this reason the Zero Impact Algorithm is exactly the samepresented before for the DQDB Algorithm However some adjustments arenecessary in order to implement the Impact Item Rank Algorithm The majorsmodifications are related to the Number of Step Evaluation (that is of coursedifferent) and the Data Quality evaluation In fact in this case it does not make

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact13(Completeness-13

Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value to rdquo13

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 310 Flow Chart of BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for theDBDQ Algorithm The only relevant difference is the insertion of the ldquordquo valueinstead of the 0 value From the complexity point of view the complexity is thesame of the complexity of the previous algorithm

39 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on theDQ of the database is reduced However we note an interesting phenomenonIn what follows we give a description of this phenomena an interpretation anda first solution

Consider a DataBase D to be sanitized D contains a set of four sensitiverules In order to protect these rules we normally execute the following steps

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)

39 THE DATA QUALITY STRATEGY (SECOND PART) 33

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6    For each item in Rh do
7    {
8      if (item ∈ ant(Rh)) then N = Nstep_ant
9      else N = Nstep_post
10     For each AIS ∈ Asset do
11     {
12       node = recover_item(AIS, item)
13       Complet_Cost = node.Complet_Weight * N
14       Constr_Cost = Constr_surf(N, node)
15       item.impact = item.impact + (Complet_Cost * AIS.Complet_weight)
                                   + (Constr_Cost * AIS.Constr_weight)
16     }
17   }
18   sort_by_impact(Items)
19   Return(Items)
   End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm
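As a rough rendering of Figure 3.11, the sketch below assumes a data layout not specified in the report: the Asset is a list of AIS records carrying completeness and consistency weights, each exposing per-item nodes with a completeness weight and a constraint list. Here constr_surf is a deliberately naive stand-in for the constraint-surface estimate, and n_steps plays the role of the count-step functions.

    def constr_surf(n, node):
        # naive stand-in: assume each of the n blocked values can violate
        # every constraint attached to the node
        return n * len(node.get("constraints", ()))

    def rank_items_by_impact(rule_items, asset, n_steps):
        """Order candidate items by estimated (Completeness, Consistency) damage."""
        impact = {}
        for item in rule_items:
            cost = 0.0
            for ais in asset:
                node = ais["items"].get(item)
                if node is None:        # item absent from this AIS: no damage here
                    continue
                complet_cost = node["complet_weight"] * n_steps(item)
                constr_cost = constr_surf(n_steps(item), node)
                cost += (complet_cost * ais["complet_weight"]
                         + constr_cost * ais["constr_weight"])
            impact[item] = cost
        return sorted(rule_items, key=impact.get)   # lowest-impact item first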


What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common with Ra.

For this reason, we propose here an improvement of our strategy that addresses (but does not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable ones to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set as soon as they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Rules Zero Impact Hiding

• Multi Rules Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms for each block, giving at the end a more general algorithm combining them.


3.10 Pre-Hiding Phase

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction.maxsize; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ {r}
11       }
12     Rsup = Rsup − r
13   }
14   sort(item)
15   Return(item)
   End

Figure 3.12: The Item Occurrences Classification Algorithm
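A compact Python rendering of this counting step, with rules again represented as (antecedent, consequent) pairs of frozensets (a choice made for the example, not prescribed by the report):

    from collections import defaultdict

    def classify_item_occurrences(sensitive_rules, items):
        """For every item, count how many sensitive rules contain it and
        remember which ones (Figure 3.12); most-shared items come first."""
        count = defaultdict(int)
        supported = defaultdict(set)
        for rid, (ant, cons) in enumerate(sensitive_rules):
            for item in items:
                if item in ant | cons:
                    count[item] += 1
                    supported[item].add(rid)
        ordered = sorted(items, key=lambda i: count[i], reverse=True)
        return ordered, count, supported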

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11       {
12         if (item ∉ Asset) then Zitem = Zitem + item
13         Isup = Isup − item
14       }
15     }
16     Rsup = Rsup − r
17   }
18   return(Zitem)
   End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding itself. More in detail, in this phase the items are classified according to the number of rules they support, and the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether each item in the item list supports the rule. If this is the case, a counter associated with the item is incremented, and an identifier of the rule is inserted into the list of the rules supported by that item; this information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches, in a burst, for the Zero Impact Items of all the sensitive rules.
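Under the same toy representation, the burst search for zero-impact items reduces to a set difference against the set of items mentioned anywhere in the Asset (asset_items below, assumed precomputed):

    def multi_rule_zero_impact(sensitive_rules, asset_items):
        """Items that occur in some sensitive rule but in no DMG of the Asset
        (Figure 3.13): modifying them costs nothing in DQ terms."""
        rule_items = set()
        for ant, cons in sensitive_rules:
            rule_items |= ant | cons
        return rule_items - asset_items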

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by each of these items.

• The items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this information, we are then able to start the sanitization process with those rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present here the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive Rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the Database D
OUTPUT: a partially Sanitized Database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
14       {
15         Rss = Rss − r
16         Rs = Rs − r
17       }
       }
     }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
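A Python sketch of this burst hiding loop, restating the support and confidence helpers so the fragment is self-contained; the blocked_value parameter selects between the distortion (0) and blocking ("?") variants:

    def support(db, itemset):
        # with blocked "?" values this counts certain support only,
        # i.e. the minimum support of the blocking variant
        return sum(all(t.get(i) == 1 for i in itemset) for t in db) / len(db)

    def confidence(db, rule):
        ant, cons = rule
        s = support(db, ant)
        return support(db, ant | cons) / s if s else 0.0

    def multi_rule_zero_impact_hiding(db, rules, zitems_sorted, min_sup, min_conf,
                                      blocked_value=0):
        """Spend each zero-impact item on every sensitive rule it supports,
        dropping rules from the working set as soon as they are hidden."""
        remaining = set(range(len(rules)))
        for item in zitems_sorted:
            rss = {k for k in remaining if item in rules[k][0] | rules[k][1]}
            tss = [t for t in db
                   if any(all(t.get(i) == 1 for i in rules[k][0] | rules[k][1])
                          for k in rss)]
            while tss and rss:
                tss.pop()[item] = blocked_value   # 0 (distortion) or "?" (blocking)
                hidden = {k for k in rss
                          if support(db, rules[k][0] | rules[k][1]) < min_sup
                          or confidence(db, rules[k]) < min_conf}
                rss -= hidden
                remaining -= hidden
        return remaining   # indices of the rules still to be hidden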

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is then associated with the item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if other rules remain to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive Rules set Rs to be hidden, the Thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.rule_cost = item.rule_cost + ComputeCost(node, N)
12       }
13     }
14     item.maxcost = maxcost(item)
15   }
16   Sort Iml by max cost and number of rules supported
17   While (Rs ≠ ∅) do
18   {
19     Rss = {r ∈ Rs | Iml[1] ∈ r}
20     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21     while (Tss ≠ ∅) and (Rss ≠ ∅) do
22     {
23       select a transaction t from Tss
24       Tss = Tss − t
25       Sanitize(t, Iml[1])
26       Recompute_Sup_Conf(Rss)
27       forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
28       {
29         Rss = Rss − r
30         Rs = Rs − r
31       }
32     }
33     Iml = Iml − Iml[1]
34   }
   End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive Rules set Rs, the Database D, the Asset, the thresholds
OUTPUT: a Sanitized Database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
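Putting the pieces together, the driver of Figure 3.16 can be sketched by composing the helpers from the previous fragments; multi_rule_low_impact_hiding is assumed to implement Figure 3.15 with the same signature as its zero-impact counterpart:

    def general_sanitize(db, sensitive_rules, all_items, asset_items,
                         min_sup, min_conf, multi_rule_low_impact_hiding):
        """Figure 3.16: classify occurrences, spend zero-impact items first,
        then fall back to the low-impact ranking for whatever remains."""
        ordered, count, supported = classify_item_occurrences(sensitive_rules,
                                                              all_items)
        zitems = multi_rule_zero_impact(sensitive_rules, asset_items)
        zitems_sorted = sorted(zitems, key=lambda i: count[i], reverse=True)
        remaining = multi_rule_zero_impact_hiding(db, sensitive_rules,
                                                  zitems_sorted, min_sup, min_conf)
        if remaining:
            multi_rule_low_impact_hiding(db,
                                         [sensitive_rules[k] for k in remaining],
                                         min_sup, min_conf)
        return db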

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data, and then indirectly to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases: a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955



[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support in Proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988


[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997


[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G McLachlan and K Basford Mixture Models Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 (July and October) 1948 pp 379-423 and 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases: characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK: An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure limitation Carnegie Mellon University Available at httpwwwnissorgdgiiTRThesisTrottini-finalpdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents; you can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 22: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

20 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

about these algorithms and finally we present our new approach to associationrule PPDM algorithms

31 ConMine Algorithms

We present here a short description of the privacy preserving data mining algo-rithms which have been developed in the context of Codmine project In theassociation rule hiding context two different heuristic-based approaches havebeen proposed one is based on data deletion according to which the originaldatabase is modified by deleting the data values the other is based on datafuzzification thus hiding the sensitive information by inserting unknown valuesto the database Before giving more details concerning the proposed approacheswe briefly recall the problem formulation of association rule hiding

32 Problem Formulation

Let D be a transactional database and I = i1 in be a set of items whichrepresents the set of products that can be sold by the database owner Eachtransaction can be considered as subset of items in I We assume that the itemsin a transaction or an itemset are sorted in lexicographic order Accordingto a bitmap notation each transaction t in the database D is represented asa triple lt TID values of items size gt where TID is the identifier of thetransaction t values of items is a list of values one value for each item in Iassociated with transaction t An item is supported by a transaction t if itsvalue in the values of items is 1 and it is not supported by t if its value inthe values of items is 0 Size is the number of 1 values which appear in thevalues of items that is the number of items supported by the transaction Anassociation rule is an implication of the form X rArr Y between disjoint itemsets X

and Y in I Each rule is assigned both to a support and to a confidence valueThe first one represents the probability to find in the database transactionscontaining all the items in X cup Y whereas the confidence is the probabilityto find transactions containing all the items in X cup Y once we know that theycontain X Note that while the support is a measure of a frequency of a rule theconfidence is a measure of the strength of the relation between the antecedentX and the consequent Y of the rule Association rule mining process consistsof two steps 1) the identification of all the frequent itemsets that is all theitemsets whose supports are bigger than a pre-determined minimum supportthreshold min supp 2) the generation of strong association rules from thefrequent itemsets that is those frequent rules whose confidence values are biggerthan a minimum confidence threshold min conf Along with the confidenceand the support a sensitivity level is assigned only to both frequent and strongrules If a strong and frequent rule is above a certain sensitivity level the hidingprocess should be applied in such a way that either the frequency or the strengthof the rule will be reduced below the min supp and the min conf correspondingly

33 RULE HIDING ALGORITHMS BASED ON DATA DELETION 21

The problem of association rule hiding can be stated as follows given a databaseD a set R of relevant rules that are mined from D and a subset Rh of R wewant to transform D into a database Dprime in such a way that the rules in R canstill be mined except for the rules in Rh

33 Rule hiding algorithms based on data dele-tion

Rule hiding approach based on data deletion operates by deleting the datavalues in order to reduce either the support or the confidence of the sensi-tive rule below the corresponding minimum support or confidence thresholdsMST or MCT which are fixed by the user in such a way that the changesintroduced in the database are minimal The decrease in the support of a ruleX rArr Y is done by decreasing the support of its generating itemset X cup Y that is the support of either the rule antecedent X or the rule consequent Y through transactions that fully support the rule The decrease in the confidence

Conf(X rArr Y ) = Supp(XrArrY )Supp(X) of the rule X rArr Y instead can be accomplished

1) by decreasing the support of the rule consequent Y through transactionsthat support both X and Y 2) by increasing the support of the rule antecedentthrough transactions that do not fully support Y and partially support X Fivedata deletion algorithms have been developed for privacy preserving data min-ing whose pseudocodes are presented in Figures 31 32 and 33 Algorithm1 decreases the confidence of a sensitive rule below the min conf threshold byincreasing the support of the rule antecedent while keeping the rule supportfixed Algorithms 2 reduces the rule support by decreasing the frequency of theconsequent through transactions that support the rule Algorithm 3 operates asthe previous one with the only exception that the choice of the item to removeis done according to a minimum impact criterion Finally Algorithms 4 and 5decrease the frequency of large itemsets from which sensitive rules can be mined

34 Rule hiding algorithms based on data fuzzi-fication

According to the data fuzzification approach a sequence of symbols in the newalphabet of an item 01 is associated with each transaction where one sym-bol is associated with each item in the set of items I As before the ith valuein the list of values is 1 if the transaction supports the ith item the value is 0otherwise The novelty of this approach is the insertion of an uncertainty sym-bol a question mark in a given position of the list of values which means thatthere is no information on whether the transaction supports the correspondingitem In this case the confidence and the support of an association rule maybe not unequivocally determined but they can range between a minimum anda maximum level The minimum support of an itemset is defined as the per-

22 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the number |D| of transactions in D a set Rh of rules to hide the min conf thresholdOUTPUT the database D transformed so that the rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 T prime

lr= t isin D | t does not fully supports rr and partially supports lr

foreach transaction t in T prime

lrdo

2 tnum items = |I| minus |lr cap tvalues of items|

3 Sort(T prime

lr) in descending order of number of items of lr contained

Repeat until(conf(r) lt min conf)

4 t = T prime

lr[1]

5 set all ones(tvalues of items lr)6 Nlr = Nlr + 17 conf(r) = NrNlr

8 Remove t from T prime

lr9 Remove r from Rh

End

Figure 31 Algorithm 1 for Rule Hiding by Confidence Decrease

34 RULE HIDING ALGORITHMS BASED ON DATA FUZZIFICATION23

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(rr)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 supp(r) = Nr|D|9 conf(r) = NrNlr

10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so that therules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in decreasing order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(r)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 Recompute Nlr

9 supp(r) = Nr|D|10 conf(r) = NrNlr

11 Remove t from Tr

12 Remove r from Rh

End

Figure 32 Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset while the maximumsupport represents the percentage of transactions that support or could supportthe itemset The minimum confidence of a rule is obviously the minimum levelof confidence that the rule can assume based on the support value and similarly

for the maximum confidence Given a rule r minconf(r) = minsup(r)lowast100maxsup(lr) and

maxconf(r) = maxsup(r)lowast100minsup(lr) where lr denotes the rule antecedent

Considering the support interval and the minimum support threshold MST we have the following cases for an itemset A

bull A is hidden when maxsup(A) is smaller than MST

bull A is visible with an uncertainty level when minsup(A) le MST lemaxsup(A)

bull A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confi-dence threshold (MCT )

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear, O(⌊Sup(Rh)⌋). Moreover, the constraint exploration in the worst case has a complexity O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| ∗ O(NC). Finally, in order to sort the itemset, we can use the MergeSort algorithm, and then the complexity is |itemset| ∗ O(lg(|itemset|)).

• The sanitization has a complexity O(N), where N is the number of transactions modified in order to hide the rule.
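The step computation of Figure 3.8, which drives the transaction counts used above, can be sketched in Python as follows; it assumes integer support counts and mirrors the two while loops of the figure (steps_needed is an illustrative name):

def steps_needed(sup_rule, sup_ant, threshold):
    # item in the antecedent: each modified transaction stops supporting
    # both the rule and its antecedent
    n_ant = 0
    while (sup_rule - n_ant > 0 and
           (sup_rule - n_ant) / (sup_ant - n_ant) > threshold):
        n_ant += 1
    # item in the consequent: only the rule support decreases,
    # so the confidence falls faster
    n_post = 0
    while (sup_rule - n_post) / sup_ant > threshold:
        n_post += 1
    return n_ant, n_post

# steps_needed(8, 10, 0.6) -> (5, 2)

This is why, other DQ effects being equal, consequent items are the cheaper choice: fewer transactions need to be touched.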

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

Begin
  Zitem = DQDB_Zero_Impact(Asset, Rh)
  if (Zitem ≠ ∅) then item = random_sel(Zitem)
  else
  {
    Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
    item = best(Impact_set)
  }
  Tr = {t ∈ D | t fully supports Rh}
  While (Rh.Conf > Threshold) do
    sanitize(Tr, item)
End

Figure 3.9: The Distortion Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the distortion-based algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Figure 3.10 shows the flow chart of the BBDB Algorithm: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, selection of an item from the zero-impact set; otherwise, for each item in the selected rule, inspection of the Asset and evaluation of the DQ impact (Completeness, Consistency), ranking of the items by quality impact and selection of the best item; then a transaction fully supporting the sensitive rule is selected and its value converted to "?", until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
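As an illustration, here is a minimal sketch of the blocking step and of the resulting confidence interval, assuming the same dict-based transactions as before, with '?' as the uncertainty symbol (names illustrative):

def min_max_support(db, items):
    # minimum support: transactions that certainly support the itemset;
    # maximum support: transactions that support or could support it
    min_s = sum(1 for t in db if all(t.get(i) == 1 for i in items))
    max_s = sum(1 for t in db if all(t.get(i) in (1, '?') for i in items))
    return min_s, max_s

def block_rule(db, rule, item, threshold):
    ant, cons = rule
    while True:
        min_rule, _ = min_max_support(db, ant | cons)
        _, max_ant = min_max_support(db, ant)
        # minimum confidence = minsup(rule) / maxsup(antecedent)
        if max_ant == 0 or min_rule / max_ant < threshold:
            return db
        t = next((t for t in db
                  if all(t.get(i) == 1 for i in ant | cons)), None)
        if t is None:  # no fully supporting transaction left
            return db
        t[item] = '?'  # blocking: 1 -> '?'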

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we observed an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

Begin
  Qitem = ∅
  Confa = Sup(Rh) / Sup(ant(Rh))
  Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
  Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
  For each item ∈ Rh do
  {
    if (item ∈ ant(Rh)) then N = Nstep_ant
    else N = Nstep_post
    For each AIS ∈ Asset do
    {
      node = recover_item(AIS, item)
      Complet_Cost = node.Complet_weight ∗ N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost ∗ AIS.Complet_weight) + (Constr_Cost ∗ AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the largest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a single burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set as soon as they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. In the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

Begin
  Rsup = Rs
  while (Rsup ≠ ∅) do
  {
    select r from Rsup
    for (i = 0; i ≤ transaction_max_size; i++) do
      if (item[i] ∈ r) then
      {
        item[i].count++
        item[i].list = item[i].list ∪ r
      }
    Rsup = Rsup − r
  }
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = ∅
  while (Rsup ≠ ∅) and (Isup ≠ ∅) do
  {
    select r from Rsup
    for each (item ∈ r) do
    {
      if (item ∈ Isup) then
      {
        if (item ∉ Asset) then Zitem = Zitem + item
        Isup = Isup − item
      }
    }
    Rsup = Rsup − r
  }
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding itself. More in detail, in this phase the items are classified according to the number of rules they support; the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether each item in the item list supports the rule. If this is the case, a counter associated with the item is incremented, and an identifier of the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| ∗ |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a single burst for the Zero Impact Items of all the sensitive rules.
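A compact Python sketch of this classification follows, assuming rules are pairs of item sets (here the supported rules themselves are stored instead of the identifiers used in Figure 3.12; classify_items is an illustrative name):

from collections import defaultdict

def classify_items(sensitive_rules):
    # count, for each item, how many sensitive rules contain it,
    # and remember which rules those are
    count = defaultdict(int)
    supported = defaultdict(list)
    for ant, cons in sensitive_rules:
        for item in ant | cons:
            count[item] += 1
            supported[item].append((ant, cons))
    # items supporting the most rules first
    ranking = sorted(count, key=count.get, reverse=True)
    return ranking, supported

Note that this pass only touches the items actually present in each rule, so in practice it stays well below the |sensitive set| ∗ |Item List| worst case.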

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by each of them;

• the items supporting a rule of the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this information, we are able to start the sanitization process with the rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in this set, we build the set of the sensitive rules it supports (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss, and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence values for the rules contained in the Rss set, and if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set (a minimal sketch of these steps is given below).
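The following minimal Python sketch illustrates steps 1-5 in the distortion variant; for brevity only the confidence threshold is checked, and the supports helper from the Section 3.7 sketch is repeated inline (all names are illustrative):

def multi_rule_zero_impact(db, sensitive, izm, min_conf):
    # `sensitive` is a mutable list of (antecedent, consequent) pairs;
    # `izm` is the ordered list of zero-impact items
    def supports(t, items):
        return all(t.get(i, 0) == 1 for i in items)
    def conf(rule):
        ant, cons = rule
        n_ant = sum(1 for t in db if supports(t, ant))
        n_rule = sum(1 for t in db if supports(t, ant | cons))
        return n_rule / n_ant if n_ant else 0.0
    for item in izm:
        rss = [r for r in sensitive if item in r[0] | r[1]]
        tss = [t for t in db if any(supports(t, r[0] | r[1]) for r in rss)]
        while tss and rss:
            t = tss.pop()
            t[item] = 0  # distortion; use '?' for the blocking variant
            for r in [r for r in rss if conf(r) < min_conf]:
                rss.remove(r)        # hidden: drop it locally...
                sensitive.remove(r)  # ...and from the sensitive set
    return sensitive  # rules left here need the low-impact phase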

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

Begin
  Izm = {item ∈ Item_list | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do
  {
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < min_sup) Or (r.conf < min_conf) do
      {
        Rss = Rss − r
        Rs = Rs − r
      }
    }
  }
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which a database must be sanitized whose sensitive rules set is not (entirely) supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient; or, if we prefer to maintain as good a DQ as possible, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is then associated with the item.

3. The items are sorted by Max Impact and by number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules it supports (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss, and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set, and if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero-Impact set and the Item-Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.
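The ordering of points 2-3 can be sketched as follows, where dq_impact(item, rule) stands for the Asset inspection of Figure 3.11 (or 3.8) and is assumed to be given, and supported_rules maps each item to the rules it supports, as produced by the classification sketch of Section 3.10:

def rank_low_impact(supported_rules, dq_impact):
    # pair each item with its maximum estimated DQ impact, then sort by
    # (impact ascending, number of supported rules descending)
    ranked = []
    for item, rules in supported_rules.items():
        max_cost = max(dq_impact(item, r) for r in rules)
        ranked.append((item, max_cost, len(rules)))
    ranked.sort(key=lambda x: (x[1], -x[2]))
    return [item for item, _, _ in ranked]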

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ item_list | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
  {
    for each rule ∈ item.list do
    {
      N = compute_step(item, thresholds)
      For each AIS ∈ Asset do
      {
        node = recover_item(AIS, item)
        rule.cost = rule.cost + Compute_Cost(node, N)
      }
    }
    item.max_cost = max_cost(item)
  }
  Sort Iml by max cost and number of rules supported
  While (Rs ≠ ∅) do
  {
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < min_sup) Or (r.conf < min_conf) do
      {
        Rss = Rss − r
        Rs = Rs − r
      }
    }
    Iml = Iml − Iml[1]
  }
End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Item_list)
  Zitem = Multi_Rule_Zero_Imp(Rs, Item_list)
  if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
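Putting the pieces together, the general algorithm of Figure 3.16 corresponds to the following hypothetical composition of the sketches given above (classify_items from Section 3.10, multi_rule_zero_impact from Section 3.11, rank_low_impact from this section; dq_impact is an assumed callable). It exploits the fact that the hiding loop of Figure 3.15 has the same structure as that of Figure 3.14, only driven by a different item ordering:

def dq_based_sanitization(db, sensitive, asset_items, min_conf, dq_impact):
    ranking, supported = classify_items(sensitive)        # Figure 3.12
    izm = [i for i in ranking if i not in asset_items]    # Figure 3.13
    multi_rule_zero_impact(db, sensitive, izm, min_conf)  # Figure 3.14
    if sensitive:
        # low-impact pass (Figure 3.15): same hiding loop, reordered items
        _, supported = classify_items(sensitive)
        iml = rank_low_impact(supported, dq_impact)
        multi_rule_zero_impact(db, sensitive, iml, min_conf)
    return db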

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, 31(2), pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, 22, pp. 86-105, 1955.



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8(6) (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6(1) (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75(370) (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5(3), pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, 23(7), pp. 423-437, 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes and L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38(3) (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., 4(2), pp. 43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11(7), pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system: combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, 40(1), pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, 41(2), pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, 4(4), pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, 3(4), pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, 41(2), pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf, 2003.

[97] University of Milan, Computer Technology Institute and Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, 39(11), pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN, Joint Research Centre, Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008. 56 pp., 21 x 29.7 cm. EUR, Scientific and Technical Research series, ISSN 1018-5593.

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: our priced publications are available from the EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents; you can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 23: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

33 RULE HIDING ALGORITHMS BASED ON DATA DELETION 21

The problem of association rule hiding can be stated as follows given a databaseD a set R of relevant rules that are mined from D and a subset Rh of R wewant to transform D into a database Dprime in such a way that the rules in R canstill be mined except for the rules in Rh

33 Rule hiding algorithms based on data dele-tion

Rule hiding approach based on data deletion operates by deleting the datavalues in order to reduce either the support or the confidence of the sensi-tive rule below the corresponding minimum support or confidence thresholdsMST or MCT which are fixed by the user in such a way that the changesintroduced in the database are minimal The decrease in the support of a ruleX rArr Y is done by decreasing the support of its generating itemset X cup Y that is the support of either the rule antecedent X or the rule consequent Y through transactions that fully support the rule The decrease in the confidence

Conf(X rArr Y ) = Supp(XrArrY )Supp(X) of the rule X rArr Y instead can be accomplished

1) by decreasing the support of the rule consequent Y through transactionsthat support both X and Y 2) by increasing the support of the rule antecedentthrough transactions that do not fully support Y and partially support X Fivedata deletion algorithms have been developed for privacy preserving data min-ing whose pseudocodes are presented in Figures 31 32 and 33 Algorithm1 decreases the confidence of a sensitive rule below the min conf threshold byincreasing the support of the rule antecedent while keeping the rule supportfixed Algorithms 2 reduces the rule support by decreasing the frequency of theconsequent through transactions that support the rule Algorithm 3 operates asthe previous one with the only exception that the choice of the item to removeis done according to a minimum impact criterion Finally Algorithms 4 and 5decrease the frequency of large itemsets from which sensitive rules can be mined

34 Rule hiding algorithms based on data fuzzi-fication

According to the data fuzzification approach a sequence of symbols in the newalphabet of an item 01 is associated with each transaction where one sym-bol is associated with each item in the set of items I As before the ith valuein the list of values is 1 if the transaction supports the ith item the value is 0otherwise The novelty of this approach is the insertion of an uncertainty sym-bol a question mark in a given position of the list of values which means thatthere is no information on whether the transaction supports the correspondingitem In this case the confidence and the support of an association rule maybe not unequivocally determined but they can range between a minimum anda maximum level The minimum support of an itemset is defined as the per-

22 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the number |D| of transactions in D a set Rh of rules to hide the min conf thresholdOUTPUT the database D transformed so that the rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 T prime

lr= t isin D | t does not fully supports rr and partially supports lr

foreach transaction t in T prime

lrdo

2 tnum items = |I| minus |lr cap tvalues of items|

3 Sort(T prime

lr) in descending order of number of items of lr contained

Repeat until(conf(r) lt min conf)

4 t = T prime

lr[1]

5 set all ones(tvalues of items lr)6 Nlr = Nlr + 17 conf(r) = NrNlr

8 Remove t from T prime

lr9 Remove r from Rh

End

Figure 31 Algorithm 1 for Rule Hiding by Confidence Decrease

34 RULE HIDING ALGORITHMS BASED ON DATA FUZZIFICATION23

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(rr)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 supp(r) = Nr|D|9 conf(r) = NrNlr

10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so that therules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in decreasing order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(r)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 Recompute Nlr

9 supp(r) = Nr|D|10 conf(r) = NrNlr

11 Remove t from Tr

12 Remove r from Rh

End

Figure 32 Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset while the maximumsupport represents the percentage of transactions that support or could supportthe itemset The minimum confidence of a rule is obviously the minimum levelof confidence that the rule can assume based on the support value and similarly

for the maximum confidence Given a rule r minconf(r) = minsup(r)lowast100maxsup(lr) and

maxconf(r) = maxsup(r)lowast100minsup(lr) where lr denotes the rule antecedent

Considering the support interval and the minimum support threshold MST we have the following cases for an itemset A

bull A is hidden when maxsup(A) is smaller than MST

bull A is visible with an uncertainty level when minsup(A) le MST lemaxsup(A)

bull A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confi-dence threshold (MCT )

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero-impact items discovered

Begin
  Zitem = ∅
  Rsup = Rh
  Isup = list of the items contained in Rsup
  While (Isup ≠ ∅) {
    res = 0
    Select item from Isup
    res = Search(Asset, item)
    if (res == 0) then Zitem = Zitem + item
    Isup = Isup - item
  }
  return(Zitem)
End

Figure 3.7: Zero Impact Algorithm
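In plainer terms, the algorithm reduces to a set difference. A hypothetical one-function transliteration (ours; names are invented for the example):

def zero_impact_items(asset_items: set, rule_items: set) -> set:
    # An item is zero-impact when it appears in no DMG of the Asset schema,
    # i.e., Search(Asset, item) returns 0 in Figure 3.7.
    return {item for item in rule_items if item not in asset_items}

print(zero_impact_items({'age', 'diagnosis'}, {'age', 'magazine'}))  # {'magazine'}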

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

Begin
  Qitem = ∅
  Conf_a = sup(Rh) / sup(ant(Rh))
  Conf_p = Conf_a
  Nstep_if_ant = 0
  Nstep_if_post = 0
  While (Conf_a > Th) do {
    Nstep_if_ant++
    Conf_a = (sup(Rh) - Nstep_if_ant) / (sup(ant(Rh)) - Nstep_if_ant)
  }
  While (Conf_p > Th) do {
    Nstep_if_post++
    Conf_p = (sup(Rh) - Nstep_if_post) / sup(ant(Rh))
  }
  For each item in Rh do {
    if (item ∈ ant(Rh)) then N = Nstep_if_ant
    else N = Nstep_if_post
    For each AIS ∈ Asset do {
      node = recover_item(AIS, item)
      Accur_Cost = node.accuracy_Weight * N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (recall that ant(Rh) indicates the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items| * O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, and then the complexity is |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

Begin
  Zitem = DQDB_Zero_Impact(Asset, Rh)
  if (Zitem ≠ ∅) then item = random_sel(Zitem)
  else {
    Impact_set = Items_Impact_rank(Asset, Rh, sup(Rh), sup(ant(Rh)), Threshold)
    item = best(Impact_set)
  }
  Tr = {t ∈ D | t fully supports Rh}
  While (Rh.conf > Threshold) do
    sanitize(Tr, item)
End

Figure 3.9: The Distortion Based Sanitization Algorithm
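A compact, self-contained rendering of this sanitization loop (our sketch; the transaction layout and helper names are assumptions, not the report's code):

def distort(db, rule_items, ant_items, item, th):
    # Flip the chosen item from 1 to 0 in transactions fully supporting Rh
    # until conf(Rh) = sup(Rh)/sup(ant(Rh)) falls to the threshold or below.
    supporting = [t for t in db if all(t.get(i) == 1 for i in rule_items)]
    for t in supporting:
        sup_r = sum(all(x.get(i) == 1 for i in rule_items) for x in db)
        sup_a = sum(all(x.get(i) == 1 for i in ant_items) for x in db)
        if sup_a == 0 or sup_r / sup_a <= th:
            break
        t[item] = 0  # the distortion step: 1 -> 0

db = [{'a': 1, 'b': 1}, {'a': 1, 'b': 1}, {'a': 1, 'b': 0}, {'a': 1, 'b': 1}]
distort(db, {'a', 'b'}, {'a'}, 'b', 0.5)  # hide a -> b below confidence 0.5
print(db)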

3.8 Data Quality Based Blocking Algorithm

In the blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How to interpret this value?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps evaluation (which is of course different) and to the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.
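A minimal sketch of the blocking step itself (our illustration; the transaction representation is an assumption):

def block_item(transaction: dict, item: str) -> None:
    # Blocking replaces the value with the meaningless symbol '?'
    # instead of flipping it to 0 as the distortion algorithm does.
    if transaction.get(item) == 1:
        transaction[item] = '?'

t = {'bread': 1, 'milk': 1, 'beer': 0}
block_item(t, 'milk')
print(t)  # {'bread': 1, 'milk': '?', 'beer': 0}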

[Figure 3.10: Flow chart of the BBDB Algorithm — identical to Figure 3.6 except that the DQ impact is evaluated on the pair (Completeness, Consistency) and the selected item value is converted to "?" instead of 0.]

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized. D contains a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

Begin
  Qitem = ∅
  Conf_a = sup(Rh) / sup(ant(Rh))
  Nstep_ant = count_step_ant(Conf_a, Thresholds(Min, Max))
  Nstep_post = count_step_post(Conf_a, Thresholds(Min, Max))
  For each item in Rh do {
    if (item ∈ ant(Rh)) then N = Nstep_ant
    else N = Nstep_post
    For each AIS ∈ Asset do {
      node = recover_item(AIS, item)
      Complet_Cost = node.Complet_Weight * N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform in a burst the sanitization of all the rules. The strategy we adopt can be summarized as follows (a compact sketch of the resulting greedy selection is given after the list):

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized and removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.
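The list above amounts to a greedy selection. The sketch below is entirely illustrative (rule sets, impact values, and helper names are invented); it shows the intended ordering, preferring zero-impact items and, among candidates, those covering the most still-sensitive rules:

def hiding_order(rules, zero_impact, impact):
    # rules: rule_id -> set of items; impact: item -> estimated DQ damage.
    order = []
    while rules:
        candidates = {i for items in rules.values() for i in items}
        zi = candidates & zero_impact
        pool = zi if zi else candidates
        # lowest DQ damage first (0 for zero-impact items),
        # then the item covering the most remaining rules
        best = max(pool, key=lambda i: (-impact.get(i, 0),
                                        sum(i in it for it in rules.values())))
        order.append(best)
        # all rules containing the chosen item are sanitized in a burst
        rules = {r: it for r, it in rules.items() if best not in it}
    return order

rules = {'r1': {'a', 'b'}, 'r2': {'b', 'c'}, 'r3': {'c', 'd'}}
print(hiding_order(rules, {'b'}, {'a': 2, 'c': 1, 'd': 3}))  # ['b', 'c']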

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, we present in the remainder of this report the single algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

Begin
  Rsup = Rs
  while (Rsup ≠ ∅) do {
    select r from Rsup
    for (i = 0; i <= transaction.max_size; i++) do
      if (item[i] ∈ r) then {
        item[i].count++
        item[i].list = item[i].list ∪ {r}
      }
    Rsup = Rsup - r
  }
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = ∅
  while (Rsup ≠ ∅) and (Isup ≠ ∅) do {
    select r from Rsup
    for each (item ∈ r) do {
      if (item ∈ Isup) then {
        if (item ∉ Asset) then Zitem = Zitem + item
        Isup = Isup - item
      }
    }
    Rsup = Rsup - r
  }
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked with every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
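A compact illustration of this Pre-Hiding bookkeeping (the data and names are invented for the example):

from collections import defaultdict

# sensitive rules modeled as sets of items, as in Figure 3.12
sensitive_rules = {'r1': {'bread', 'milk'},
                   'r2': {'milk', 'beer'},
                   'r3': {'bread', 'beer', 'milk'}}

count = defaultdict(int)       # item -> number of sensitive rules containing it
supported = defaultdict(list)  # item -> identifiers of those rules
for rule_id, items in sensitive_rules.items():
    for item in items:
        count[item] += 1
        supported[item].append(rule_id)

ranking = sorted(count, key=count.get, reverse=True)
print(ranking[0], count[ranking[0]], supported[ranking[0]])
# milk 3 ['r1', 'r2', 'r3']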

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule ∈ sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss, and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence value of the rules contained in the Rss set, and if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do {
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do {
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
        Rss = Rss - r
        Rs = Rs - r
      }
    }
  }
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows (a sketch of the ranking step is given after the list):

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss, and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value of the rules contained in the Rss set, and if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.
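The ranking of steps 2-3 can be pictured as follows (a hedged sketch; the per-rule impact values are invented):

# item -> (estimated DQ impact per supported rule, number of supported rules)
items = {'a': ([0.2, 0.5], 2),
         'b': ([0.4], 1),
         'c': ([0.1, 0.3, 0.2], 3)}

# lowest maximum impact first; ties broken by more supported rules
ranked = sorted(items, key=lambda i: (max(items[i][0]), -items[i][1]))
print(ranked)  # ['c', 'b', 'a']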

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code for the blocking algorithm with the code for the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items; the target database D
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do {
    for each rule ∈ item.list do {
      N = compute_step(item, thresholds)
      For each AIS ∈ Asset do {
        node = recover_item(AIS, item)
        rule.cost = rule.cost + ComputeCost(node, N)
      }
    }
    item.maxcost = maxcost(item)
  }
  Sort Iml by max cost and number of rules supported
  While (Rs ≠ ∅) do {
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do {
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < min_sup) or (r.conf < min_conf) do {
        Rss = Rss - r
        Rs = Rs - r
      }
    }
    Iml = Iml - Iml[1]
  }
End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem; in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated DQ-based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann, Security-control methods for statistical databases: a comparative study, ACM Comput. Surv., 21(4), pp. 515-556, 1989, ACM Press.

[2] D. Agrawal and C. C. Aggarwal, On the Design and Quantification of Privacy Preserving Data Mining Algorithms, in Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001, ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules between Sets of Items in Large Databases, Proceedings of ACM SIGMOD, pp. 207-216, May 1993, ACM Press.

[4] R. Agrawal and R. Srikant, Privacy Preserving Data Mining, in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000, ACM Press.

[5] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, Morgan Kaufmann.

[6] G. M. Amdahl, Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities, AFIPS Conference Proceedings (April 1967), pp. 483-485, Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios, Disclosure Limitation of Sensitive Rules, in Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999, IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer, Modelling Data and Process Quality in Multi Input Multi Output Information Systems, Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel, An examination of information theory, Philosophy of Science, volume 22, pp. 86-105, 1955.



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza, A Framework for Evaluating Privacy Preserving Data Mining Algorithms, Data Mining and Knowledge Discovery Journal, 2005, Kluwer.

[11] E. Bertino and I. Nai Fovino, Information Driven Evaluation of Data Hiding Algorithms, 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005, Springer-Verlag.

[12] N. M. Blachman, The amount of information that y gives about X, IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968, IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur, Dynamic itemset counting and implication rules for market basket data, in Proc. of the ACM SIGMOD International Conference on Management of Data, 1997, ACM Press.

[15] L. Chang and I. S. Moskowitz, Parsimonious downgrading and decision trees applied to the inference problem, in Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998, ACM Press.

[16] P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and Results, Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996, IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu, Auditing and inference control in statistical databases, IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982, IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu, Statistical database design, ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981, ACM Press.

[20] L. H. Cox, Suppression methodology and statistical disclosure control, J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino, Hiding Association Rules by using Confidence and Support, in Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001, Springer-Verlag.

[22] D. Defays, An efficient algorithm for a complete link method, The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer, Inference control for statistical databases, Computer, 16(7), pp. 69-82, July 1983, IEEE Press.


[24] D. Denning, Secure statistical databases with random sample queries, ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning, Cryptography and Data Security, Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar, Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems, Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra, A Quantitative Comparison of Disclosure Control Methods for Microdata, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani, Beyond independence: Conditions for the optimality of the simple Bayesian classifier, Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996, Morgan Kaufmann.

[29] P. Drucker, Beyond the Information Revolution, The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes, Disclosure risks vs. data utility: The R-U confidentiality map, Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim, Privacy preserving data mining in vertically partitioned databases, in Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska, Speedup Versus Efficiency in Parallel Systems, IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423, IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar, Finding clusters of different sizes, shapes and densities in noisy high dimensional data, in Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996, AAAI Press.

[36] A. Evfimievski, Randomization in Privacy Preserving Data Mining, SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002, ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, Privacy Preserving Mining of Association Rules, in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere, The cascade-correlation learning architecture, Advances in Neural Information Processing Systems 2, pp. 524-532, Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands, The Feynman Lectures on Physics, v. I, Reading, Massachusetts, Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie, Parallelism in Random Access Machines, Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118, ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview, AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh, An application of statistical databases in manufacturing testing, IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985, IEEE Press.

[43] S. P. Ghosh, An application of statistical databases in manufacturing testing, in Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984, IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim, CURE: An efficient clustering algorithm for large databases, in Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998, ACM Press.

[45] S. Guha, R. Rastogi and K. Shim, ROCK: A robust clustering algorithm for categorical attributes, in Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999, IEEE Computer Society.

[46] J. Han and M. Kamber, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor, Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin, Mining frequent patterns without candidate generation, in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000, ACM Press.

[48] M. A. Hanson and R. L. Brekke, Workload management expert system - combining neural networks and rule-based programming in an operational application, in Proceedings, Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong, Algorithm AS136: A k-means clustering algorithm, Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim, An efficient approach to clustering large multimedia databases with noise, in Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998, AAAI Press.

[51] T. Hsu, C. Liau and D. Wang, A Logical Model for Privacy Protection, Lecture Notes in Computer Science, Volume 2200, Jan. 2001, pp. 110-124, Springer-Verlag.

[52] IBM Synthetic Data Generator, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton, Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, in Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002, IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar, CHAMELEON: A hierarchical clustering algorithm using dynamic modeling, Computer, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, 1990.

[56] W. Kent, Data and reality, North Holland, New York, 1978.

[57] S. L. Lauritzen, The EM algorithm for graphical association models with missing data, Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995, Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo, Data Mining Approaches for Intrusion Detection, in Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman, Data as resource: properties, implications and prescriptions, Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Journal of Cryptology, vol. 15, pp. 177-206, 2002, Springer-Verlag.

[61] R. M. Losee, A Discipline Independent Definition of Information, Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin, A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure, 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal, MDL-based decision tree pruning, in Proc. of KDD, 1995, AAAI Press.

[65] G. L. Miller, Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology, PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang, A decision theoretical based system for information downgrading, in Proceedings of the 5th Joint Conference on Information Sciences, 2000, ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane, Toward Standardization in Privacy Preserving Data Mining, ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004, ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane, Privacy Preserving Frequent Itemset Mining, in Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002, Australian Computer Society Inc.

[69] S. R. M. Oliveira and O. R. Zaiane, Privacy Preserving Clustering by Data Transformation, in Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr, Data Quality and System Theory, Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998, ACM Press.

[71] M. A. Palley and J. S. Simonoff, The use of regression methodology for compromise of confidential information in statistical databases, ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu, An Effective Hash Based Algorithm for Mining Association Rules, Proceedings of ACM SIGMOD, pp. 175-186, May 1995, ACM Press.

[73] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro, Discovery, analysis and presentation of strong rules, Knowledge Discovery in Databases, pp. 229-238, AAAI/MIT Press, 1991.


[75] A. D. Pratt, The Information of the Image, Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[77] J. R. Quinlan, Induction of decision trees, Machine Learning, vol. 1, pp. 81-106, 1986, Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim, Mining Optimized Association Rules with Categorical and Numeric Attributes, Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff, The Illusion of Reality, Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa, Maintaining Data Privacy in Association Rule Mining, in Proceedings of the 28th International Conference on Very Large Databases, 2003, Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa, Local learning in probabilistic networks with hidden variables, in International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995, Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362, Cambridge, MA, MIT Press, 1986.

[84] G. Sande, Automated cell suppression to preserve confidentiality of business statistics, in Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe, An efficient algorithm for mining association rules in large databases, in Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995, Morgan Kaufmann.

[86] J. Schlorer, Information loss in partitioned statistical databases, Comput. J., 26, 3, pp. 218-223, 1983, British Computer Society.

[87] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, vol. 27 (July and October), 1948, pp. 379-423, pp. 623-656.

[88] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani, Statistical databases: characteristics, problems and some solutions, Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982, Morgan Kaufmann Publishers Inc.

[90] R. Sibson, SLINK: An optimally efficient algorithm for the single link cluster method, Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman, An information theoretic Approach to Rule Induction from databases, IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, August 1992, pp. 301-316, IEEE Press.

[92] L. Sweeney, Achieving k-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002, World Scientific Publishing Co. Inc.

[93] R. Srikant and R. Agrawal, Mining Generalized Association Rules, Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995, Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou, Examining Data Quality, Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998, ACM Press.

[95] M. Trottini, A Decision-Theoretic Approach to data Disclosure Problems, Research in Official Statistics, vol. 4, pp. 722, 2001.

[96] M. Trottini, Decision models for data disclosure limitation, Carnegie Mellon University, available at httpwwwnissorgdgiiTRThesisTrottini-finalpdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University, CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002, ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis, State-of-the-art in Privacy Preserving Data Mining, SIGMOD Record, 33(1), pp. 50-57, 2004, ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni, Association Rule Hiding, IEEE Transactions on Knowledge and Data Engineering, 2003, IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe, Intrinsic classification by MML: The Snob program, in Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, World Scientific Publishing Co., Armidale, Australia, 1994.


[102] G. J. Walters, Philosophical Dimensions of Privacy: An Anthology, Cambridge University Press, 1984.

[103] G. J. Walters, Human Rights in an Information Age: A Philosophical Analysis, chapter 5, University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang, Anchoring Data Quality Dimensions in Ontological Foundations, Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996, ACM Press.

[105] R. Y. Wang and D. M. Strong, Beyond Accuracy: what Data Quality Means to Data Consumers, Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal, Elements of statistical disclosure control, Lecture Notes in Statistics, Vol. 155, Springer-Verlag, New York.

[107] N. Ye and X. Li, A Scalable Clustering Technique for Intrusion Signature Recognition, 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001, IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li, New algorithms for fast discovery of association rules, in Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997, AAAI Press.

European Commission, EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm. EUR – Scientific and Technical Research series – ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 24: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

22 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the number |D| of transactions in D a set Rh of rules to hide the min conf thresholdOUTPUT the database D transformed so that the rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 T prime

lr= t isin D | t does not fully supports rr and partially supports lr

foreach transaction t in T prime

lrdo

2 tnum items = |I| minus |lr cap tvalues of items|

3 Sort(T prime

lr) in descending order of number of items of lr contained

Repeat until(conf(r) lt min conf)

4 t = T prime

lr[1]

5 set all ones(tvalues of items lr)6 Nlr = Nlr + 17 conf(r) = NrNlr

8 Remove t from T prime

lr9 Remove r from Rh

End

Figure 31 Algorithm 1 for Rule Hiding by Confidence Decrease

34 RULE HIDING ALGORITHMS BASED ON DATA FUZZIFICATION23

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(rr)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 supp(r) = Nr|D|9 conf(r) = NrNlr

10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so that therules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in decreasing order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(r)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 Recompute Nlr

9 supp(r) = Nr|D|10 conf(r) = NrNlr

11 Remove t from Tr

12 Remove r from Rh

End

Figure 32 Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset while the maximumsupport represents the percentage of transactions that support or could supportthe itemset The minimum confidence of a rule is obviously the minimum levelof confidence that the rule can assume based on the support value and similarly

for the maximum confidence Given a rule r minconf(r) = minsup(r)lowast100maxsup(lr) and

maxconf(r) = maxsup(r)lowast100minsup(lr) where lr denotes the rule antecedent

Considering the support interval and the minimum support threshold MST we have the following cases for an itemset A

bull A is hidden when maxsup(A) is smaller than MST

bull A is visible with an uncertainty level when minsup(A) le MST lemaxsup(A)

bull A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confi-dence threshold (MCT )

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

• The cost of the Impact Item Set computation is less immediate to formulate. The computation of the sanitization steps is linear (O(⌊Sup(Rh)⌋)). The constraint exploration, in the worst case, has a complexity in O(NC), where NC is the total number of constraints contained in the Asset; this operation is repeated for each item in the rule Rh, and thus takes |Items| · O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, with complexity |itemset| · O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

Begin
   Zitem = DQDB_Zero_Impact(Asset, Rh)
   if (Zitem ≠ ∅) then item = random_sel(Zitem)
   else
   {
      Impact_set = Item_Impact_Rank(Asset, Rh, sup(Rh), sup(ant(Rh)), Threshold)
      item = best(Impact_set)
   }
   Tr = {t ∈ D | t fully supports Rh}
   While (Rh.conf > Threshold) do
      sanitize(Tr, item)
End

Figure 3.9: The Distortion-Based Sanitization Algorithm

3.8 Data Quality Based Blocking Algorithm

In the blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as in the distortion-based algorithm. The strategy we propose here, shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how the impact on data quality is computed. For this reason the Zero Impact algorithm is exactly the same presented before for the DQDB algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank algorithm. The major modifications are related to the evaluation of the number of steps (which is of course different) and to the data quality evaluation. In fact, in this case it does not make


sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (completeness, consistency) and not the pair (accuracy, consistency). Figure 3.11 reports the new Impact Item Rank algorithm.

[Figure 3.10 shows the flow chart of the BBDB Algorithm: identify a sensitive rule; identify the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (completeness, consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert the item value to the blocking symbol "?", until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the cost is the same as that of the previous algorithm.

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

Begin
   Qitem = ∅
   Conf_a = sup(Rh) / sup(ant(Rh))
   Nstep_ant = count_step_ant(Conf_a, Thresholds(Min, Max))
   Nstep_post = count_step_post(Conf_a, Thresholds(Min, Max))
   For each item in Rh do
   {
      if (item ∈ ant(Rh)) then N = Nstep_ant
      else N = Nstep_post
      For each AIS ∈ Asset do
      {
         node = recover_item(AIS, item)
         Complet_Cost = node.complet_weight * N
         Constr_Cost = Constr_surf(N, node)
         item.impact = item.impact + (Complet_Cost * AIS.Complet_weight)
                                   + (Constr_Cost * AIS.Constr_weight)
      }
   }
   sort_by_impact(Qitem)
   Return(Qitem)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking-Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement to our strategy that addresses (but does not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. Given the sensitive set, we identify the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. All the rules supported by the chosen item are sanitized simultaneously, and they are removed from the sensitive set as soon as they turn out to be hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can easily be applied to the algorithms presented before. In the remainder of this report we therefore present the corresponding algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

Begin
   Rsup = Rs
   while (Rsup ≠ ∅) do
   {
      select r from Rsup
      for (i = 0; i ≤ transaction_max_size; i++) do
         if (item[i] ∈ r) then
         {
            item[i].count++
            item[i].list = item[i].list ∪ r
         }
      Rsup = Rsup - r
   }
   sort(item)
   Return(item)
End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

Begin
   Rsup = Rs
   Isup = list of all the items
   Zitem = ∅
   while (Rsup ≠ ∅) and (Isup ≠ ∅) do
   {
      select r from Rsup
      for each (item ∈ r) do
      {
         if (item ∈ Isup) then
         {
            if (item ∉ Asset) then Zitem = Zitem + item
            Isup = Isup - item
         }
      }
      Rsup = Rsup - r
   }
   return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules; the Zero Impact Items are then classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether each item contained in the item list supports the rule. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| · |Item list|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the zero-impact items for all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by each of these items;

• the items, supporting a rule in the sensitive rules set, that have the important property of having no impact on the DQ of the database.

Using this information, we are able to start the sanitization process with the rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence of the rules contained in the Rss set; every rule that turns out to be hidden is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• a sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• a partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

Begin
   Izm = {item ∈ Itemlist | item ∈ Zitem}
   for (i = 0; i < |Izm|; i++) do
   {
      Rss = {r ∈ Rs | Izm[i] ∈ r}
      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
      while (Tss ≠ ∅) and (Rss ≠ ∅) do
      {
         select a transaction t from Tss
         Tss = Tss - t
         Sanitize(t, Izm[i])
         Recompute_Sup_Conf(Rss)
         forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
         {
            Rss = Rss - r
            Rs = Rs - r
         }
      }
   }
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to preserve as much as possible a good DQ, to extend the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated DQ impact is then associated with the item.

3. The items are sorted by maximum impact and by number of supported rules.

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence of the rules contained in the Rss set; every rule that turns out to be hidden is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, the algorithm does not specify whether the sanitization is based on blocking or on distortion: we consider this operation as a black box, and the proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank. It then tries to hide the rules by modifying the Zero Impact Items and, finally, if other rules remain to be hidden, the Low Impact algorithm is invoked.

INPUT: the Asset schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

Begin
   Iml = {item ∈ Itemlist | item supports a rule r ∈ Rs}
   for each item ∈ Iml do
   {
      for each rule ∈ item.list do
      {
         N = compute_step(item, thresholds)
         For each AIS ∈ Asset do
         {
            node = recover_item(AIS, item)
            item.list.rule.cost = item.list.rule.cost + Compute_Cost(node, N)
         }
      }
      item.max_cost = max_cost(item)
   }
   Sort Iml by max cost and number of rules supported
   While (Rs ≠ ∅) do
   {
      Rss = {r ∈ Rs | Iml[1] ∈ r}
      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
      while (Tss ≠ ∅) and (Rss ≠ ∅) do
      {
         select a transaction t from Tss
         Tss = Tss - t
         Sanitize(t, Iml[1])
         Recompute_Sup_Conf(Rss)
         forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
         {
            Rss = Rss - r
            Rs = Rs - r
         }
      }
      Iml = Iml - Iml[1]
   }
End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

Begin
   Item = Item_Occ_Class(Rs, Itemlist)
   Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
   if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
   if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ, and we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database; this lack is the main cause of the DQ problem, since DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute involved in a piece of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema we proposed a first strategy and two PPDM algorithms (distortion- and blocking-based) with the aim of obtaining a sanitization that is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules, and on the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM algorithm.


Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases: a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principles of Database Systems pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference on Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou and H L Pazer Modelling Data and Process Quality in Multi Input Multi Output Information Systems Management Science Vol 31 Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955


[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset counting and implication rules for market basket data In Proc of the ACM SIGMOD International Conference on Management of Data year 1997 ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass): Theory and Results Advances in Knowledge Discovery and Data Mining AAAI Press/MIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining: An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM Trans Database Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control J Am Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support In Proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Computer Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databases Computer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Reading Mass 1982

[26] V Dhar Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems Information Systems Volume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclosure Control Methods for Microdata Confidentiality Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies pp 113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Holland year 2002

[28] P Domingos and M Pazzani Beyond independence: Conditions for the optimality of the simple Bayesian classifier Proceedings of the Thirteenth International Conference on Machine Learning pp 105-112 San Francisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly 1999

[30] P O Duda and P E Hart Pattern Classification and Scene Analysis Wiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vs data utility: The R-U confidentiality map Tech Rep No 121 National Institute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in vertically partitioned databases In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizes shapes and densities in noisy high dimensional data In Proceedings of the SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data Mining SIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACM Press


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architecture Advances in Neural Information Processing Systems 2 pp 524-532 Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectures on Physics v I Reading Massachusetts Addison-Wesley Publishing Company year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines Proc Tenth ACM Symposium on Theory of Computing (1978) pp 114-118 ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discovery in Databases: An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturing testing IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEE Press

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE: An efficient clustering algorithm for large databases In Proceedings of the ACM SIGMOD Conference pp 73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK: A robust clustering algorithm for categorical attributes In Proceedings of the 15th ICDE pp 512-521 Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining: Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Jim Gray Series Editor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidate generation In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data Dallas Texas USA May 2000 ACM Press

[48] M A Hanson and R L Brekke Workload management expert system: combining neural networks and rule-based programming in an operational application In Proceedings Instrument Society of America pp 1721-1726 year 1988


[49] J Hartigan and M Wong Algorithm AS136: A k-means clustering algorithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large multimedia databases with noise In Proceedings of the 4th ACM SIGKDD pp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery pp 24-31 year 2002 IEEE Educational Activities Department

[54] G Karypis E Han and V Kumar CHAMELEON: A hierarchical clustering algorithm using dynamic modeling COMPUTER 32 pp 68-75 year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data: An Introduction to Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The EM algorithm for graphical association models with missing data Computational Statistics and Data Analysis 19 (2) pp 191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion Detection In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource: properties implications and prescriptions Sloan Management Review Cambridge Vol 40 Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal of Cryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journal of the American Society for Information Science 48 (3) pp 254-269 year 1997


[62] M Masera I Nai Fovino and R Sgnaolin A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure 29th ESReDA Seminar "Systems Analysis for a More Secure World" year 2005

[63] G McLachlan and K Basford Mixture Models: Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruning In Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology PhD thesis Library and Information Studies Rutgers New Brunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system for information downgrading In Proceedings of the 5th Joint Conference on Information Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in Privacy Preserving Data Mining ACM SIGKDD 3rd Workshop on Data Mining Standards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent Itemset Mining In Proceedings of the IEEE International Conference on Privacy Security and Data Mining pp 43-54 year 2002 Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering by Data Transformation In Proceedings of the 18th Brazilian Symposium on Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41 Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology for compromise of confidential information in statistical databases ACM Trans Database Syst 12 4 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithm for Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strong rules Knowledge Discovery in Databases pp 229-238 AAAI/MIT Press year 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C4.5: Programs for Machine Learning Morgan Kaufmann year 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp 81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Data Mining and Knowledge Discovery vol 4 n 4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Categorical and Numeric Attributes Proc of International Conference on Data Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in Association Rule Mining In Proceedings of the 28th International Conference on Very Large Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning in probabilistic networks with hidden variables In International Joint Conference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kaufmann

[83] D E Rumelhart G E Hinton and R J Williams Learning internal representations by error propagation Parallel Distributed Processing: Explorations in the Microstructure of Cognition Vol 1: Foundations pp 318-362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to preserve confidentiality of business statistics In Proceedings of the 2nd International Workshop on Statistical Database Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm for mining association rules in large databases In Proceedings of the Conference on Very Large Databases Zurich Switzerland September 1995 Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases Comput J 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 pp 379-423 (July) and pp 623-656 (October) 1948

[88] C E Shannon and W Weaver The Mathematical Theory of Communication University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases: characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK: An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An Information Theoretic Approach to Rule Induction from Databases IEEE Transactions on Knowledge and Data Engineering vol 3 n 4 August 1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Proceedings of the 21st International Conference on Very Large Data Bases pp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi and D P Ballou Examining Data Quality Comm of the ACM Vol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to Data Disclosure Problems Research in Official Statistics vol 4 pp 7-22 year 2001

[96] M Trottini Decision Models for Data Disclosure Limitation Carnegie Mellon University Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University Codmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Mining in Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin and Y Theodoridis State-of-the-art in Privacy Preserving Data Mining SIGMOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E Dasseni Association Rule Hiding IEEE Transactions on Knowledge and Data Engineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML: The Snob program In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence pp 37-44 UNE World Scientific Publishing Co Armidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy: An Anthology Cambridge University Press year 1984

[103] G J Walters Human Rights in an Information Age: A Philosophical Analysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in Ontological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95 Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy: what Data Quality Means to Data Consumers Journal of Management Information Systems Vol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure control Lecture Notes in Statistics Vol 155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signature Recognition 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press

European Commission, EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm. EUR – Scientific and Technical Research series – ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 25: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

34 RULE HIDING ALGORITHMS BASED ON DATA FUZZIFICATION23

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(rr)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 supp(r) = Nr|D|9 conf(r) = NrNlr

10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D the size of thedatabase |D| a set Rh of rules to hide themin supp threshold the min conf thresholdOUTPUT the database D transformed so that therules in Rh cannot be mined

Begin

Foreach rule r in Rh do

1 Tr = t isin D | t fully supports rforeach transaction t in Tr do

2 tnum items = count(t)

3 Sort(Tr) in decreasing order of transaction sizeRepeat until (supp(r) lt min sup)

or (conf(r) lt min conf)

4 t = Tr[1]5 j = choose item(r)6 set to zero(jtvalues of items)7 Nr = Nr minus 18 Recompute Nlr

9 supp(r) = Nr|D|10 conf(r) = NrNlr

11 Remove t from Tr

12 Remove r from Rh

End

Figure 32 Algorithms 2 and 3 for Rule Hiding by Support Decrease

centage of transactions that certainly support the itemset while the maximumsupport represents the percentage of transactions that support or could supportthe itemset The minimum confidence of a rule is obviously the minimum levelof confidence that the rule can assume based on the support value and similarly

for the maximum confidence Given a rule r minconf(r) = minsup(r)lowast100maxsup(lr) and

maxconf(r) = maxsup(r)lowast100minsup(lr) where lr denotes the rule antecedent

Considering the support interval and the minimum support threshold MST we have the following cases for an itemset A

bull A is hidden when maxsup(A) is smaller than MST

bull A is visible with an uncertainty level when minsup(A) le MST lemaxsup(A)

bull A is visible if minsup(A) is greater than or equal to MST

The same reasoning applies to the confidence interval and the minimum confi-dence threshold (MCT )

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

bull The Cost of the Impact Item Set computation is less immediate to formu-late However the computation of the sanitization step is linear (O(bSup(Rh)c))Moreover the constraint exploration in the worst case has a complex-ity in O(NC) where NC is the total number of constraints containedin the Asset This operation is repeated for each item in the rule Rh

and then it takes |Items|O(NC) Finally in order to sort the Item-set we can use the MergeSort algorithm and then the complexity is|itemset|O(lg(|itemset|))

bull The sanitization has a complexity in O(N) where N is the number oftransaction modified in order to hide the rule

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hidden the target database DOUTPUT the Sanitized Database

1Begin

2 Zitem=DQDB Zero Impact(AssetRh)3 if (Zitem 6= empty) then item=random sel(Zitem)4 else5 6 Impact set=Items Impact rank(AssetRhSup(Rh)Sup(ant(Rh)) Threshold)7 item=best(Impact set)8 9 Tr=t isin D|tfully support Rh10While (RhConf gtThreshold) do11sanitize(Tritem)12End

Figure 39 The Distortion Based Sanitization Algorithm

38 Data Quality Based Blocking Algorithm

In the Blocking based algorithms the idea is to substitute the value of an itemsupporting the rule we want to hide with a meaningless symbol In this way weintroduce an uncertainty related to the question ldquoHow to interpret this valuerdquoThe final effect is then the same of the Distortion Based Algorithm The strategywe propose here as showed in Figure 310 is very similar to the one presentedin the previous section the only difference is how to compute the Impact on theData Quality For this reason the Zero Impact Algorithm is exactly the samepresented before for the DQDB Algorithm However some adjustments arenecessary in order to implement the Impact Item Rank Algorithm The majorsmodifications are related to the Number of Step Evaluation (that is of coursedifferent) and the Data Quality evaluation In fact in this case it does not make

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

[Flow chart omitted. Its steps: identify a sensitive rule; compute the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset, evaluate the DQ impact (Completeness, Consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert its value to "?", until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
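With transactions represented as item-to-value dictionaries, the operational difference between the two families reduces to the single sanitization step below (a sketch under our own naming, with "?" as the meaningless symbol):

    def sanitize_distortion(transaction, item):
        transaction[item] = 0        # DQDB: flip the supporting value from 1 to 0

    def sanitize_blocking(transaction, item):
        transaction[item] = '?'      # BBDB: replace it with the meaningless symbol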

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we note an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)


INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the threshold Th
OUTPUT: the set Qitem, ordered by quality impact

1  Begin
2    Qitem = ∅
3    Confa = sup(Rh) / sup(ant(Rh))
4    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6    For each item in Rh do
7    {
8      if (item ∈ ant(Rh)) then N = Nstep_ant
9      else N = Nstep_post
10     For each AIS ∈ Asset do
11     {
12       node = recover_item(AIS, item)
13       Complet_Cost = node.Complet_Weight * N
14       Constr_Cost = Constr_surf(N, node)
15       item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16     }
17   }
18   sort_by_impact(Items)
19   Return(Items)
   End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

4 We reapply the process for all the rules contained in the set of sensitiverules

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, we propose here an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we try to limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limit the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform in a burst the sanitization of all the rules. The strategy we adopt can be summarized as follows:

1 We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2 We identify, given the sensitive set, the Global Set of Zero Impact Items.

3 We order the Zero Impact Set by considering the previously counted occurrences.

4 The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5 Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden.

6 If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7 From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8 The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Rules Zero Impact Hiding

• Multi Rules Low Impact Hiding

This strategy can easily be applied to the algorithms presented before. For this reason, in the remainder of this report we present the individual algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction_maxsize; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ {r}
11       }
12     Rsup = Rsup − r
13   }
14   sort(item)
15   Return(item)
   End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11       {
12         if (item ∉ Asset) then Zitem = Zitem + item
13         Isup = Isup − item
14       }
15     }
16     Rsup = Rsup − r
17   }
18   return(Zitem)
   End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. In more detail, in this phase the items are classified according to the number of supported rules; the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether it is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches, in a single burst, for the Zero Impact items of all the sensitive rules.
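A minimal Python rendering of the two pre-hiding routines may help; rules are modelled as (antecedent, items) pairs of frozensets and the Asset is summarized by the set of items it mentions (these representations, like all names below, are our assumptions):

    from collections import defaultdict

    def classify_item_occurrences(sensitive_rules):
        # Figure 3.12: for every item, count how many sensitive rules it
        # supports and remember which rules they are.
        count, rules_of = defaultdict(int), defaultdict(list)
        for r in sensitive_rules:            # r = (antecedent, items)
            for item in r[1]:
                count[item] += 1
                rules_of[item].append(r)
        ranked = sorted(count, key=count.get, reverse=True)  # most rules first
        return ranked, rules_of

    def multi_rule_zero_impact(sensitive_rules, asset_items):
        # Figure 3.13: an item is Zero Impact when it appears in some
        # sensitive rule but in no DMG of the Asset.
        return {item for r in sensitive_rules for item in r[1]
                if item not in asset_items}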

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, in more detail, the exact number of rules supported by these items.

• The items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1 We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2 For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3 We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4 We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5 These operations are repeated for every item contained in the Izm set

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present here the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
14       {
15         Rss = Rss − r
16         Rs = Rs − r
17       }
     }
   }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
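The following Python sketch mirrors Figure 3.14 under the same illustrative model used above (transactions as sets, rules as (antecedent, items) pairs); sanitize is the black-box distortion step, e.g. lambda t, i: t.discard(i):

    def support(db, itemset):
        # sup(X): number of transactions (sets of items) containing X
        return sum(1 for t in db if itemset <= t)

    def confidence(db, rule):
        ant, items = rule                 # rule = (antecedent, all items of Rh)
        s = support(db, ant)
        return support(db, items) / s if s else 0.0

    def multi_rules_zero_impact_hiding(db, rules, zitems_ranked,
                                       min_sup, min_conf, sanitize):
        # zitems_ranked: the Izm set, ordered by the occurrences of Figure 3.12.
        remaining = list(rules)
        for item in zitems_ranked:
            rss = [r for r in remaining if item in r[1]]          # rules using item
            tss = [t for t in db if any(r[1] <= t for r in rss)]  # supporting trans.
            while tss and rss:
                sanitize(tss.pop(), item)                         # black-box step
                rss = [r for r in rss if support(db, r[1]) >= min_sup
                       and confidence(db, r) >= min_conf]         # keep unhidden rules
            remaining = [r for r in remaining if r in rss or item not in r[1]]
        return remaining    # empty iff every sensitive rule was hidden

The returned list is empty exactly in the first, optimal case described above; otherwise it contains the rules to be passed to the Low Impact phase.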

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithms presented in the previous data quality sections, which have the advantage of being very efficient, or, if we prefer to preserve as much DQ as possible, to extend the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1 We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2 For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3 The items are sorted by maximum impact and by number of supported rules.

4 For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5 We then select a transaction in Tss and we perform the sanitization on this transaction (by blocking or distortion).

6 We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7 These operations are repeated for every item contained in the ordered set, until all the rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve. Figure 3.16 reports the global algorithm formalizing our strategy. In more detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items, the target database D
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.rule.cost = item.rule.cost + ComputeCost(node, N)
12       }
13     }
14     item.maxcost = maxcost(item)
15   }
16   Sort Iml by max cost and number of rules supported
17   While (Rs ≠ ∅) do
18   {
19     Rss = {r ∈ Rs | Iml[1] ∈ r}
20     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21     while (Tss ≠ ∅) and (Rss ≠ ∅) do
22     {
23       select a transaction t from Tss
24       Tss = Tss − t
25       Sanitize(t, Iml[1])
26       Recompute_Sup_Conf(Rss)
27       forall r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
28       {
29         Rss = Rss − r
30         Rs = Rs − r
31       }
32     }
33     Iml = Iml − Iml[1]
34   }
   End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
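Putting everything together, a hedged Python sketch of the general algorithm of Figure 3.16 follows; it reuses the helper sketches given earlier and inlines a simplified version of the Low Impact loop of Figure 3.15, with dq_impact (item -> estimated DQ cost) standing in for the Asset inspection:

    def dq_based_sanitization(db, rules, asset_items, dq_impact,
                              min_sup, min_conf, sanitize):
        ranked, _ = classify_item_occurrences(rules)           # Figure 3.12
        zitems = multi_rule_zero_impact(rules, asset_items)    # Figure 3.13
        zranked = [i for i in ranked if i in zitems]
        remaining = multi_rules_zero_impact_hiding(            # Figure 3.14
            db, rules, zranked, min_sup, min_conf, sanitize)
        tried = set(zranked)
        while remaining:                                       # Figure 3.15, simplified
            candidates = sorted({i for r in remaining for i in r[1]} - tried,
                                key=lambda i: (dq_impact.get(i, 0.0),
                                               ranked.index(i)))
            if not candidates:
                break                        # no modifiable item is left
            item = candidates[0]             # cheapest DQ impact; ties go to the
            tried.add(item)                  # item supporting more rules
            rss = [r for r in remaining if item in r[1]]
            tss = [t for t in db if any(r[1] <= t for r in rss)]
            while tss and rss:
                sanitize(tss.pop(), item)
                rss = [r for r in rss if support(db, r[1]) >= min_sup
                       and confidence(db, r) >= min_conf]
            remaining = [r for r in remaining if r in rss or item not in r[1]]
        return db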

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem; in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated PPDM Data Quality Based Algorithm.



Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical databases: a comparative study ACM Comput Surv Volume 21(4) pp 515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification of Privacy Preserving Data Mining Algorithms In Proceedings of the 20th ACM Symposium on Principles of Database Systems pp 247-255 year 2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules between Sets of Items in Large Databases Proceedings of ACM SIGMOD pp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Proceedings of the ACM SIGMOD Conference on Management of Data pp 439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rules In Proceedings of the 20th International Conference on Very Large Databases Santiago Chile June 1994 Morgan Kaufmann

[6] G M Amdahl Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities AFIPS Conference Proceedings (April 1967) pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V S Verykios Disclosure Limitation of Sensitive Rules In Proceedings of the IEEE Knowledge and Data Engineering Workshop pp 45-52 year 1999 IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in Multi Input Multi Output Information Systems Management Science Vol 31 Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Science volume 22 pp 86-105 year 1955



[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework for Evaluating Privacy Preserving Data Mining Algorithms Data Mining and Knowledge Discovery Journal year 2005 Kluwer

[11] E Bertino and I Nai Fovino Information Driven Evaluation of Data Hiding Algorithms 7th International Conference on Data Warehousing and Knowledge Discovery Copenhagen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEE Trans Inform Theory vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification and Regression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset counting and implication rules for market basket data In Proc of the ACM SIGMOD International Conference on Management of Data year 1997 ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decision trees applied to the inference problem In Proceedings of the 1998 New Security Paradigms Workshop pp 82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass): Theory and Results Advances in Knowledge Discovery and Data Mining AAAI Press/MIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining: An Overview from a Database Perspective IEEE Transactions on Knowledge and Data Engineering vol 8 (6) pp 866-883 year 1996 IEEE Educational Activities Department

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statistical databases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year 1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM Trans Database Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control J Am Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino Hiding Association Rules by using Confidence and Support In proceedings of the 4th Information Hiding Workshop pp 369-383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Computer Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databases Computer 16 (7) pp 69-82 year 1983 (July) IEEE Press


[24] D Denning Secure statistical databases with random sample queries ACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Reading Mass 1982

[26] V Dhar Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems Information Systems Volume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclosure Control Methods for Microdata Confidentiality Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies pp 113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Holland year 2002

[28] P Domingos and M Pazzani Beyond independence: Conditions for the optimality of the simple Bayesian classifier Proceedings of the Thirteenth International Conference on Machine Learning pp 105-112 San Francisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly 1999

[30] R O Duda and P E Hart Pattern Classification and Scene Analysis Wiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vs data utility: The R-U confidentiality map Tech Rep No 121 National Institute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in vertically partitioned databases In Crypto 2004 Vol 3152 pp 528-544

[33] D L Eager J Zahorjan and E D Lazowska Speedup Versus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3 (March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizes shapes and densities in noisy high dimensional data In Proceedings of the SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X Xu A density-based algorithm for discovering clusters in large spatial databases with noise In Proceedings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996 AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data Mining SIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACM Press


[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy Preserving Mining of Association Rules In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining year 2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architecture Advances in Neural Information Processing Systems 2 pp 524-532 Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectures on Physics v I Reading Massachusetts Addison-Wesley Publishing Company year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines Proc Tenth ACM Symposium on Theory of Computing (1978) pp 114-118 ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discovery in Databases: An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturing testing IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEE Press

[43] S P Ghosh An application of statistical databases in manufacturing testing In Proceedings of IEEE COMPDEC Conference pp 96-103 year 1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE: An efficient clustering algorithm for large databases In Proceedings of the ACM SIGMOD Conference pp 73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK: A robust clustering algorithm for categorical attributes In Proceedings of the 15th ICDE pp 512-521 Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining: Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Jim Gray Series Editor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidate generation In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data Dallas Texas USA May 2000 ACM Press

[48] M A Hanson and R L Brekke Workload management expert system - combining neural networks and rule-based programming in an operational application In Proceedings Instrument Society of America pp 1721-1726 year 1988


[49] J Hartigan and M Wong Algorithm AS136: A k-means clustering algorithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large multimedia databases with noise In Proceedings of the 4th ACM SIGKDD pp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and D Wang A Logical Model for Privacy Protection Lecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124 Springer-Verlag

[52] IBM Synthetic Data Generator http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery pp 24-31 year 2002 IEEE Educational Activities Department

[54] G Karypis E Han and V Kumar CHAMELEON: A hierarchical clustering algorithm using dynamic modeling COMPUTER 32 pp 68-75 year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data: An Introduction to Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The EM algorithm for graphical association models with missing data Computational Statistics and Data Analysis 19 (2) pp 191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion Detection In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource: properties implications and prescriptions Sloan Management Review Cambridge Vol 40 Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal of Cryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journal of the American Society for Information Science 48 (3) pp 254-269 year 1997


[62] M Masera I Nai Fovino R Sgnaolin A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure 29th ESReDA Seminar "Systems Analysis for a More Secure World" year 2005

[63] G McLachlan and K Basford Mixture Models: Inference and Applications to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruning In Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology PhD thesis Library and Information Studies Rutgers New Brunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system for information downgrading In Proceedings of the 5th Joint Conference on Information Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in Privacy Preserving Data Mining ACM SIGKDD 3rd Workshop on Data Mining Standards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent Itemset Mining Proceedings of the IEEE international conference on Privacy security and data mining pp 43-54 year 2002 Australian Computer Society Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering by Data Transformation In Proceedings of the 18th Brazilian Symposium on Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41 Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology for compromise of confidential information in statistical databases ACM Trans Database Syst 12 4 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithm for Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data Kluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strong rules Knowledge Discovery in Databases pp 229-238 AAAI/MIT Press year 1991


[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C4.5: Programs for Machine Learning Morgan Kaufmann year 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp 81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Data Mining and Knowledge Discovery vol 4 n 4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Categorical and Numeric Attributes Proc of International Conference on Data Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in Association Rule Mining In Proceedings of the 28th International Conference on Very Large Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning in probabilistic networks with hidden variables In International Joint Conference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kaufmann

[83] D E Rumelhart G E Hinton and R J Williams Learning internal representations by error propagation Parallel distributed processing: Explorations in the microstructure of cognition Vol 1: Foundations pp 318-362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to preserve confidentiality of business statistics In Proceedings of the 2nd International Workshop on Statistical Database Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm for mining association rules in large databases In Proceedings of the Conference on Very Large Databases Zurich Switzerland September 1995 Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases Comput J 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell System Technical Journal vol 27 (July and October) 1948 pp 379-423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communication University of Illinois Press Urbana Ill 1949


[89] A Shoshani Statistical databases: characteristics problems and some solutions Proceedings of the Conference on Very Large Databases (VLDB) pp 208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R Sibson SLINK: An optimally efficient algorithm for the single link cluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach to Rule Induction from databases IEEE Transactions on Knowledge and Data Engineering vol 3 n 4 August 1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using Generalization and Suppression International Journal on Uncertainty Fuzziness and Knowledge-based Systems pp 571-588 year 2002 World Scientific Publishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Proceedings of the 21st International Conference on Very Large Data Bases pp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACM Vol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure Problems Research in Official Statistics vol 4 pp 7-22 year 2001

[96] M Trottini Decision models for data disclosure limitation Carnegie Mellon University Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf year 2003

[97] University of Milan - Computer Technology Institute - Sabanci University Codmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Mining in Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin Y Theodoridis State-of-the-art in Privacy Preserving Data Mining SIGMOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E Dasseni Association Rule Hiding IEEE Transactions on Knowledge and Data Engineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML: The Snob program In the Proceedings of the 7th Australian Joint Conference on Artificial Intelligence pp 37-44 UNE World Scientific Publishing Co Armidale Australia 1994


[102] G J Walters Philosophical Dimensions of Privacy: An Anthology Cambridge University Press year 1984

[103] G J Walters Human Rights in an Information Age: A Philosophical Analysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in Ontological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95 Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy: what Data Quality Means to Data Consumers Journal of Management Information Systems Vol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure control Lecture Notes in Statistics Vol 155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signature Recognition 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on KDD and Data Mining Newport Beach California August 1997 AAAI Press

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities 2008 – 56 pp – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 26: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

24 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of size and support2 Th = TZ | Z isin Lh and forallt isin D

t isin TZ =rArr t supports ZForeach Z in Lh

3 Sort(TZ) in ascending order of transaction sizeRepeat until (supp(r) lt min sup)

4 t = popfrom(TZ)5 a = maximal support item in Z6 aS = X isin Lh|a isin X Foreach X in aS

7 if (t isin TX) delete(t TX)

8 delete(a t D)

End

INPUT the source database D the size of thedatabase |D| a set L of large itemsets the set Lh

of large itemsets to hide the min supp thresholdOUTPUT the database D modified by thehiding of the large itemsets in Lh

Begin

1 Sort(Lh) in descending order of support and sizeForeach Z in Lh do

2 i = 03 j = 04 lt TZ gt = sequence of transactions supporting Z5 lt Z gt = sequence of items in ZRepeat until (supp(r) lt min sup)

6 a =lt Z gt[i]7 t = lt TZ gt[j]8 aS = X isin Lh|a isin X Foreach X in aS

9 if (t isin TX) then delete(t TX)

10 delete(a t D)11 i = (i+1) modulo size(Z)12 j = j+1

End

Figure 33 Algorithms 4 and 5 for Rule Hiding by Support Decrease of largeitemsets

Traditionally a rule hiding process takes place according to two differentstrategies decreasing its support or its confidence In this new context theadopted alternative strategies aim to introduce uncertainty in the frequency orthe importance of the rules to hide The two strategies reduce the minimumsupport and the minimum confidence of the itemsets generating these rules be-low the minimum support threshold (MST ) and minimum confidence threshold(MCT ) correspondingly by a certain safety margin (SM) fixed by the user

The Algorithm 6 for reducing the support of generating large itemset of a sen-sitive rule replaces 1rsquos by ldquo rdquo for the items in transactions supporting theitemset until its minimum support goes below the minimum support thresholdMST by the fixed safety margin SM Algorithms 7 and 8 operate reducing theminimum confidence value of sensitive rules The first one decreases the min-imum support of the generating itemset of a sensitive rule by replacing itemsof the rule consequent by unknown values The second one instead increasesthe maximum support value of the antecedent of the rule to hide via placingquestion marks in the place of the zero values of items in the antecedent

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

bull The Cost of the Impact Item Set computation is less immediate to formu-late However the computation of the sanitization step is linear (O(bSup(Rh)c))Moreover the constraint exploration in the worst case has a complex-ity in O(NC) where NC is the total number of constraints containedin the Asset This operation is repeated for each item in the rule Rh

and then it takes |Items|O(NC) Finally in order to sort the Item-set we can use the MergeSort algorithm and then the complexity is|itemset|O(lg(|itemset|))

bull The sanitization has a complexity in O(N) where N is the number oftransaction modified in order to hide the rule

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hidden the target database DOUTPUT the Sanitized Database

1Begin

2 Zitem=DQDB Zero Impact(AssetRh)3 if (Zitem 6= empty) then item=random sel(Zitem)4 else5 6 Impact set=Items Impact rank(AssetRhSup(Rh)Sup(ant(Rh)) Threshold)7 item=best(Impact set)8 9 Tr=t isin D|tfully support Rh10While (RhConf gtThreshold) do11sanitize(Tritem)12End

Figure 39 The Distortion Based Sanitization Algorithm

38 Data Quality Based Blocking Algorithm

In the Blocking based algorithms the idea is to substitute the value of an itemsupporting the rule we want to hide with a meaningless symbol In this way weintroduce an uncertainty related to the question ldquoHow to interpret this valuerdquoThe final effect is then the same of the Distortion Based Algorithm The strategywe propose here as showed in Figure 310 is very similar to the one presentedin the previous section the only difference is how to compute the Impact on theData Quality For this reason the Zero Impact Algorithm is exactly the samepresented before for the DQDB Algorithm However some adjustments arenecessary in order to implement the Impact Item Rank Algorithm The majorsmodifications are related to the Number of Step Evaluation (that is of coursedifferent) and the Data Quality evaluation In fact in this case it does not make

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact13(Completeness-13

Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value to rdquo13

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 310 Flow Chart of BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for theDBDQ Algorithm The only relevant difference is the insertion of the ldquordquo valueinstead of the 0 value From the complexity point of view the complexity is thesame of the complexity of the previous algorithm

39 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on theDQ of the database is reduced However we note an interesting phenomenonIn what follows we give a description of this phenomena an interpretation anda first solution

Consider a DataBase D to be sanitized D contains a set of four sensitiverules In order to protect these rules we normally execute the following steps

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)

39 THE DATA QUALITY STRATEGY (SECOND PART) 33

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Nstepant=countstep ant(confaThresholds (MinMax))5 Nsteppost=countstep post(confaThresholds (MinMax))6 For each item in Rh do7 8 if (item isin ant(Rh)) then N=Nstepant9 else N=Nsteppost10 For each AIS isin Asset do11 12 node=recover item(AISitem)

24 Complet Cost=nodeComplet Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

+(Complet Cost lowastAISComplet weight)++(Constr Cost lowastAISConstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 311 Item Impact Rank Algorithm for the Blocking Based Algorithm

4 We reapply the process for all the rules contained in the set of sensitiverules

What we observed in a large number of tests is that even if after everysanitization the DQ related to a single rule is preserved (or at least the damageis mitigated) at the end of the whole sanitization (every rule hidden) thedatabase had a lower level of DQ that the one we expected

Reflecting on the motivation of this behavior we argue that the sanitizationof a certain rule Ra may damage the DQ related to a previously sanitized ruleRp having for example some items in common

For this reason here we propose an improvement to our strategy to address(but not to completely solve) this problem It is important to notice that thisis a common problem for all the PPDM Algorithm and it is not related only tothe algorithm we proposed in the previous sections

The new strategy is based on the following intuition The Global DQ degra-dation is due to the fact that in some sanitization items are modified that wereinvolved in other already sanitized rules If we try to limit the number of dif-ferent items modified during the whole process (ie during the sanitization ofthe whole rules set) we obtain a better level of DQ An approach to limit thenumber of modified items is to identify the items supporting the biggest numberof sensitive rules as the most suitable to be modified and then performing in aburst the sanitization of all the rules The strategy we adopt can be summarizedas follows

1 We classify the items supporting the rules contained in the sensitive setanalyzing how many rules simultaneously contain the same item (we count

34 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

the occurrences of the items in the different rules)

2 We identify given the sensitive set the Global Set of Zero Impact Items

3 We order the Zero Impact Set by considering the previously counted oc-currences

4 The sanitization starts by considering the Zero Impact Items that supportthe maximum number of Rules

5 Simultaneously all the rules supported by the chosen item are sanitizedand removed from the sensitive set when they result hidden

6 If the Zero Impact Set is not sufficient to hide every sensitive rule for theremaining items their DQ Impact is computed

7 From the resulting classification if several items exist with the same im-pact the one supporting the highest number of rules is chosen

8 The process continues until all the rules are hidden

From a methodological point of view we can divide the complete strategy weproposed into three main blocks

• Pre-Hiding phase

• Multi Zero Impact Hiding

• Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present these algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction.maxsize; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ r
11       }
12     Rsup = Rsup − r
13   }
14   sort(item)
15   Return(item)
16 End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11       {
12         if (item ∉ Asset) then Zitem = Zitem + item
13         Isup = Isup − item
14       }
15     }
16     Rsup = Rsup − r
17   }
18   return(Zitem)
19 End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier of the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is that it searches in a single burst for the Zero Impact Items of all the sensitive rules.
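As a reading aid for Figures 3.12 and 3.13, the following Python sketch reproduces both pre-hiding steps under simplifying assumptions of ours: rules are (antecedent, consequent) pairs of frozensets, and the Asset is reduced to the plain set of items appearing in some DMG.

def item_occurrences(sensitive_rules):
    # Figure 3.12: for every item, count and record the sensitive rules it supports.
    count, supported = {}, {}
    for rule in sensitive_rules:
        ant, cons = rule
        for item in ant | cons:
            count[item] = count.get(item, 0) + 1
            supported.setdefault(item, []).append(rule)
    # items sorted by decreasing number of occurrences
    ordered = sorted(count, key=count.get, reverse=True)
    return ordered, count, supported

def zero_impact_items(sensitive_rules, asset_items):
    # Figure 3.13: items of some sensitive rule that appear in no DMG of the
    # Asset (hence modifying them has no effect on the DQ of the database).
    zitem = set()
    for ant, cons in sensitive_rules:
        for item in ant | cons:
            if item not in asset_items:
                zitem.add(item)
    return zitem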

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• The items supporting more than one rule and, more in detail, the exact number of rules supported by these items.

• The items supporting a rule of the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
14       {
15         Rss = Rss − r
16         Rs = Rs − r
17       }
18     }
19   }
20 End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
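The Python sketch below mirrors Figure 3.14 for the distortion case; the database encoding (a list of mutable item sets), the helper functions and their names are illustrative assumptions of ours, and Sanitize is rendered as the removal of the chosen item from the transaction (the blocking variant would instead replace it with a meaningless symbol).

def support(itemset, db):
    # fraction of transactions fully supporting the itemset (db assumed non-empty)
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(rule, db):
    ant, cons = rule
    sup_ant = support(ant, db)
    return support(ant | cons, db) / sup_ant if sup_ant else 0.0

def mr_zero_impact_hiding(db, rules, items, minsup, minconf):
    # For every (zero impact) item, sanitize supporting transactions until
    # the rules supported by that item fall below the thresholds.
    rs = set(rules)
    for z in items:
        rss = {r for r in rs if z in (r[0] | r[1])}
        tss = [t for t in db if any((r[0] | r[1]) <= t for r in rss)]
        while tss and rss:
            t = tss.pop()
            t.discard(z)  # Sanitize(t, z): distortion, item value 1 -> 0
            hidden = {r for r in rss
                      if support(r[0] | r[1], db) < minsup
                      or confidence(r, db) < minconf}
            rss -= hidden
            rs -= hidden
    return db, rs  # rs now holds the rules still to be hidden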

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        For each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         itemlist.rule.cost = itemlist.rule.cost + ComputeCost(node, N)
12       }
13     }
14     item.maxcost = maxcost(item)
15   }
16   Sort Iml by max cost and number of rules supported
17   While (Rs ≠ ∅) do
18   {
19     Rss = {r ∈ Rs | Iml[1] ∈ r}
20     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21     while (Tss ≠ ∅) and (Rss ≠ ∅) do
22     {
23       select a transaction t from Tss
24       Tss = Tss − t
25       Sanitize(t, Iml[1])
26       Recompute_Sup_Conf(Rss)
27       for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
28       {
29         Rss = Rss − r
30         Rs = Rs − r
31       }
32     }
33     Iml = Iml − Iml[1]
34   }
35 End

Figure 3.15: Low Data Quality Impact Algorithm
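As a companion to the ranking phase of Figure 3.15 (lines 2-16), this hedged Python sketch orders the candidate items; the per-rule cost model (an Asset weight per item multiplied by the estimated number of sanitization steps) is a deliberate simplification of the ComputeCost/recover_item machinery, and all names are our own.

def rank_low_impact_items(rules, asset_weights, steps_needed):
    # asset_weights: item -> DQ weight in the Asset (assumed, simplified)
    # steps_needed:  rule -> estimated number of sanitization steps for it
    costs, supported = {}, {}
    for rule in rules:
        ant, cons = rule
        for item in ant | cons:
            # simplified stand-in for ComputeCost(node, N)
            c = asset_weights.get(item, 0.0) * steps_needed[rule]
            costs.setdefault(item, []).append(c)
            supported[item] = supported.get(item, 0) + 1
    # the maximum estimated DQ impact is associated with every item (step 2)
    max_cost = {item: max(cs) for item, cs in costs.items()}
    # lowest impact first; ties broken in favor of items supporting more rules
    return sorted(max_cost, key=lambda i: (max_cost[i], -supported[i]))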


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
6  End

Figure 3.16: The General Algorithm
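Reusing the helper sketches given above (zero_impact_items, mr_zero_impact_hiding, rank_low_impact_items), the general algorithm of Figure 3.16 reduces to a few hedged Python lines under the same assumed encodings; the hiding loop of the low impact phase is structurally the same as the zero impact one, only the item order differs, which is why the wrapper below simply delegates.

def mr_low_impact_hiding(db, rules, ranked_items, minsup, minconf):
    # Same sanitization loop as mr_zero_impact_hiding, driven by the
    # low-impact ranking instead of the Zero Impact set (see Figure 3.15).
    return mr_zero_impact_hiding(db, rules, ranked_items, minsup, minconf)

def sanitize_database(db, rules, asset_items, asset_weights, steps_needed,
                      minsup, minconf):
    # Figure 3.16: pre-hiding, then zero impact hiding, then low impact hiding
    # (the occurrence classification is folded into the ranking step here).
    zitems = zero_impact_items(rules, asset_items)
    db, remaining = mr_zero_impact_hiding(db, rules, zitems, minsup, minconf)
    if remaining:
        ranked = rank_low_impact_items(remaining, asset_weights, steps_needed)
        db, remaining = mr_low_impact_hiding(db, remaining, ranked,
                                             minsup, minconf)
    return db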

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann, Security-control methods for statistical databases: a comparative study, ACM Comput. Surv., Vol. 21(4), pp. 515-556, 1989, ACM Press.

[2] D. Agrawal and C. C. Aggarwal, On the Design and Quantification of Privacy Preserving Data Mining Algorithms, In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001, ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules between Sets of Items in Large Databases, Proceedings of ACM SIGMOD, pp. 207-216, May 1993, ACM Press.

[4] R. Agrawal and R. Srikant, Privacy Preserving Data Mining, In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000, ACM Press.

[5] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, Morgan Kaufmann.

[6] G. M. Amdahl, Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities, AFIPS Conference Proceedings (April 1967), pp. 483-485, Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios, Disclosure Limitation of Sensitive Rules, In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999, IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer, Modelling Data and Process Quality in Multi Input Multi Output Information Systems, Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel, An examination of information theory, Philosophy of Science, Vol. 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza, A Framework for Evaluating Privacy Preserving Data Mining Algorithms, Data Mining and Knowledge Discovery Journal, 2005, Kluwer.

[11] E. Bertino and I. Nai Fovino, Information Driven Evaluation of Data Hiding Algorithms, 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005, Springer-Verlag.

[12] N. M. Blachman, The amount of information that y gives about X, IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968, IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur, Dynamic itemset counting and implication rules for market basket data, In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997, ACM Press.

[15] L. Chang and I. S. Moskowitz, Parsimonious downgrading and decision trees applied to the inference problem, In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998, ACM Press.

[16] P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and Results, Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996, IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu, Auditing and inference control in statistical databases, IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982, IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu, Statistical database design, ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981, ACM Press.

[20] L. H. Cox, Suppression methodology and statistical disclosure control, J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino, Hiding Association Rules by using Confidence and Support, in Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001, Springer-Verlag.

[22] D. Defays, An efficient algorithm for a complete link method, The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer, Inference control for statistical databases, Computer, 16(7), pp. 69-82, July 1983, IEEE Press.

[24] D. Denning, Secure statistical databases with random sample queries, ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning, Cryptography and Data Security, Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar, Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems, Information Systems, Vol. 23, Number 7, pp. 423-437, 1998.

[27] J. Domingo-Ferrer and V. Torra, A Quantitative Comparison of Disclosure Control Methods for Microdata, in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes and L. Zayatz (eds.), North-Holland, 2002.

[28] P. Domingos and M. Pazzani, Beyond independence: Conditions for the optimality of the simple Bayesian classifier, Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996, Morgan Kaufmann.

[29] P. Drucker, Beyond the Information Revolution, The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes, Disclosure risks vs. data utility: The R-U confidentiality map, Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim, Privacy preserving data mining in vertically partitioned databases, In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska, Speedup Versus Efficiency in Parallel Systems, IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423, IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar, Finding clusters of different sizes, shapes and densities in noisy high dimensional data, In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996, AAAI Press.

[36] A. Evfimievski, Randomization in Privacy Preserving Data Mining, SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002, ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, Privacy Preserving Mining of Association Rules, In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere, The cascade-correlation learning architecture, Advances in Neural Information Processing Systems 2, pp. 524-532, Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands, The Feynman Lectures on Physics, v. I, Reading, Massachusetts, Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie, Parallelism in Random Access Machines, Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118, ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview, AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh, An application of statistical databases in manufacturing testing, IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985, IEEE Press.

[43] S. P. Ghosh, An application of statistical databases in manufacturing testing, In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984, IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim, CURE: An efficient clustering algorithm for large databases, In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998, ACM Press.

[45] S. Guha, R. Rastogi and K. Shim, ROCK: A robust clustering algorithm for categorical attributes, In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999, IEEE Computer Society.

[46] J. Han and M. Kamber, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin, Mining frequent patterns without candidate generation, In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000, ACM Press.

[48] M. A. Hanson and R. L. Brekke, Workload management expert system - combining neural networks and rule-based programming in an operational application, In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong, Algorithm AS136: A k-means clustering algorithm, Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim, An efficient approach to clustering large multimedia databases with noise, In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998, AAAI Press.

[51] T. Hsu, C. Liau and D. Wang, A Logical Model for Privacy Protection, Lecture Notes in Computer Science, Vol. 2200, pp. 110-124, Jan. 2001, Springer-Verlag.

[52] IBM, Synthetic Data Generator, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton, Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002, IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar, CHAMELEON: A hierarchical clustering algorithm using dynamic modeling, Computer, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, 1990.

[56] W. Kent, Data and Reality, North Holland, New York, 1978.

[57] S. L. Lauritzen, The EM algorithm for graphical association models with missing data, Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995, Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo, Data Mining Approaches for Intrusion Detection, In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman, Data as a resource: properties, implications and prescriptions, Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Journal of Cryptology, vol. 15, pp. 177-206, 2002, Springer-Verlag.

[61] R. M. Losee, A Discipline Independent Definition of Information, Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin, A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure, 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal, MDL-based decision tree pruning, In Proc. of KDD, 1995, AAAI Press.

[65] G. L. Miller, Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology, PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang, A decision theoretical based system for information downgrading, In Proceedings of the 5th Joint Conference on Information Sciences, 2000, ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane, Toward Standardization in Privacy Preserving Data Mining, ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004, ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane, Privacy Preserving Frequent Itemset Mining, Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002, Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane, Privacy Preserving Clustering by Data Transformation, In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr, Data Quality and Systems Theory, Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998, ACM Press.

[71] M. A. Palley and J. S. Simonoff, The use of regression methodology for the compromise of confidential information in statistical databases, ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu, An Effective Hash Based Algorithm for Mining Association Rules, Proceedings of ACM SIGMOD, pp. 175-186, May 1995, ACM Press.

[73] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro, Discovery, analysis and presentation of strong rules, Knowledge Discovery in Databases, pp. 229-238, AAAI/MIT Press, 1991.

[75] A. D. Pratt, The Information of the Image, Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[77] J. R. Quinlan, Induction of decision trees, Machine Learning, vol. 1, pp. 81-106, 1986, Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim, Mining Optimized Association Rules with Categorical and Numeric Attributes, Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff, The Illusion of Reality, Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa, Maintaining Data Privacy in Association Rule Mining, In Proceedings of the 28th International Conference on Very Large Databases, 2003, Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa, Local learning in probabilistic networks with hidden variables, In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995, Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362, Cambridge, MA, MIT Press, 1986.

[84] G. Sande, Automated cell suppression to preserve confidentiality of business statistics, In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe, An efficient algorithm for mining association rules in large databases, In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995, Morgan Kaufmann.

[86] J. Schlorer, Information loss in partitioned statistical databases, Comput. J., 26, 3, pp. 218-223, 1983, British Computer Society.

[87] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani, Statistical databases: characteristics, problems and some solutions, Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982, Morgan Kaufmann Publishers Inc.

[90] R. Sibson, SLINK: An optimally efficient algorithm for the single link cluster method, Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman, An information theoretic approach to rule induction from databases, IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992, IEEE Press.

[92] L. Sweeney, Achieving k-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002, World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal, Mining Generalized Association Rules, Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995, Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou, Examining Data Quality, Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998, ACM Press.

[95] M. Trottini, A Decision-Theoretic Approach to Data Disclosure Problems, Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini, Decision Models for Data Disclosure Limitation, Carnegie Mellon University, available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University, CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002, ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis, State-of-the-art in Privacy Preserving Data Mining, SIGMOD Record, 33(1), pp. 50-57, 2004, ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni, Association Rule Hiding, IEEE Transactions on Knowledge and Data Engineering, 2003, IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe, Intrinsic classification by MML: The Snob program, In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, World Scientific Publishing Co., 1994.

[102] G. J. Walters, Philosophical Dimensions of Privacy: An Anthology, Cambridge University Press, 1984.

[103] G. J. Walters, Human Rights in an Information Age: A Philosophical Analysis, chapter 5, University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang, Anchoring Data Quality Dimensions in Ontological Foundations, Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996, ACM Press.

[105] R. Y. Wang and D. M. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers, Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal, Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Vol. 155, Springer-Verlag, New York.

[107] N. Ye and X. Li, A Scalable Clustering Technique for Intrusion Signature Recognition, 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001, IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li, New algorithms for fast discovery of association rules, In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997, AAAI Press.

European Commission, EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm. EUR – Scientific and Technical Research series – ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 27: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

35 CONSIDERATIONS ON CODMINE ASSOCIATION RULE ALGORITHMS25

INPUT the database D a set L of large itemsetsthe set Lh of large itemsets to hide MST and SMOUTPUT the database D modified by the fuzzificationof large itemsets in Lh

Begin

1 Sort Lh in descending order of size andminimum support of the large itemsets

Foreach Z in Lh

2 TZ = t isin D | t supports Z3 Sort the transaction in TZ in

ascending order of transaction sizeRepeat until (minsup(r) lt MST minus SM)

4 Place a mark for the item with thelargest minimum support of Z in thenext transaction in TZ

5 Update the supports of the affecteditemsets

6 Update the database D

End

Figure 34 Algorithm 6 for Rule Hiding by Support Reduction

35 Considerations on Codmine Association RuleAlgorithms

Based on results obtained by the performed tests (See Appendix A for moredetails) it is possible to make some considerations on the presented algorithmsAll these algorithms are able in any case (with the exclusion of the first al-gorithm) to hide the target sensitive information Moreover the performanceresults are acceptable The question is then ldquoWhy there is the necessity toimprove these algorithms if they all work wellrdquo There exist some coarse para-meters allowing us to measure the impact of the algorithms on the DQ of thereleased database Just to understand here what is the problem we say thatduring the tests we discovered in the sanitized database a considerable numberof new rules (artifactual rules) not present in the original database Moreoverwe discovered that some of the not sensitive rules contained in the original data-base after the sanitization were erased (Ghost rules) This behavior in a criticaldatabase could have a relevant impact For example as usual let us considerthe Health Database The introduction of artifactual rules may deceive a doctorthat on the basis of these rules can choose a wrong therapy for a patient In

26 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the source database D a set Rh

of rules to hide MST MCT and SMOUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 Tr=t in D | t fully supports r2 for each t in Tr count the number of items in t3 sort the transactions in Tr in ascending order

of the number of items supportedRepeat until (minsup(r) lt MST minus SM)

or (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin Tr

5 Choose the item j in rr with thehighest minsup

6 Place a mark for the place of j in t7 Recompute the minsup(r)8 Recompute the minconf(r)9 Recompute the minconf of other

affected rules10 Remove t from Tr

11 Remove r from Rh

End

INPUT the source database D a set Rh

of rules to hide MCT and SM OUTPUT the database D transformed so thatthe rules in Rh cannot be mined

Begin

Foreach r in Rh do

1 T prime

lr= t in D | t partially supports lr

and t does not fully support rr2 for each t in T prime

lrcount the number of items of lr in t

3 sort the transactions in T prime

lrin descending order of

the number of items of lr supportedRepeat until (minconf(r) lt MCT minus SM)

4 Choose the first transaction t isin T prime

lr5 Place a mark in t for the items in lr

that are not supported by t6 Recompute the maxsup(lr)7 Recompute the minconf(r)8 Recompute the minconf of other affected rules9 Remove t from T prime

lr10 Remove r from Rh

End

Figure 35 Algorithms 7 and 8 for Rule Hiding by Confidence Decrease

the same way if some other rules are erased a doctor could have a not completepicture of the patient health generating then some unpleasant situation Thisbehavior is due to the fact that the algorithms presented act in the same waywith any type of categorical database without knowing nothing about the useof the database the relationships between data etc More specifically we notethat the key point of this uncorrect behavior is in the selection of the item tobe modified In all these algorithm in fact the items are selected randomlywithout considering their relevance and their impact on the database We thenintroduce an approach to select the proper set of items minimizing the impacton the database

36 The Data Quality Strategy (First Part)

As explained before we want to develop a family of PPDM algorithms whichtake into account during the sanitization phase aspects related to the particu-lar database we want to modify Moreover we want such a PPDM family drivenby the DQ assurance In the context of Association Rules PPDM algorithm

36 THE DATA QUALITY STRATEGY (FIRST PART) 27

and more in detail in the family of Perturbation Algorithms described in theprevious section we note that the point in which it is possible to act in orderto preserve the quality of the database is the selection of the Item In otherwords the problem we want to solve is

How we can select the proper item to be modify in order to limit the impact ofthis modification on the quality of the database

The DQ structure we introduced previously can help us in achieving thisgoal The general idea we suggest here it is relatively intuitive In fact a tar-get DMG contains all the attributes involved in the construction of a relevantaggregate information Moreover a weight is associated with each attribute de-noting its accuracy and completeness Every constraint involving this attributeis associated to a weight representing the relevance of its violationFinally every AIS has its set of weights related to the three DQ measures andit is possible to measure the global level of DQ guaranteed after the sanitiza-tion We are then able to evaluate the effect of an attribute modification onthe Global DQ of a target database Assuming that we have available the Assetschema related to a target database our Data Hiding strategy can be describedas follows

1 A sensitive rule s is identified

2 The items involved in s are identified

3 By inspecting inside the Asset schema the DQ damage is computed sim-ulating the modification of these items

4 An item ranking is built considering the results of the previous operation

5 The item with a lower impact is selected

6 The sanitization is executed by modifying the chosen item

The proposed strategy can be improved In fact to compute the effect of themodifications in case of very complex information could be onerous in term ofperformancesWe note that in a target association rule a particular type of items may existthat is not contained in any DMG of the database Asset We call this type ofItems Zero Impact Items The modification of these particular items has noeffect on the DQ of the database because it was in the Asset description phasejudged not relevant for the aggregate information considered The improvementof our strategy is then the following before starting the Asset inspection phasewe search for the existence of Zero-Impact Items related to the target rulesIf they exist we do not perform the Asset inspection and we use one of theseZero-Impact items in order to hide the target rule

28 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

37 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms the database sanitization (ie the operationby which a sensitive rule is hidden) it is obtained by the distortion of some valuescontained in the transactions supporting the rule (partially or fully dependingfrom the strategy adopted) to be hidden In what we propose assuming towork with categorical databases like the classical Market Basket database 1the distortion strategy we adopt is similar to the one presented in the Codminealgorithms Once a good item is identified a certain number of transactionssupporting the rule are modified changing the item value from 1 to 0 until theconfidence of the rule is under a certain threshold Figure 36 shows the flowdiagram describing the algorithm

As it is possible to see the first operation required is the identificationof the rule to hide In order to do this it is possible to use a well knownalgorithm named APRIORI algorithm [5] that given a database extract allthe rules contained that have a confidence over a Threshold level 2 Detail onthis algorithm are described in the Appendix B The output of this phase is aset of rules From this set a sensitive rule will be identified (in general thisdepends from its meaning in a particular context but it is not in the scope ofour work to define when a rule must be considered sensitive) The next phaseimplies the determination of the Zero Impact Items (if they exist) associatedwit the target rule

If the Zero Impact Itemset is empty it means no item exists contained inthe rule that can be assumed to be non relevant for the DQ of the database Inthis case it is necessary to simulate what happens to the DQ of the databasewhen a non zero impact item is modified The strategy we can adopt in thiscase is the following we calculate the support and the confidence of the targetrule before the sanitization (we can use the output obtained by the AprioriAlgorithm)Then we compute how many steps are needed in order to downgradethe confidence of the rule under the minimum threshold (as it is possible to seefrom the algorithm code there is little difference between the case in whichthe item selected is part of the antecedent or part of the consequent of therule) Once we know how many transactions we need to modify it is easyinspecting the Asset to calculate the impact on the DQ in terms of accuracy andconsistency Note that in the distortion algorithm evaluating the completenessis without meaning because the data are not erased but modified The ItemImpact Rank Algorithm is reported in Figure 38The Sanitization algorithm works as follows

1 It selects all the transactions fully supporting the rule to be hidden

2 It computes the Zero Impact set and if there exist such type of itemsit randomly selects one of these items If this set is empty it calculates

1A market Basket Database is a database in which each transaction represents the productsacquired by the customers

2This Threshold level represents the confidence over which a rule can be identified asrelevant and sure

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

bull The Cost of the Impact Item Set computation is less immediate to formu-late However the computation of the sanitization step is linear (O(bSup(Rh)c))Moreover the constraint exploration in the worst case has a complex-ity in O(NC) where NC is the total number of constraints containedin the Asset This operation is repeated for each item in the rule Rh

and then it takes |Items|O(NC) Finally in order to sort the Item-set we can use the MergeSort algorithm and then the complexity is|itemset|O(lg(|itemset|))

bull The sanitization has a complexity in O(N) where N is the number oftransaction modified in order to hide the rule

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hidden the target database DOUTPUT the Sanitized Database

1Begin

2 Zitem=DQDB Zero Impact(AssetRh)3 if (Zitem 6= empty) then item=random sel(Zitem)4 else5 6 Impact set=Items Impact rank(AssetRhSup(Rh)Sup(ant(Rh)) Threshold)7 item=best(Impact set)8 9 Tr=t isin D|tfully support Rh10While (RhConf gtThreshold) do11sanitize(Tritem)12End

Figure 39 The Distortion Based Sanitization Algorithm

38 Data Quality Based Blocking Algorithm

In the Blocking based algorithms the idea is to substitute the value of an itemsupporting the rule we want to hide with a meaningless symbol In this way weintroduce an uncertainty related to the question ldquoHow to interpret this valuerdquoThe final effect is then the same of the Distortion Based Algorithm The strategywe propose here as showed in Figure 310 is very similar to the one presentedin the previous section the only difference is how to compute the Impact on theData Quality For this reason the Zero Impact Algorithm is exactly the samepresented before for the DQDB Algorithm However some adjustments arenecessary in order to implement the Impact Item Rank Algorithm The majorsmodifications are related to the Number of Step Evaluation (that is of coursedifferent) and the Data Quality evaluation In fact in this case it does not make

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact13(Completeness-13

Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value to rdquo13

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 310 Flow Chart of BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for theDBDQ Algorithm The only relevant difference is the insertion of the ldquordquo valueinstead of the 0 value From the complexity point of view the complexity is thesame of the complexity of the previous algorithm

39 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on theDQ of the database is reduced However we note an interesting phenomenonIn what follows we give a description of this phenomena an interpretation anda first solution

Consider a DataBase D to be sanitized D contains a set of four sensitiverules In order to protect these rules we normally execute the following steps

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)

39 THE DATA QUALITY STRATEGY (SECOND PART) 33

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Nstepant=countstep ant(confaThresholds (MinMax))5 Nsteppost=countstep post(confaThresholds (MinMax))6 For each item in Rh do7 8 if (item isin ant(Rh)) then N=Nstepant9 else N=Nsteppost10 For each AIS isin Asset do11 12 node=recover item(AISitem)

24 Complet Cost=nodeComplet Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

+(Complet Cost lowastAISComplet weight)++(Constr Cost lowastAISConstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 311 Item Impact Rank Algorithm for the Blocking Based Algorithm

4 We reapply the process for all the rules contained in the set of sensitiverules

What we observed in a large number of tests is that even if after everysanitization the DQ related to a single rule is preserved (or at least the damageis mitigated) at the end of the whole sanitization (every rule hidden) thedatabase had a lower level of DQ that the one we expected

Reflecting on the motivation of this behavior we argue that the sanitizationof a certain rule Ra may damage the DQ related to a previously sanitized ruleRp having for example some items in common

For this reason here we propose an improvement to our strategy to address(but not to completely solve) this problem It is important to notice that thisis a common problem for all the PPDM Algorithm and it is not related only tothe algorithm we proposed in the previous sections

The new strategy is based on the following intuition The Global DQ degra-dation is due to the fact that in some sanitization items are modified that wereinvolved in other already sanitized rules If we try to limit the number of dif-ferent items modified during the whole process (ie during the sanitization ofthe whole rules set) we obtain a better level of DQ An approach to limit thenumber of modified items is to identify the items supporting the biggest numberof sensitive rules as the most suitable to be modified and then performing in aburst the sanitization of all the rules The strategy we adopt can be summarizedas follows

1 We classify the items supporting the rules contained in the sensitive setanalyzing how many rules simultaneously contain the same item (we count

34 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

the occurrences of the items in the different rules)

2 We identify given the sensitive set the Global Set of Zero Impact Items

3 We order the Zero Impact Set by considering the previously counted oc-currences

4 The sanitization starts by considering the Zero Impact Items that supportthe maximum number of Rules

5 Simultaneously all the rules supported by the chosen item are sanitizedand removed from the sensitive set when they result hidden

6 If the Zero Impact Set is not sufficient to hide every sensitive rule for theremaining items their DQ Impact is computed

7 From the resulting classification if several items exist with the same im-pact the one supporting the highest number of rules is chosen

8 The process continues until all the rules are hidden

From a methodological point of view we can divide the complete strategy weproposed into three main blocks

bull Pre Hiding phase

bull Multi Zero Impact Hiding

bull Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before For thisreason we present in the remainder of this report the algorithms giving at theend a more general algorithm

310 PRE-HIDING PHASE 35

310 Pre-Hiding Phase

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the items supporting the rules sorted by occurrences number

1Begin

2 Rsup=Rs3 while (Rsup 6= empty) do4 5 select r from Rsup6 for (i = 0 i + + i le transactionmaxsize) do7 if (item[i] isin r) then8 9 item[i]count++10 item[i]list=item[i]listcupr11 12 Rsup=Rsupminus r1314sort(item)15Return(item) End

Figure 312 The Item Occurrences classification algorithm

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the Zero Impact Items

1Begin

2 Rsup=Rs3 Isup=list of all the items4 Zitem = empty5 while (Rsup 6= empty) and (Isup 6= empty) do6 7 select r from Rsup8 for each (item isin r) do9 10 if (item isin Isup) then11 if (item isin Asset) then Zitem = Zitem + item12 Isup = Isupminus item13 14 Rsup = Rsupminus r1516return(Zitem)End

Figure 313 The Multi Rule Zero Impact Algorithm

36 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

This phase executes all the operations not specifically related to the hidingoperations More in details in this phase the items are classified according tothe number of supported rules Moreover the Zero Impact Items are classifiedaccording to the same criteria Figure 312 reports the algorithm used to discoverthe occurrences of the items This algorithm takes as input the set of sensitiverules and the list of all the different items defined for the target database Forevery rule contained in the set we check if the rule is supported by every itemcontained in the Item list If this is the case a counter associated with the itemis incremented and an identifier associated with the rule is inserted into the listof the rules supported by a certain item This information is used in one of thesubsequent steps in order to optimize the computation Once a rule has beenchecked with every item in the list it is removed from the sensitive set Froma computational point of view it is relevant to note that the entire process isexecuted a number of times equal to

|sensitive set| lowast |Item List|

Figure 313 reports the algorithm used to identify the Zero Impact Items forthe sensitive Rules set It is quite similar to the one presented in the previoussection the difference is just that it searches in a burst the Zero-Impact itemsfor all the sensitive rules

311 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two importantinformation

bull The items supporting more than a rule and more in details the exactnumber of rules supported by these Items

bull The items supporting a rule isin sensitive rules set that have the impor-tant property of being without impact on the DQ of the database

Using these relevant information we are then able to start the sanitizationprocess for those rules supported by Zero Impact Items The strategy we adoptis thus as follows

1 We extract from the list of Items in an ordered manner the items alreadyidentified as Zero Impact (the Izm set)

2 For each item contained in the new set we build the set of the sensitiverules supported (Rss) and the set of transactions supporting one of therules contained in Rss (Tss)

3 We then select a transaction in Tss and we perform the sanitization onthis transaction (by using either blocking or distortion)

311 MULTI RULES ZERO IMPACT HIDING 37

4 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

5 These operations are repeated for every item contained in the Izm set

The result of this approach can be

bull A sanitized Data Base with an optimal DQ this is the case inwhich all sensitive rules are supported by Zero Impact Items

bull A partially Sanitized Data Base with an optimal DQ this is thecase in which the sensitive set has a subset of rules not supported byZero Impact Items

Obviously in a real case the first result rarely happens For this reason in thiswe present the last step of our strategy that allows us to obtain in any case acompletely sanitized database

38 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the list of the database items (with the Rule support)the list of Zero Impact Items the Database DOUTPUT A partially Sanitized Database PS

1Begin

2 Izm=item isin Itemlist|item isin Zitem3 for (i = 0 i + + i = |Izm|) do4 5 Rss=r isin Rs|Izm[i] isin r6 Tss=t isin D|tfullysupportarule isin Rss7 while(Tss 6= empty)and(Rss 6= empty) do8 9 select a transaction from Tss10 Tss = Tssminus t11 Sanitize(tIzm[i])12 Recompute Sup Conf(Rss)13 forallr isin Rss|(rsup lt minsup)Or(rconf lt minconf) do14 15 Rss=Rss-r16 Rs=Rs-r17 End

Figure 314 The Multi Rules Zero Impact Hiding Algorithm

312 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a databasehaving a sensitive Rules set not supported by Zero Impact Items In this casewe have two possibilities to use the Algorithm presented in the data qualitysection that has the advantage of being very efficient or if we prefer to maintainas much as possible a good DQ to extend the concept of Multi-Rule approachto this particular case The strategy we propose here can be summarized asfollows

1 We extract from the ordered list of items only the items supporting therules contained in the sensitive Rules set Each item contains the list ofthe supported rules

2 For each item and for each rule contained in the list of the supportedrules the estimated DQ impact is computedThe Maximum Data QualityImpact estimated is associated to every item

3 The Items are sorted by Max Impact and number of supported rules

4 For each item contained in the new ordered set we build the set of thesensitive rules supported (Rss) and the set of transactions supporting oneof the rules contained in Rss (Tss)

312 MULTI RULE LOW IMPACT ALGORITHM 39

5 We then select a transaction in Tss and we perform the sanitization onthis transaction (blocking or distortion)

6 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

7 These operations are repeated for every item contained in the Izm setuntil all rules are sanitized

The resulting database will be Completely Sanitized and the damagerelated to the data quality will be mitigated Figure 315 reports the algorithmAs it is possible to note in the algorithm we do not specify if the sanitizationis based on blocking or distortion We consider this operation as a black boxThe proposed strategy is completely independent from the type of sanitizationused In fact it is possible to substitute the code for blocking algorithm withthe code for distortion algorithm without compromising the DQ properties wewant to preserve with our algorithm Figure 316 reports the global algorithmformalizing our strategy More in details taking as input to this algorithmthe Asset description the group of rules to be sanitized the database and thethresholds under which consider hidden a rule the algorithm computes theZero-Impact set and the Item-Occurrences rank Then it will try to hide therules modifying the Zero Impact Items and finally if there exist other rules tobe hidden the Low Impact Algorithm is invoked

INPUT: the Asset Schema A associated to the target database, the sensitive Rules set Rs to be hidden, the Thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule supported by item do
6      {
7        N = compute_step(item, thresholds)
8        for each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         item.rulecost = item.rulecost + ComputeCost(node, N)
         }
12     }
13     item.maxcost = maxcost(item)
14   }
15   Sort Iml by max cost and rules supported
16   While (Rs ≠ ∅) do
17   {
18     Rss = {r ∈ Rs | Iml[1] ∈ r}
19     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
20     while (Tss ≠ ∅) and (Rss ≠ ∅) do
21     {
22       select a transaction t from Tss
23       Tss = Tss - t
24       Sanitize(t, Iml[1])
25       Recompute_Sup_Conf(Rss)
26       forall r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
27       {
28         Rss = Rss - r
29         Rs = Rs - r
30       }
31     }
32     Iml = Iml - Iml[1]
33   }
34 End

Figure 3.15: Low Data Quality Impact Algorithm

INPUT: The sensitive Rules set Rs, the Database D, the Asset, the thresholds
OUTPUT: A Sanitized Database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
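Read as code, Figure 3.16 amounts to a short driver. The Python rendering below is hypothetical: the helpers mirror the names used in the figure and are assumed to be implemented along the lines of the previous sections.

    # Hypothetical driver corresponding to Figure 3.16.
    def sanitize_database(rs, db, asset, thresholds, itemlist):
        items = item_occ_class(rs, itemlist)          # rank items by occurrences
        zitems = multi_rule_zero_imp(rs, itemlist)    # find the Zero Impact Items
        if zitems:                                    # hide at zero DQ cost first
            mr_zero_impact_hiding(db, rs, items, zitems)
        if rs:                                        # remaining rules: low-impact pass
            mr_low_impact_hiding(db, rs, items, asset, thresholds)
        return db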

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated DQ-based PPDM algorithm.

Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8(6) (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6(1) (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75(370) (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5(3), pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38(3) (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, vol. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11(7), pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system: combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for the compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, vol. 4, n. 4, August 1992, pp. 301-316. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 - 56 pp. - 21 x 29.7 cm. EUR - Scientific and Technical Research series - ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.


48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 29: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy


and, more in detail, in the family of Perturbation Algorithms described in the previous section, we note that the point at which it is possible to act in order to preserve the quality of the database is the selection of the item. In other words, the problem we want to solve is:

How can we select the proper item to modify in order to limit the impact of this modification on the quality of the database?

The DQ structure we introduced previously can help us in achieving this goal. The general idea we suggest here is relatively intuitive. In fact, a target DMG contains all the attributes involved in the construction of a relevant aggregate information. Moreover, a weight is associated with each attribute, denoting its accuracy and completeness. Every constraint involving this attribute is associated with a weight representing the relevance of its violation. Finally, every AIS has its set of weights related to the three DQ measures, and it is possible to measure the global level of DQ guaranteed after the sanitization. We are then able to evaluate the effect of an attribute modification on the global DQ of a target database. Assuming that we have available the Asset schema related to a target database, our data hiding strategy can be described as follows:

1. A sensitive rule s is identified.

2. The items involved in s are identified.

3. By inspecting the Asset schema, the DQ damage is computed by simulating the modification of these items.

4. An item ranking is built considering the results of the previous operation.

5. The item with the lowest impact is selected.

6. The sanitization is executed by modifying the chosen item.

The proposed strategy can be improved. In fact, computing the effect of the modifications can be onerous in terms of performance when the information is very complex. We note that in a target association rule there may exist a particular type of item that is not contained in any DMG of the database Asset. We call such items Zero Impact Items. The modification of these particular items has no effect on the DQ of the database, because in the Asset description phase they were judged not relevant for the aggregate information considered. The improvement of our strategy is then the following: before starting the Asset inspection phase, we search for Zero Impact Items related to the target rules. If they exist, we do not perform the Asset inspection, and we use one of these Zero Impact Items in order to hide the target rule.
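To make the selection concrete, what follows is a minimal sketch, in Python, of the Zero Impact test and of the impact-based ranking. The data structures (an asset dictionary mapping each AIS to per-attribute accuracy weights and DQ-measure weights) and all the names used here are illustrative assumptions, not part of the formal model:

# Hypothetical Asset encoding: for each AIS, the attributes it involves
# (attribute -> accuracy weight) and the relevance of the DQ measures.
asset = {
    "customer_profile": {
        "attributes": {"age": 0.9, "zip": 0.4},
        "dq_weights": {"accuracy": 0.7, "completeness": 0.2, "consistency": 0.1},
    },
}

def is_zero_impact(item):
    # An item is Zero Impact if no AIS of the Asset involves it.
    return all(item not in ais["attributes"] for ais in asset.values())

def dq_impact(item):
    # Rough global-DQ impact of modifying one occurrence of `item`.
    impact = 0.0
    for ais in asset.values():
        weight = ais["attributes"].get(item)
        if weight is not None:
            impact += weight * ais["dq_weights"]["accuracy"]
    return impact

# Rank the items of a sensitive rule, preferring Zero Impact ones.
rule_items = ["age", "zip", "promo_flag"]
zero = [i for i in rule_items if is_zero_impact(i)]
best = zero[0] if zero else min(rule_items, key=dq_impact)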


3.7 Data Quality Distortion Based Algorithm

In the Distortion Based Algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by the distortion of some values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In what we propose, assuming we work with categorical databases like the classical Market Basket database (a database in which each transaction represents the products acquired by a customer), the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a good item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule falls under a certain threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As it is possible to see, the first operation required is the identification of the rule to hide. In order to do this, it is possible to use the well-known APRIORI algorithm [5], which, given a database, extracts all the contained rules that have a confidence over a threshold level (this threshold represents the confidence over which a rule can be identified as relevant and reliable). Details on this algorithm are described in Appendix B. The output of this phase is a set of rules. From this set a sensitive rule will be identified (in general, this depends on its meaning in a particular context, but it is not in the scope of our work to define when a rule must be considered sensitive). The next phase implies the determination of the Zero Impact Items (if they exist) associated with the target rule.

If the Zero Impact Itemset is empty, it means that no item contained in the rule can be assumed to be non-relevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can use the output obtained by the APRIORI algorithm). Then we compute how many steps are needed in order to downgrade the confidence of the rule under the minimum threshold (as it is possible to see from the algorithm code, there is little difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating the completeness is meaningless, because the data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8; a minimal sketch of the step computation is given below.
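The sketch below is ours (the function name and signature are assumptions, not code from the report); it assumes the threshold is reachable and, when modifying the antecedent, that sup(ant(Rh)) > sup(Rh):

def steps_to_hide(sup_rule, sup_ant, th, modify_antecedent):
    # Number of transactions to modify before conf(Rh) drops under th.
    steps = 0
    conf = sup_rule / sup_ant
    while conf > th:
        steps += 1
        if modify_antecedent:
            # removing an antecedent item lowers both supports
            conf = (sup_rule - steps) / (sup_ant - steps)
        else:
            # removing a consequent item lowers only the rule support
            conf = (sup_rule - steps) / sup_ant
    return steps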

The Sanitization algorithm then works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set; if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

[Figure 3.6: Flow chart of the DQDB Algorithm. The flow is the following: identify a sensitive rule; compute the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (accuracy, consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert its item value from 1 to 0, until the rule is hidden.]

3. Using the selected item, it modifies the selected transactions until the rule is hidden.

The final algorithm is reported in Figure 3.9. Given the algorithm, for the sake of completeness, we try to give an idea about its complexity:

• The search for all zero-impact items, assuming the items to be ordered according to a lexicographic order, can be executed in N·O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden.
OUTPUT: the set (possibly empty) Zitem of zero-impact items discovered.

1  Begin
2    Zitem = ∅
3    Rsup = Rh
4    Isup = list of the items contained in Rsup
5    while (Isup ≠ ∅)
6    {
7      res = 0
8      select an item from Isup
9      res = Search(Asset, item)
10     if (res == 0) then Zitem = Zitem + item
11     Isup = Isup − item
12   }
13   return(Zitem)
14 End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by quality impact.

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Confp = Confa
5    Nstep_if_ant = 0
6    Nstep_if_post = 0
7    while (Confa > Th) do
8    {
9      Nstep_if_ant++
10     Confa = (Sup(Rh) − Nstep_if_ant) / (Sup(ant(Rh)) − Nstep_if_ant)
11   }
12   while (Confp > Th) do
13   {
14     Nstep_if_post++
15     Confp = (Sup(Rh) − Nstep_if_post) / Sup(ant(Rh))
16   }
17   for each item in Rh do
18   {
19     if (item ∈ ant(Rh)) then N = Nstep_if_ant
20     else N = Nstep_if_post
21     for each AIS ∈ Asset do
22     {
23       node = recover_item(AIS, item)
24       Accur_Cost = node.accuracy_weight * N
25       Constr_Cost = Constr_surf(N, node)
26       item.impact = item.impact + (Accur_Cost * AIS.Accur_weight) + (Constr_Cost * AIS.Constr_weight)
27     }
28   }
29   sort_by_impact(Items)
30   return(Items)
   End

Figure 3.8: Item Impact Rank Algorithm for the Distortion Based Algorithm (recall that ant(Rh) denotes the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization step is linear (O(⌊Sup(Rh)⌋)). Moreover, the constraint exploration in the worst case has a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and then it takes |Items|·O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, so the complexity is |itemset|·O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; the target database D.
OUTPUT: the sanitized database.

1  Begin
2    Zitem = DQDB_Zero_Impact(Asset, Rh)
3    if (Zitem ≠ ∅) then item = random_sel(Zitem)
4    else
5    {
6      Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
7      item = best(Impact_set)
8    }
9    Tr = {t ∈ D | t fully supports Rh}
10   while (Rh.Conf > Threshold) do
11     sanitize(Tr, item)
12 End

Figure 3.9: The Distortion Based Sanitization Algorithm
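For concreteness, the following is a minimal executable sketch of the distortion loop, under simplifying assumptions (transactions are dictionaries mapping items to 0/1 values; the helper names are ours, not from the report):

def supports(t, items):
    return all(t.get(i, 0) == 1 for i in items)

def confidence(db, ant, cons):
    sup_ant = sum(supports(t, ant) for t in db)
    sup_rule = sum(supports(t, ant + cons) for t in db)
    return sup_rule / sup_ant if sup_ant else 0.0

def distortion_sanitize(db, ant, cons, item, threshold):
    # Modify transactions fully supporting the rule, one at a time,
    # setting the chosen item from 1 to 0 until the rule is hidden.
    for t in db:
        if confidence(db, ant, cons) <= threshold:
            break
        if supports(t, ant + cons):
            t[item] = 0
    return db

db = [{"a": 1, "c": 1}, {"a": 1, "c": 1}, {"a": 1, "c": 0}]
distortion_sanitize(db, ant=["a"], cons=["c"], item="c", threshold=0.5)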

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how to compute the impact on the Data Quality. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps Evaluation (which is, of course, different) and to the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Figure 3.10: Flow chart of the BBDB Algorithm. The flow mirrors the one of Figure 3.6: identify a sensitive rule; compute the set of zero-impact items; if the set is not empty, select an item from it; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (completeness, consistency), rank the items by quality impact and select the best one; then repeatedly select a transaction fully supporting the sensitive rule and convert its item value to "?", until the rule is hidden.]

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
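Under the same assumptions as the distortion sketch above (and reusing its supports and confidence helpers), the blocking variant differs only in the value written into the transaction:

def blocking_sanitize(db, ant, cons, item, threshold):
    # Same loop as distortion_sanitize, but the item is replaced by the
    # meaningless symbol "?" (which no longer counts as the value 1).
    for t in db:
        if confidence(db, ant, cons) <= threshold:
            break
        if supports(t, ant + cons):
            t[item] = "?"
    return db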

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).

4. We reapply the process for all the rules contained in the set of sensitive rules.

INPUT: the Asset schema A associated to the target database; the rule Rh to be hidden; sup(Rh); sup(ant(Rh)); the threshold Th.
OUTPUT: the set Qitem ordered by quality impact.

1  Begin
2    Qitem = ∅
3    Confa = Sup(Rh) / Sup(ant(Rh))
4    Nstep_ant = count_step_ant(Confa, Thresholds(Min, Max))
5    Nstep_post = count_step_post(Confa, Thresholds(Min, Max))
6    for each item in Rh do
7    {
8      if (item ∈ ant(Rh)) then N = Nstep_ant
9      else N = Nstep_post
10     for each AIS ∈ Asset do
11     {
12       node = recover_item(AIS, item)
13       Complet_Cost = node.Complet_Weight * N
14       Constr_Cost = Constr_surf(N, node)
15       item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
16     }
17   }
18   sort_by_impact(Items)
19   return(Items)
   End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy to address (but not to completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition. The global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. Given the sensitive set, we identify the global set of Zero Impact Items.

3. We order the Zero Impact set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden.

6. If the Zero Impact set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.
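A toy illustration of steps 1-4, reusing the is_zero_impact helper sketched in Section 3.6 (the rule set is invented for the example):

rules = {"R1": {"a", "b", "c"}, "R2": {"a", "d"}, "R3": {"b", "d"}}

occurrences = {}
for rule_items in rules.values():
    for item in rule_items:
        occurrences[item] = occurrences.get(item, 0) + 1

zero_impact = [i for i in occurrences if is_zero_impact(i)]
order = sorted(zero_impact, key=lambda i: -occurrences[i])
# items supporting the largest number of rules are sanitized first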

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Rule Zero Impact Hiding

• Multi Rule Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the individual algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rule set Rs; the list of the database items.
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences.

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction.maxsize; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ r
11       }
12     Rsup = Rsup − r
13   }
14   sort(item)
15   return(item)
   End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rule set Rs; the list of the database items.
OUTPUT: the list of the Zero Impact Items.

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11       {
12         if (item ∉ Asset) then Zitem = Zitem + item
13         Isup = Isup − item
14       }
15     }
16     Rsup = Rsup − r
17   }
18   return(Zitem)
   End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to |sensitive set| · |Item List|.

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rule set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
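A compact sketch of the whole pre-hiding phase (Figures 3.12 and 3.13), under the same illustrative assumptions as the previous sketches (rules as item sets; the is_zero_impact helper from Section 3.6):

from collections import defaultdict

def pre_hiding(rules):
    count = defaultdict(int)       # item -> number of supporting rules
    supported = defaultdict(list)  # item -> rules containing it
    for name, items in rules.items():
        for item in items:
            count[item] += 1
            supported[item].append(name)
    ranked = sorted(count, key=lambda i: -count[i])
    zero_impact = [i for i in ranked if is_zero_impact(i)]
    return ranked, supported, zero_impact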

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• the items supporting a rule of the sensitive rule set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process with those rules supported by Zero Impact Items. The strategy we adopt is thus as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

4. We recompute the support and the confidence value for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• a sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• a partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rule set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D.
OUTPUT: a partially sanitized database PS.

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i < |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − t
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
14       {
15         Rss = Rss − r
16         Rs = Rs − r
17       }
18     }
19   }
20 End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive rule set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rule set. Each item carries the list of the supported rules.

2. For each item, and for each rule contained in the list of the supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and by number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion; we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and, finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset schema A associated to the target database; the sensitive rule set Rs to be hidden; the thresholds Th; the list of items.
OUTPUT: the sanitized database SDB.

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        for each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         rule.cost = rule.cost + ComputeCost(node, N)
12       }
13     }
14     item.maxcost = maxcost(item)
15   }
16   sort Iml by max cost and rules supported
17   while (Rs ≠ ∅) do
18   {
19     Rss = {r ∈ Rs | Iml[1] ∈ r}
20     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
21     while (Tss ≠ ∅) and (Rss ≠ ∅) do
22     {
23       select a transaction t from Tss
24       Tss = Tss − t
25       Sanitize(t, Iml[1])
26       Recompute_Sup_Conf(Rss)
27       for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
28       {
29         Rss = Rss − r
30         Rs = Rs − r
31       }
32     }
33     Iml = Iml − Iml[1]
34   }
35 End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rule set Rs; the database D; the Asset; the thresholds.
OUTPUT: a sanitized database PS.

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
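Putting the pieces together, an end-to-end sketch of the general strategy, reusing the hypothetical helpers defined in the previous sketches (pre_hiding, distortion_sanitize, confidence); the rule encoding is, again, an assumption made for illustration:

def sanitize_all(db, rules, ant_cons, threshold):
    # rules: name -> item set; ant_cons: name -> (antecedent, consequent)
    ranked, supported, zero_impact = pre_hiding(rules)
    ordering = zero_impact + [i for i in ranked if i not in zero_impact]
    for item in ordering:
        for name in list(supported[item]):
            if name not in rules:
                continue  # already hidden in an earlier burst
            ant, cons = ant_cons[name]
            distortion_sanitize(db, ant, cons, item, threshold)
            if confidence(db, ant, cons) <= threshold:
                del rules[name]
        if not rules:
            break
    return db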

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent the DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, the DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated data quality based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Vol. 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, Volume 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz ed., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532, Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B. V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423, 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission, EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen. Title: Privacy Preserving Data Mining, a Data Quality Approach. Authors: Igor Nai Fovino and Marcelo Masera. Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm. EUR – Scientific and Technical Research series – ISSN 1018-5593. Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 30: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy


3.7 Data Quality Distortion Based Algorithm

In distortion-based algorithms, the database sanitization (i.e., the operation by which a sensitive rule is hidden) is obtained by distorting some of the values contained in the transactions supporting the rule to be hidden (partially or fully, depending on the strategy adopted). In our proposal, assuming categorical databases like the classical Market Basket database¹, the distortion strategy we adopt is similar to the one presented in the Codmine algorithms: once a suitable item is identified, a certain number of transactions supporting the rule are modified, changing the item value from 1 to 0, until the confidence of the rule falls under a given threshold. Figure 3.6 shows the flow diagram describing the algorithm.

As the figure shows, the first operation required is the identification of the rule to hide. To this end it is possible to use the well-known APRIORI algorithm [5], which, given a database, extracts all the contained rules having a confidence above a threshold level². Details on this algorithm are given in Appendix B. The output of this phase is a set of rules, from which a sensitive rule is identified (in general this depends on its meaning in a particular context, but defining when a rule must be considered sensitive is outside the scope of our work). The next phase is the determination of the Zero Impact Items (if they exist) associated with the target rule.
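To make the quantities used throughout this chapter concrete: for a rule A ⇒ B over a market-basket database, sup(A ⇒ B) is the fraction of transactions containing A ∪ B, and conf(A ⇒ B) = sup(A ∪ B) / sup(A). The following minimal Python sketch (purely illustrative, with transactions encoded as sets of item labels; it is not the Codmine implementation) computes both:

    def support(db, itemset):
        # Fraction of transactions containing every item of 'itemset'.
        itemset = set(itemset)
        return sum(1 for t in db if itemset <= t) / len(db)

    def confidence(db, antecedent, consequent):
        # conf(A => B) = sup(A u B) / sup(A)
        return (support(db, set(antecedent) | set(consequent))
                / support(db, antecedent))

    db = [{"bread", "milk"}, {"bread", "butter"},
          {"bread", "milk", "butter"}, {"milk"}]
    print(confidence(db, {"bread"}, {"milk"}))  # 0.66...: 2 of the 3 'bread' baskets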

If the Zero Impact Itemset is empty, no item contained in the rule can be assumed to be irrelevant for the DQ of the database. In this case it is necessary to simulate what happens to the DQ of the database when a non-zero-impact item is modified. The strategy we adopt in this case is the following: we calculate the support and the confidence of the target rule before the sanitization (we can reuse the output of the Apriori algorithm); we then compute how many steps are needed to push the confidence of the rule under the minimum threshold (as the algorithm code shows, there is a small difference between the case in which the selected item is part of the antecedent and the case in which it is part of the consequent of the rule). Once we know how many transactions we need to modify, it is easy, by inspecting the Asset, to calculate the impact on the DQ in terms of accuracy and consistency. Note that in the distortion algorithm evaluating completeness is meaningless, because data are not erased but modified. The Item Impact Rank Algorithm is reported in Figure 3.8; a small numeric sketch of the step computation is given after the list below. The sanitization algorithm works as follows:

1. It selects all the transactions fully supporting the rule to be hidden.

2. It computes the Zero Impact set; if such items exist, it randomly selects one of them. If this set is empty, it calculates the ordered set of Impact Items and selects the one with the lowest DQ effect.

¹A Market Basket database is a database in which each transaction represents the products acquired by a customer.

²This threshold level represents the confidence above which a rule can be identified as relevant and reliable.


[Figure 3.6 flow chart: identification of a sensitive rule; identification of the set of zero-impact items; if the set is not empty, select an item from the zero-impact set; otherwise, for each item in the selected rule, inspect the Asset, evaluate the DQ impact (Accuracy, Consistency), rank the items by quality impact and select the best item; then select a transaction fully supporting the sensitive rule and convert the item value from 1 to 0, repeating until the rule is hidden.]

Figure 3.6: Flow Chart of the DQDB Algorithm


3. Using the selected item, it modifies the selected transactions until the rule is hidden.
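As anticipated above, the number of transactions to modify can be computed with two short loops. The sketch below (illustrative names) mirrors the two cases handled by the Item Impact Rank Algorithm of Figure 3.8: distorting an antecedent item removes a transaction from both sup(Rh) and sup(ant(Rh)), while distorting a consequent item lowers sup(Rh) only:

    def steps_if_antecedent(sup_rh, sup_ant, th):
        # Each distortion of an antecedent item lowers both counts by one.
        n = 0
        while sup_ant - n > 0 and (sup_rh - n) / (sup_ant - n) > th:
            n += 1
        return n

    def steps_if_consequent(sup_rh, sup_ant, th):
        # Each distortion of a consequent item lowers sup(Rh) only.
        n = 0
        while (sup_rh - n) / sup_ant > th:
            n += 1
        return n

    print(steps_if_antecedent(40, 50, 0.6))  # 25
    print(steps_if_consequent(40, 50, 0.6))  # 10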

The final algorithm is reported in Figure 3.9. For the sake of completeness, we give an idea of its complexity:

• The search for all zero-impact items, assuming the items to be ordered lexicographically, can be executed in N * O(lg(M)), where N is the size of the rule and M is the number of different items contained in the Asset.


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden
OUTPUT: the set (possibly empty) Zitem of zero-impact items discovered

    Begin
      Zitem = ∅
      Rsup = Rh
      Isup = list of the items contained in Rsup
      While (Isup ≠ ∅) do
        select item from Isup
        res = Search(Asset, item)
        if (res == 0) then Zitem = Zitem + item
        Isup = Isup - item
      return(Zitem)
    End

Figure 3.7: Zero Impact Algorithm

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the threshold Th
OUTPUT: the set Qitem ordered by quality impact

    Begin
      Qitem = ∅
      Confa = Sup(Rh) / Sup(ant(Rh))
      Confp = Confa
      Nstep_if_ant = 0
      Nstep_if_post = 0
      While (Confa > Th) do
        Nstep_if_ant++
        Confa = (Sup(Rh) - Nstep_if_ant) / (Sup(ant(Rh)) - Nstep_if_ant)
      While (Confp > Th) do
        Nstep_if_post++
        Confp = (Sup(Rh) - Nstep_if_post) / Sup(ant(Rh))
      For each item in Rh do
        if (item ∈ ant(Rh)) then N = Nstep_if_ant
        else N = Nstep_if_post
        For each AIS ∈ Asset do
          node = recover_item(AIS, item)
          Accur_Cost = node.accuracy_weight * N
          Constr_Cost = Constr_surf(N, node)
          item.impact = item.impact + (Accur_Cost * AIS.Accur_weight)
                                    + (Constr_Cost * AIS.Constr_weight)
      sort_by_impact(Items)
      Return(Items)
    End

Figure 3.8: Item Impact Rank Algorithm for the Distortion-Based Algorithm (recall that ant(Rh) denotes the antecedent component of a rule)


• The cost of the Impact Item Set computation is less immediate to formulate. However, the computation of the sanitization steps is linear (O(Sup(Rh))). Moreover, the constraint exploration has, in the worst case, a complexity in O(NC), where NC is the total number of constraints contained in the Asset. This operation is repeated for each item in the rule Rh, and thus takes |Items| * O(NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, with complexity |itemset| * O(lg(|itemset|)).

• The sanitization has a complexity in O(N), where N is the number of transactions modified in order to hide the rule.
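Read operationally, the ranking of Figure 3.8 accumulates, for every item of the rule, an accuracy cost (the accuracy weight of the corresponding Asset node times the number of steps) plus a consistency cost derived from the constraints insisting on that node. A toy Python rendering, in which the Asset is flattened to a map item → (accuracy weight, number of constraints) and the per-AIS weights are folded in (all names are illustrative simplifications, not the Asset data structures of this report):

    def rank_items(rule_items, ant_items, n_ant, n_post, asset):
        # Lower total cost = better candidate for distortion.
        impact = {}
        for item in rule_items:
            n = n_ant if item in ant_items else n_post
            acc_weight, n_constraints = asset.get(item, (0.0, 0))
            impact[item] = acc_weight * n + n_constraints * n
        return sorted(rule_items, key=impact.get)

    asset = {"a": (0.9, 2), "b": (0.1, 0)}
    print(rank_items({"a", "b"}, {"a"}, n_ant=3, n_post=5, asset=asset))
    # ['b', 'a']: distorting b damages accuracy and consistency least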

INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; the target database D
OUTPUT: the sanitized database

    Begin
      Zitem = DQDB_Zero_Impact(Asset, Rh)
      if (Zitem ≠ ∅) then item = random_sel(Zitem)
      else
        Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
        item = best(Impact_set)
      Tr = {t ∈ D | t fully supports Rh}
      While (Rh.Conf > Threshold) do
        sanitize(Tr, item)
    End

Figure 3.9: The Distortion-Based Sanitization Algorithm
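Putting the pieces together, the following self-contained sketch emulates the sanitization loop of Figure 3.9 on the toy set-based encoding introduced earlier (the item is assumed to have been chosen already; all names are illustrative):

    def support_count(db, itemset):
        itemset = set(itemset)
        return sum(1 for t in db if itemset <= t)

    def confidence(db, ant, cons):
        return (support_count(db, set(ant) | set(cons))
                / support_count(db, ant))

    def distortion_sanitize(db, ant, cons, item, threshold):
        # Flip 'item' from 1 to 0 (i.e. drop it) in transactions fully
        # supporting ant => cons, until the confidence is low enough.
        supporting = [t for t in db if set(ant) | set(cons) <= t]
        for t in supporting:
            if confidence(db, ant, cons) <= threshold:
                break
            t.discard(item)  # the 1 -> 0 distortion
        return db

    db = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
    distortion_sanitize(db, {"a"}, {"b"}, "b", threshold=0.5)
    print(confidence(db, {"a"}, {"b"}))  # 0.5: the rule is now hidden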

3.8 Data Quality Based Blocking Algorithm

In blocking-based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is thus the same as that of the distortion-based algorithm. The strategy we propose here, shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how the impact on data quality is computed. For this reason the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm; however, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications concern the number-of-steps evaluation (which is of course different) and the data quality evaluation. In fact, in this case it makes no sense to evaluate the accuracy lack, because every time an item is changed the new value is meaningless. For this reason we here consider the pair (Completeness, Consistency) instead of the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Figure 3.10 flow chart: the flow is the same as in Figure 3.6 (identification of a sensitive rule; identification of the zero-impact set; selection of an item from it, or ranking of the rule items on the Asset by DQ impact evaluated as (Completeness, Consistency) when the set is empty; selection of a transaction fully supporting the sensitive rule), except that the selected item value is converted to the uncertain symbol "?" instead of 0, repeating until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm; the only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
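In code, the only change with respect to the distortion sketch given after Figure 3.9 is the write operation and the support test: a blocked "?" value no longer counts as certain support. A sketch, with transactions as item → value dictionaries whose values range over {0, 1, "?"} (illustrative names):

    def certainly_supports(t, itemset):
        # A "?" value never counts as certain support.
        return all(t.get(i) == 1 for i in itemset)

    def confidence(db, ant, cons):
        both = sum(1 for t in db if certainly_supports(t, set(ant) | set(cons)))
        base = sum(1 for t in db if certainly_supports(t, ant))
        return both / base if base else 0.0

    def blocking_sanitize(db, ant, cons, item, threshold):
        supporting = [t for t in db
                      if certainly_supports(t, set(ant) | set(cons))]
        for t in supporting:
            if confidence(db, ant, cons) <= threshold:
                break
            t[item] = "?"  # block the value instead of distorting it
        return db

    db = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0}]
    blocking_sanitize(db, {"a"}, {"b"}, "b", threshold=0.5)
    print(confidence(db, {"a"}, {"b"}))  # 0.33... after one blocked value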

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we would normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).


INPUT: the Asset Schema A associated to the target database; the rule Rh to be hidden; sup(Rh) and sup(ant(Rh)); the thresholds Th
OUTPUT: the set Qitem ordered by quality impact

    Begin
      Qitem = ∅
      Confa = Sup(Rh) / Sup(ant(Rh))
      Nstep_ant = countstep_ant(Confa, Thresholds(Min, Max))
      Nstep_post = countstep_post(Confa, Thresholds(Min, Max))
      For each item in Rh do
        if (item ∈ ant(Rh)) then N = Nstep_ant
        else N = Nstep_post
        For each AIS ∈ Asset do
          node = recover_item(AIS, item)
          Complet_Cost = node.completeness_weight * N
          Constr_Cost = Constr_surf(N, node)
          item.impact = item.impact + (Complet_Cost * AIS.Complet_weight)
                                    + (Constr_Cost * AIS.Constr_weight)
      sort_by_impact(Items)
      Return(Items)
    End

Figure 3.11: Item Impact Rank Algorithm for the Blocking-Based Algorithm

4. We reapply the process for all the rules contained in the set of sensitive rules.

What we observed in a large number of tests is that, even if after every single sanitization the DQ related to the corresponding rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database has a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason we propose here an improvement of our strategy that addresses (but does not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and that it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized, rules. If we limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rule set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform the sanitization of all the rules in a burst. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set as soon as they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase;

• Multi Rules Zero Impact Hiding;

• Multi Rules Low Impact Hiding.

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the individual algorithms, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

    Begin
      Rsup = Rs
      while (Rsup ≠ ∅) do
        select r from Rsup
        for (i = 0; i <= transaction.maxsize; i++) do
          if (item[i] ∈ r) then
            item[i].count++
            item[i].list = item[i].list ∪ {r}
        Rsup = Rsup - r
      sort(item)
      Return(item)
    End

Figure 3.12: The Item Occurrences Classification Algorithm

INPUT: the sensitive rules set Rs; the list of the database items
OUTPUT: the list of the Zero Impact Items

    Begin
      Rsup = Rs
      Isup = list of all the items
      Zitem = ∅
      while (Rsup ≠ ∅) and (Isup ≠ ∅) do
        select r from Rsup
        for each (item ∈ r) do
          if (item ∈ Isup) then
            if (item ∉ Asset) then Zitem = Zitem + item
            Isup = Isup - item
        Rsup = Rsup - r
      return(Zitem)
    End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding itself. More in detail, in this phase the items are classified according to the number of supported rules; moreover, the Zero Impact Items are classified according to the same criterion. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier of the rule is inserted into the list of the rules supported by that item; this information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| * |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a burst the Zero Impact Items for all the sensitive rules.
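On the toy encoding used so far, both pre-hiding routines reduce to a few lines. The sketch below (illustrative names; rules as (antecedent, consequent) pairs of frozensets, the Asset flattened to the set of items it references) counts occurrences as in Figure 3.12 and extracts the ordered zero-impact set as in Figure 3.13:

    from collections import Counter

    def item_occurrences(rules):
        # In how many sensitive rules does each item occur?
        counts = Counter()
        for ant, cons in rules:
            for item in ant | cons:
                counts[item] += 1
        return counts

    def zero_impact_items(rules, asset_items):
        # Items of the sensitive rules not referenced by the Asset,
        # ordered by number of supported rules (descending).
        counts = item_occurrences(rules)
        zero = [i for i in counts if i not in asset_items]
        return sorted(zero, key=counts.__getitem__, reverse=True)

    rules = [(frozenset({"a"}), frozenset({"b"})),
             (frozenset({"a", "c"}), frozenset({"d"})),
             (frozenset({"c"}), frozenset({"b"}))]
    print(zero_impact_items(rules, asset_items={"a", "b"}))  # ['c', 'd']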

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by each of these items;

• the items supporting a rule of the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this information, we are able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values of the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• a sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• a partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present in the next section the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs; the list of the database items (with the rule support); the list of Zero Impact Items; the database D
OUTPUT: a partially sanitized database PS

    Begin
      Izm = {item ∈ Itemlist | item ∈ Zitem}
      for (i = 0; i < |Izm|; i++) do
        Rss = {r ∈ Rs | Izm[i] ∈ r}
        Tss = {t ∈ D | t fully supports a rule ∈ Rss}
        while (Tss ≠ ∅) and (Rss ≠ ∅) do
          select a transaction t from Tss
          Tss = Tss - t
          Sanitize(t, Izm[i])
          Recompute_Sup_Conf(Rss)
          for all r ∈ Rss such that (r.sup < minsup) or (r.conf < minconf) do
            Rss = Rss - r
            Rs = Rs - r
    End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
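A compact Python rendering of Figure 3.14 on the toy encoding (illustrative names; the sanitize callback stands for the black-box distortion or blocking write, and for brevity only the confidence test is kept when checking whether a rule is hidden):

    def confidence(db, ant, cons):
        both = sum(1 for t in db if (ant | cons) <= t)
        base = sum(1 for t in db if ant <= t)
        return both / base if base else 0.0

    def multi_rule_zero_impact_hiding(db, rules, zero_items, minconf, sanitize):
        # Burst-hide, for each zero-impact item, every rule containing it;
        # hidden rules leave both Rss and the global sensitive set Rs.
        for item in zero_items:
            rss = [r for r in rules if item in (r[0] | r[1])]
            tss = [t for t in db if any((r[0] | r[1]) <= t for r in rss)]
            while tss:
                rss = [r for r in rss if confidence(db, *r) > minconf]
                if not rss:
                    break
                sanitize(tss.pop(), item)
            rules[:] = [r for r in rules if confidence(db, *r) > minconf]
        return rules  # rules that still need the low-impact pass

    db = [{"a", "b"}, {"a", "b"}, {"a"}]
    rules = [(frozenset({"a"}), frozenset({"b"}))]
    multi_rule_zero_impact_hiding(db, rules, ["b"], 0.5,
                                  lambda t, i: t.discard(i))
    print(rules)  # []: here the zero-impact pass hides everything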

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as far as possible a good DQ, to extend the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated data quality impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values of the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized, and the damage to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, the algorithm does not specify whether the sanitization is based on blocking or on distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database, and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank; it then tries to hide the rules by modifying the Zero Impact Items and, finally, if there remain rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database; the sensitive rules set Rs to be hidden; the thresholds Th; the list of items
OUTPUT: the sanitized database SDB

    Begin
      Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
      for each item ∈ Iml do
        for each rule ∈ item.list do
          N = compute_step(item, thresholds)
          For each AIS ∈ Asset do
            node = recover_item(AIS, item)
            item.list.rule.cost = item.list.rule.cost + ComputeCost(node, N)
        item.maxcost = maxcost(item)
      Sort Iml by max cost and number of supported rules
      While (Rs ≠ ∅) do
        Rss = {r ∈ Rs | Iml[1] ∈ r}
        Tss = {t ∈ D | t fully supports a rule ∈ Rss}
        while (Tss ≠ ∅) and (Rss ≠ ∅) do
          select a transaction t from Tss
          Tss = Tss - t
          Sanitize(t, Iml[1])
          Recompute_Sup_Conf(Rss)
          for all r ∈ Rss such that (r.sup < minsup) or (r.conf < minconf) do
            Rss = Rss - r
            Rs = Rs - r
        Iml = Iml - Iml[1]
    End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs; the database D; the Asset; the thresholds
OUTPUT: a sanitized database PS

    Begin
      Item = Item_Occ_Class(Rs, Itemlist)
      Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
      if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
      if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
    End

Figure 3.16: The General Algorithm
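Folding the previous sketches together, the general algorithm of Figure 3.16 becomes a short self-contained driver (again on the toy encoding, with illustrative names; the DQ impact of a non-zero-impact item is reduced to a single per-item weight standing in for the Asset-based ranking):

    from collections import Counter

    def confidence(db, ant, cons):
        both = sum(1 for t in db if (ant | cons) <= t)
        base = sum(1 for t in db if ant <= t)
        return both / base if base else 0.0

    def hide_with_item(db, rules, item, minconf):
        # Burst-sanitize, via 'item', every still-visible rule containing it.
        rss = [r for r in rules if item in (r[0] | r[1])]
        tss = [t for t in db if any((r[0] | r[1]) <= t for r in rss)]
        while tss:
            rss = [r for r in rss if confidence(db, *r) > minconf]
            if not rss:
                break
            tss.pop().discard(item)  # distortion write (1 -> 0)
        rules[:] = [r for r in rules if confidence(db, *r) > minconf]

    def general_sanitize(db, rules, asset_weight, minconf):
        counts = Counter(i for r in rules for i in (r[0] | r[1]))
        zero = sorted((i for i in counts if asset_weight.get(i, 0) == 0),
                      key=counts.__getitem__, reverse=True)
        for item in zero:  # Multi Rules Zero Impact pass
            hide_with_item(db, rules, item, minconf)
        ranked = sorted(counts, key=lambda i: (asset_weight.get(i, 0),
                                               -counts[i]))
        for item in ranked:  # Multi Rules Low Impact pass
            if not rules:
                break
            hide_with_item(db, rules, item, minconf)

    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    rules = [(frozenset({"a"}), frozenset({"b"})),
             (frozenset({"b"}), frozenset({"c"}))]
    general_sanitize(db, rules, {"a": 0.9, "b": 0.5}, minconf=0.4)
    print(rules)  # []: every sensitive rule hidden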

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ, and we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database; this lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data, and thus indirectly to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to capture the relevance of each attribute contained in a set of high-level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion-based and blocking-based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules, and on the basis of this new strategy we proposed a suite of algorithms allowing us to build a more sophisticated data-quality-based PPDM algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, 31(2), pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, 22, pp. 86-105, 1955.



[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8(6) (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6(1) (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75(370) (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5(3), pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, 23(7), pp. 423-437, 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38(3) (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., 4(2), pp. 43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532, Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11(7), pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor, Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, 40(1), pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, 41(2), pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238, AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, 4(4), pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362, Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27 (July and October 1948), pp. 379-423, 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions on Knowledge and Data Engineering, 3(4), pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, 41(2), pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, World Scientific Publishing Co., Armidale, Australia, 1994.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, 39(11), pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155, Springer Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 31: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

37 DATA QUALITY DISTORTION BASED ALGORITHM 29

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact (Accuracy-13Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value from 1 to13013

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 36 Flow Chart of the DQDB Algorithm

the ordered set on Impact Items and it selects the one with the lower DQEffect

3 Using the selected item it modifies the selected transactions until the ruleis hidden

The final algorithm is reported in Figure 39 Given the algorithm for sake ofcompleteness we try to give an idea about its complexity

bull The search of all zero impact items assuming the items to be orderedaccording to a lexicographic order can be executed in N lowastO(lg(M)) whereN is the size of the rule and M is the number of different items containedin the Asset

30 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hiddenOUTPUT the set (possibly empty) Zitem of zero Impact items discovered

1Begin

2 zitem=empty3 Rsup=Rh4 Isup=list of the item contained in Rsup5 While (Isup 6= empty)6 7 res=08 Select item from Isup9 res=Search(Asset Item)10 if (res==0)then zitem=zitem+item11 Isup=Isup-item1213return(zitem)14End

Figure 37 Zero Impact Algorithm

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Confp=Confa5 Nstep if ant=06 Nstep if post=07 While (Confa gt Th) do8 9 Nstep if ant++

10 Confa=Sup(Rh)minusNstep if ant

Sup(ant(Rh))minusNstep if ant

1112While (Confb gt Th) do1314 Nstep if post++

15 Confa=Sup(Rh)minusNstep if post

Sup(ant(Rh))

16

17For each item in Rh do1819 if (item isin ant(Rh)) then N=Nstep if ant20 else N=Nstep if post21 For each AIS isin Asset do22 23 node=recover item(AISitem)24 Accur Cost=nodeaccuracy Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

(Accur Cost lowastAISAccur weight)++(Constr Cost lowastAISCOnstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 38 Item Impact Rank Algorithm for the Distortion Based Algorithm(we remember here that ant(Rh) indicates the antecedent component of a rule)

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

bull The Cost of the Impact Item Set computation is less immediate to formu-late However the computation of the sanitization step is linear (O(bSup(Rh)c))Moreover the constraint exploration in the worst case has a complex-ity in O(NC) where NC is the total number of constraints containedin the Asset This operation is repeated for each item in the rule Rh

and then it takes |Items|O(NC) Finally in order to sort the Item-set we can use the MergeSort algorithm and then the complexity is|itemset|O(lg(|itemset|))

bull The sanitization has a complexity in O(N) where N is the number oftransaction modified in order to hide the rule

INPUT the Asset Schema A associated to the target databasethe rule Rh to be hidden the target database DOUTPUT the Sanitized Database

1Begin

2 Zitem=DQDB Zero Impact(AssetRh)3 if (Zitem 6= empty) then item=random sel(Zitem)4 else5 6 Impact set=Items Impact rank(AssetRhSup(Rh)Sup(ant(Rh)) Threshold)7 item=best(Impact set)8 9 Tr=t isin D|tfully support Rh10While (RhConf gtThreshold) do11sanitize(Tritem)12End

Figure 39 The Distortion Based Sanitization Algorithm

38 Data Quality Based Blocking Algorithm

In the Blocking based algorithms the idea is to substitute the value of an itemsupporting the rule we want to hide with a meaningless symbol In this way weintroduce an uncertainty related to the question ldquoHow to interpret this valuerdquoThe final effect is then the same of the Distortion Based Algorithm The strategywe propose here as showed in Figure 310 is very similar to the one presentedin the previous section the only difference is how to compute the Impact on theData Quality For this reason the Zero Impact Algorithm is exactly the samepresented before for the DQDB Algorithm However some adjustments arenecessary in order to implement the Impact Item Rank Algorithm The majorsmodifications are related to the Number of Step Evaluation (that is of coursedifferent) and the Data Quality evaluation In fact in this case it does not make

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact13(Completeness-13

Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value to rdquo13

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 310 Flow Chart of BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for theDBDQ Algorithm The only relevant difference is the insertion of the ldquordquo valueinstead of the 0 value From the complexity point of view the complexity is thesame of the complexity of the previous algorithm

39 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on theDQ of the database is reduced However we note an interesting phenomenonIn what follows we give a description of this phenomena an interpretation anda first solution

Consider a DataBase D to be sanitized D contains a set of four sensitiverules In order to protect these rules we normally execute the following steps

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)

39 THE DATA QUALITY STRATEGY (SECOND PART) 33

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Nstepant=countstep ant(confaThresholds (MinMax))5 Nsteppost=countstep post(confaThresholds (MinMax))6 For each item in Rh do7 8 if (item isin ant(Rh)) then N=Nstepant9 else N=Nsteppost10 For each AIS isin Asset do11 12 node=recover item(AISitem)

24 Complet Cost=nodeComplet Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

+(Complet Cost lowastAISComplet weight)++(Constr Cost lowastAISConstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 311 Item Impact Rank Algorithm for the Blocking Based Algorithm

4 We reapply the process for all the rules contained in the set of sensitiverules

What we observed in a large number of tests is that even if after everysanitization the DQ related to a single rule is preserved (or at least the damageis mitigated) at the end of the whole sanitization (every rule hidden) thedatabase had a lower level of DQ that the one we expected

Reflecting on the motivation of this behavior we argue that the sanitizationof a certain rule Ra may damage the DQ related to a previously sanitized ruleRp having for example some items in common

For this reason here we propose an improvement to our strategy to address(but not to completely solve) this problem It is important to notice that thisis a common problem for all the PPDM Algorithm and it is not related only tothe algorithm we proposed in the previous sections

The new strategy is based on the following intuition The Global DQ degra-dation is due to the fact that in some sanitization items are modified that wereinvolved in other already sanitized rules If we try to limit the number of dif-ferent items modified during the whole process (ie during the sanitization ofthe whole rules set) we obtain a better level of DQ An approach to limit thenumber of modified items is to identify the items supporting the biggest numberof sensitive rules as the most suitable to be modified and then performing in aburst the sanitization of all the rules The strategy we adopt can be summarizedas follows

1 We classify the items supporting the rules contained in the sensitive setanalyzing how many rules simultaneously contain the same item (we count

34 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

the occurrences of the items in the different rules)

2 We identify given the sensitive set the Global Set of Zero Impact Items

3 We order the Zero Impact Set by considering the previously counted oc-currences

4 The sanitization starts by considering the Zero Impact Items that supportthe maximum number of Rules

5 Simultaneously all the rules supported by the chosen item are sanitizedand removed from the sensitive set when they result hidden

6 If the Zero Impact Set is not sufficient to hide every sensitive rule for theremaining items their DQ Impact is computed

7 From the resulting classification if several items exist with the same im-pact the one supporting the highest number of rules is chosen

8 The process continues until all the rules are hidden

From a methodological point of view we can divide the complete strategy weproposed into three main blocks

bull Pre Hiding phase

bull Multi Zero Impact Hiding

bull Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before For thisreason we present in the remainder of this report the algorithms giving at theend a more general algorithm

310 PRE-HIDING PHASE 35

310 Pre-Hiding Phase

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the items supporting the rules sorted by occurrences number

1Begin

2 Rsup=Rs3 while (Rsup 6= empty) do4 5 select r from Rsup6 for (i = 0 i + + i le transactionmaxsize) do7 if (item[i] isin r) then8 9 item[i]count++10 item[i]list=item[i]listcupr11 12 Rsup=Rsupminus r1314sort(item)15Return(item) End

Figure 312 The Item Occurrences classification algorithm

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the Zero Impact Items

1Begin

2 Rsup=Rs3 Isup=list of all the items4 Zitem = empty5 while (Rsup 6= empty) and (Isup 6= empty) do6 7 select r from Rsup8 for each (item isin r) do9 10 if (item isin Isup) then11 if (item isin Asset) then Zitem = Zitem + item12 Isup = Isupminus item13 14 Rsup = Rsupminus r1516return(Zitem)End

Figure 313 The Multi Rule Zero Impact Algorithm

36 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

This phase executes all the operations not specifically related to the hidingoperations More in details in this phase the items are classified according tothe number of supported rules Moreover the Zero Impact Items are classifiedaccording to the same criteria Figure 312 reports the algorithm used to discoverthe occurrences of the items This algorithm takes as input the set of sensitiverules and the list of all the different items defined for the target database Forevery rule contained in the set we check if the rule is supported by every itemcontained in the Item list If this is the case a counter associated with the itemis incremented and an identifier associated with the rule is inserted into the listof the rules supported by a certain item This information is used in one of thesubsequent steps in order to optimize the computation Once a rule has beenchecked with every item in the list it is removed from the sensitive set Froma computational point of view it is relevant to note that the entire process isexecuted a number of times equal to

|sensitive set| lowast |Item List|

Figure 313 reports the algorithm used to identify the Zero Impact Items forthe sensitive Rules set It is quite similar to the one presented in the previoussection the difference is just that it searches in a burst the Zero-Impact itemsfor all the sensitive rules

311 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two importantinformation

bull The items supporting more than a rule and more in details the exactnumber of rules supported by these Items

bull The items supporting a rule isin sensitive rules set that have the impor-tant property of being without impact on the DQ of the database

Using these relevant information we are then able to start the sanitizationprocess for those rules supported by Zero Impact Items The strategy we adoptis thus as follows

1 We extract from the list of Items in an ordered manner the items alreadyidentified as Zero Impact (the Izm set)

2. For each item contained in the new set, we build the set of the sensitive rules it supports (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).

311 MULTI RULES ZERO IMPACT HIDING 37

4. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set (a compact sketch of this loop is given below).
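A minimal Python sketch of this loop, under the same representation as the previous sketch (transactions as mutable sets of items, the sensitive set Rs as a mutable set of rules), could look as follows. The support, confidence and distortion-style sanitize functions are simplified stand-ins, not the reference implementation of the report.

def support(db, rule):
    # Fraction of transactions fully supporting the rule (antecedent plus consequent).
    ant, cons = rule
    return sum(1 for t in db if ant | cons <= t) / len(db)

def confidence(db, rule):
    # Support of the whole rule over the support of its antecedent.
    ant, cons = rule
    sup_ant = sum(1 for t in db if ant <= t)
    full = sum(1 for t in db if ant | cons <= t)
    return 0.0 if sup_ant == 0 else full / sup_ant

def sanitize_distortion(transaction, item):
    # Distortion on a set-based transaction: the supporting 1 becomes a 0.
    transaction.discard(item)

def multi_rule_zero_impact_hiding(db, rs, izm, supported,
                                  sanitize, minsup, minconf):
    # Steps 1-5: for each Zero Impact item, sanitize transactions that
    # fully support one of its sensitive rules until those rules are hidden.
    for item in izm:
        rss = {r for r in supported[item] if r in rs}
        tss = [t for t in db if any(r[0] | r[1] <= t for r in rss)]
        while tss and rss:
            t = tss.pop()
            sanitize(t, item)  # black box: distortion or blocking
            hidden = {r for r in rss
                      if support(db, r) < minsup
                      or confidence(db, r) < minconf}
            rss -= hidden
            rs -= hidden
    return db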

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact items.

Obviously, in a real case the first result rarely happens. For this reason, in what follows we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact items, the database D
OUTPUT: a partially sanitized database PS

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do
  {
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss such that (r.sup < minsup) or (r.conf < minconf) do
      {
        Rss = Rss − r
        Rs = Rs − r
      }
    }
  }
End

Figure 3.14: The Multi Rules Zero Impact Hiding algorithm

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to preserve as much as possible a good DQ, to extend the concept of the multi-rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is associated with every item.

3. The items are sorted by maximum impact and by number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules it supports (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all the rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More specifically, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank. Then it tries to hide the rules by modifying the Zero Impact items and finally, if there exist other rules to be hidden, the Low Impact algorithm is invoked.
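Sketched in the same style, and reusing the support, confidence and sanitize helpers of the previous sketch, the low-impact loop differs only in how the items are ranked. Here dq_impact(item, rule) stands for the Asset-based cost estimation of Figure 3.15; it is an assumed callable, not implemented here, which is also how the black-box property shows up in code: any sanitize with the same signature, distortion or blocking, can be plugged in.

def multi_rule_low_impact_hiding(db, rs, supported, dq_impact,
                                 sanitize, minsup, minconf):
    # Rank items by ascending maximum estimated DQ impact, breaking ties
    # by descending number of supported rules, then reuse the hiding loop.
    iml = sorted((i for i in supported if any(r in rs for r in supported[i])),
                 key=lambda i: (max(dq_impact(i, r) for r in supported[i]),
                                -len(supported[i])))
    while rs and iml:
        item = iml.pop(0)
        rss = {r for r in supported[item] if r in rs}
        tss = [t for t in db if any(r[0] | r[1] <= t for r in rss)]
        while tss and rss:
            t = tss.pop()
            sanitize(t, item)  # same black box as before
            hidden = {r for r in rss
                      if support(db, r) < minsup
                      or confidence(db, r) < minconf}
            rss -= hidden
            rs -= hidden
    return db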

INPUT: the Asset schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
  {
    for each rule ∈ item.list do
    {
      N = compute_step(item, thresholds)
      for each AIS ∈ Asset do
      {
        node = recover_item(AIS, item)
        item.rule_cost = item.rule_cost + ComputeCost(node, N)
      }
    }
    item.maxcost = maxcost(item)
  }
  sort Iml by max cost and number of rules supported
  while (Rs ≠ ∅) do
  {
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss such that (r.sup < minsup) or (r.conf < minconf) do
      {
        Rss = Rss − r
        Rs = Rs − r
      }
    }
    Iml = Iml − Iml[1]
  }
End

Figure 3.15: The Low Data Quality Impact algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_Zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_Low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
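Read together, the sketches above compose into the general algorithm of Figure 3.16 roughly as follows; again, this is a sketch under the previous assumptions, not the report's reference code.

def sanitize_database(db, rs, items, asset_items, dq_impact,
                      sanitize, minsup, minconf):
    # Pre-hiding, then zero-impact hiding, then the low-impact
    # fallback for whatever sensitive rules are still exposed.
    ranked, supported = classify_occurrences(rs, items)
    zitem = zero_impact_items(rs, items, asset_items)
    # Zero Impact items first, those supporting the most rules first.
    izm = [i for i in ranked if i in zitem]
    if izm:
        multi_rule_zero_impact_hiding(db, rs, izm, supported,
                                      sanitize, minsup, minconf)
    if rs:
        multi_rule_low_impact_hiding(db, rs, supported, dq_impact,
                                     sanitize, minsup, minconf)
    return db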

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and then, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated DQ-based PPDM algorithm.


Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input, Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, Volume 22, pp. 86-105, 1955.


[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.


[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies (P. Doyle, J. Lane, J. Theeuwes, L. Zayatz, eds.), pp. 113-134, North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.


[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (Jim Gray, Series Editor). Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system: combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.


[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as a resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.


[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.


[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.


[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co. Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan, Computer Technology Institute and Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.


[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations, while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.


[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 33: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

38 DATA QUALITY BASED BLOCKING ALGORITHM 31

• The cost of the Impact Item Set computation is less immediate to formulate. The computation of the sanitization step is linear, i.e. O(Sup(Rh)). Moreover, the constraint exploration has, in the worst case, a complexity of O(NC), where NC is the total number of constraints contained in the Asset; since this operation is repeated for each item in the rule Rh, it takes O(|Items| · NC). Finally, in order to sort the itemset we can use the MergeSort algorithm, with complexity O(|itemset| · log(|itemset|)).

• The sanitization has a complexity of O(N), where N is the number of transactions modified in order to hide the rule.

Overall, the cost of hiding one rule is therefore O(Sup(Rh)) + O(|Items| · NC) + O(|itemset| · log(|itemset|)) + O(N).

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, the target database D
OUTPUT: the Sanitized Database

Begin
  Zitem = DQDB_Zero_Impact(Asset, Rh)
  if (Zitem ≠ ∅) then item = random_sel(Zitem)
  else
  {
    Impact_set = Items_Impact_rank(Asset, Rh, Sup(Rh), Sup(ant(Rh)), Threshold)
    item = best(Impact_set)
  }
  Tr = {t ∈ D | t fully supports Rh}
  While (Rh.Conf > Threshold) do
    sanitize(Tr, item)
End

Figure 3.9: The Distortion Based Sanitization Algorithm
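To make the distortion step concrete, the following is a minimal Python sketch of the sanitization loop of Figure 3.9, assuming transactions are represented as sets of items and a rule as a pair (antecedent, consequent); the function names (support, confidence, distortion_hide) and the toy database are illustrative, not part of the report's implementation.

def support(db, itemset):
    # number of transactions fully supporting the itemset
    return sum(1 for t in db if itemset <= t)

def confidence(db, ant, cons):
    s_ant = support(db, ant)
    return support(db, ant | cons) / s_ant if s_ant else 0.0

def distortion_hide(db, ant, cons, item, threshold):
    # transactions fully supporting the rule ant -> cons
    full = [t for t in db if (ant | cons) <= t]
    for t in full:
        if confidence(db, ant, cons) <= threshold:
            break
        t.discard(item)  # distortion: change the item's value from 1 to 0
    return db

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
print(confidence(db, {"a"}, {"b"}))           # 1.0
distortion_hide(db, {"a"}, {"b"}, "b", threshold=0.5)
print(confidence(db, {"a"}, {"b"}))           # about 0.33

Note that removing an item of the consequent from a supporting transaction lowers Sup(Rh) while leaving Sup(ant(Rh)) unchanged, so the confidence decreases at every step.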

3.8 Data Quality Based Blocking Algorithm

In the Blocking based algorithms, the idea is to substitute the value of an item supporting the rule we want to hide with a meaningless symbol. In this way, we introduce an uncertainty related to the question "How should this value be interpreted?". The final effect is then the same as that of the Distortion Based Algorithm. The strategy we propose here, as shown in Figure 3.10, is very similar to the one presented in the previous section; the only difference is how the impact on Data Quality is computed. For this reason, the Zero Impact Algorithm is exactly the same presented before for the DQDB Algorithm. However, some adjustments are necessary in order to implement the Impact Item Rank Algorithm. The major modifications are related to the Number of Steps Evaluation (which is of course different) and to the Data Quality evaluation. In fact, in this case it does not make sense to evaluate the Accuracy Lack, because every time an item is changed the new value is meaningless. For this reason, in this case we consider the pair (Completeness, Consistency) and not the pair (Accuracy, Consistency). Figure 3.11 reports the new Impact Item Rank Algorithm.

[Figure 3.10 is a flow chart with the following steps: identify a sensible rule; identify the set of zero_impact items; if the set is not empty, select an item from the zero_impact set; otherwise, for each item in the selected rule, inspect the Asset and evaluate the DQ impact (Completeness, Consistency), rank the items by quality impact and select the best item; then select a transaction fully supporting the sensitive rule and convert its value to "?"; repeat until the rule is hidden.]

Figure 3.10: Flow Chart of the BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for the DQDB Algorithm. The only relevant difference is the insertion of the "?" value instead of the 0 value. From the complexity point of view, the complexity is the same as that of the previous algorithm.
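As an illustration, here is a minimal Python sketch of the blocking variant, assuming each transaction is a dict mapping items to "1", "0" or "?", and that only transactions whose values are certainly "1" count toward support; all names and the toy threshold are illustrative.

def certainly_supports(t, itemset):
    return all(t.get(i) == "1" for i in itemset)

def support(db, itemset):
    return sum(1 for t in db if certainly_supports(t, itemset))

def blocking_hide(db, ant, cons, item, threshold):
    rule = ant | cons
    full = [t for t in db if certainly_supports(t, rule)]
    for t in full:
        s_ant = support(db, ant)
        conf = support(db, rule) / s_ant if s_ant else 0.0
        if conf <= threshold:
            break
        t[item] = "?"  # blocking: a meaningless symbol instead of 0
    return db

db = [{"a": "1", "b": "1"}, {"a": "1", "b": "1"}, {"a": "1", "b": "0"}]
blocking_hide(db, {"a"}, {"b"}, "b", threshold=0.5)
print(db)  # the first transaction now carries b = "?"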

3.9 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on the DQ of the database is reduced. However, we noted an interesting phenomenon. In what follows we give a description of this phenomenon, an interpretation, and a first solution.

Consider a database D to be sanitized, containing a set of four sensitive rules. In order to protect these rules, we normally execute the following steps:

1. We select a rule Rh from the pool of sensitive rules.

2. We apply our algorithm (DQDB or BBDB).

3. We obtain a sanitized database in which the DQ is preserved (with respect to the rule Rh).

4. We reapply the process for all the rules contained in the set of sensitive rules.

INPUT: the Asset Schema A associated to the target database, the rule Rh to be hidden, sup(Rh), sup(ant(Rh)), the Threshold Th
OUTPUT: the set Qitem ordered by Quality Impact

Begin
  Qitem = ∅
  Confa = Sup(Rh) / Sup(ant(Rh))
  Nstepant = countstep_ant(Confa, Thresholds(Min, Max))
  Nsteppost = countstep_post(Confa, Thresholds(Min, Max))
  For each item in Rh do
  {
    if (item ∈ ant(Rh)) then N = Nstepant
    else N = Nsteppost
    For each AIS ∈ Asset do
    {
      node = recover_item(AIS, item)
      Complet_Cost = node.Complet_Weight * N
      Constr_Cost = Constr_surf(N, node)
      item.impact = item.impact + (Complet_Cost * AIS.Complet_weight) + (Constr_Cost * AIS.Constr_weight)
    }
  }
  sort_by_impact(Items)
  Return(Items)
End

Figure 3.11: Item Impact Rank Algorithm for the Blocking Based Algorithm
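A minimal Python sketch of the ranking idea in Figure 3.11, under simplifying assumptions: the Asset is flattened so that each item maps to a list of (completeness_weight, consistency_weight) pairs, one per AIS containing it, and the constraint cost Constr_surf is replaced by a linear stand-in; every name here is illustrative.

def rank_items(rule_items, asset, n_steps_ant, n_steps_post, antecedent):
    impacts = {}
    for item in rule_items:
        # the number of sanitization steps depends on the item's side of the rule
        n = n_steps_ant if item in antecedent else n_steps_post
        impact = 0.0
        for complet_w, constr_w in asset.get(item, []):
            impact += complet_w * n   # completeness cost of n blocked values
            impact += constr_w * n    # linear stand-in for the constraint cost
        impacts[item] = impact
    return sorted(impacts, key=impacts.get)   # lowest estimated DQ impact first

asset = {"a": [(0.9, 0.5)], "b": [(0.2, 0.1), (0.3, 0.0)]}
print(rank_items({"a", "b"}, asset, 2, 3, antecedent={"a"}))   # ['b', 'a']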

What we observed in a large number of tests is that, even if after every sanitization the DQ related to a single rule is preserved (or at least the damage is mitigated), at the end of the whole sanitization (every rule hidden) the database had a lower level of DQ than the one we expected.

Reflecting on the motivation of this behavior, we argue that the sanitization of a certain rule Ra may damage the DQ related to a previously sanitized rule Rp having, for example, some items in common.

For this reason, here we propose an improvement to our strategy that addresses (but does not completely solve) this problem. It is important to notice that this is a common problem for all PPDM algorithms, and it is not related only to the algorithms we proposed in the previous sections.

The new strategy is based on the following intuition: the global DQ degradation is due to the fact that some sanitizations modify items that were involved in other, already sanitized rules. If we limit the number of different items modified during the whole process (i.e., during the sanitization of the whole rules set), we obtain a better level of DQ. An approach to limiting the number of modified items is to identify the items supporting the biggest number of sensitive rules as the most suitable to be modified, and then to perform in a burst the sanitization of all the rules. The strategy we adopt can be summarized as follows:

1. We classify the items supporting the rules contained in the sensitive set, analyzing how many rules simultaneously contain the same item (i.e., we count the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by considering the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized, and they are removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ Impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we proposed into three main blocks:

• Pre-Hiding phase

• Multi Rules Zero Impact Hiding

• Multi Rule Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms for each block, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

Begin
  Rsup = Rs
  while (Rsup ≠ ∅) do
  {
    select r from Rsup
    for (i = 0; i ≤ transaction.maxsize; i++) do
      if (item[i] ∈ r) then
      {
        item[i].count++
        item[i].list = item[i].list ∪ r
      }
    Rsup = Rsup - r
  }
  sort(item)
  Return(item)
End

Figure 3.12: The Item Occurrences Classification Algorithm
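The same classification can be sketched in a few lines of Python, assuming rules are frozensets of items; the names classify_occurrences, count and supported are illustrative.

from collections import defaultdict

def classify_occurrences(sensitive_rules, items):
    count = defaultdict(int)        # item -> number of sensitive rules containing it
    supported = defaultdict(list)   # item -> identifiers of those rules
    for r_id, rule in enumerate(sensitive_rules):
        for item in items:
            if item in rule:
                count[item] += 1
                supported[item].append(r_id)
    ranked = sorted(items, key=lambda i: count[i], reverse=True)
    return ranked, count, supported

rules = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"b", "d"})]
ranked, count, supported = classify_occurrences(rules, ["a", "b", "c", "d"])
print(ranked[0], count["b"], supported["b"])   # b 3 [0, 1, 2]

As in the report, the loop visits every (rule, item) pair, which matches the |sensitive set| · |Item List| cost discussed below.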

INPUT: the sensitive Rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

Begin
  Rsup = Rs
  Isup = list of all the items
  Zitem = ∅
  while (Rsup ≠ ∅) and (Isup ≠ ∅) do
  {
    select r from Rsup
    for each (item ∈ r) do
      if (item ∈ Isup) then
      {
        if (item ∈ Asset) then Zitem = Zitem + item
        Isup = Isup - item
      }
    Rsup = Rsup - r
  }
  return(Zitem)
End

Figure 3.13: The Multi Rule Zero Impact Algorithm
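A minimal Python sketch of this burst search, where the zero-impact test is abstracted into a predicate supplied by the Asset analysis (how an item qualifies as zero impact is defined earlier in the report); all names are illustrative.

def multi_rule_zero_impact(sensitive_rules, items, is_zero_impact):
    remaining = set(items)   # items not yet examined
    zitem = []
    for rule in sensitive_rules:
        for item in rule:
            if item in remaining:
                remaining.discard(item)
                if is_zero_impact(item):
                    zitem.append(item)
    return zitem

rules = [{"a", "b"}, {"b", "c"}]
print(sorted(multi_rule_zero_impact(rules, {"a", "b", "c"}, lambda i: i != "b")))
# ['a', 'c']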


This phase executes all the operations not specifically related to the hiding operations. More in detail, in this phase the items are classified according to the number of supported rules. Moreover, the Zero Impact Items are classified according to the same criteria. Figure 3.12 reports the algorithm used to discover the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether each item contained in the item list supports the rule. If this is the case, a counter associated with the item is incremented, and an identifier associated with the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| · |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive Rules set. It is quite similar to the one presented in the previous section; the difference is just that it searches in a single burst for the Zero Impact Items of all the sensitive rules.

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more in detail, the exact number of rules supported by these items;

• the items supporting a rule in the sensitive rules set that have the important property of having no impact on the DQ of the database.

Using this relevant information, we are able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in the new set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence value for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• A sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items.

• A partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, we present below the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive Rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the Database D
OUTPUT: a partially Sanitized Database PS

Begin
  Izm = {item ∈ Itemlist | item ∈ Zitem}
  for (i = 0; i < |Izm|; i++) do
  {
    Rss = {r ∈ Rs | Izm[i] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Izm[i])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
      {
        Rss = Rss - r
        Rs = Rs - r
      }
    }
  }
End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
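The following is a minimal Python sketch of this phase, assuming set-valued transactions and treating the actual sanitization (blocking or distortion) and the hidden test (recomputed support and confidence against the thresholds) as callables; all names are illustrative.

def mr_zero_impact_hiding(db, sensitive, zitems, sanitize, hidden):
    # db: list of transactions; sensitive: list of rules (sets of items);
    # sanitize(t, item) modifies a transaction; hidden(db, rule) checks the thresholds
    for item in zitems:
        rss = [r for r in sensitive if item in r]
        tss = [t for t in db if any(r <= t for r in rss)]
        while tss and rss:
            t = tss.pop()
            sanitize(t, item)
            rss = [r for r in rss if not hidden(db, r)]
    sensitive[:] = [r for r in sensitive if not hidden(db, r)]
    return db

db = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
sens = [{"a", "b"}]
mr_zero_impact_hiding(db, sens, ["b"],
                      sanitize=lambda t, i: t.discard(i),
                      hidden=lambda d, r: sum(r <= t for t in d) < 1)
print(sens)   # [] once the rule's support falls below the (toy) threshold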

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database having a sensitive Rules set not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to maintain as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive Rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality Impact is associated with every item.

3. The items are sorted by maximum impact and number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence value for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all the rules are sanitized.

The resulting database will be completely sanitized, and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As it is possible to note, in the algorithm we do not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used. In fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve with our algorithm. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. Then it tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset Schema A associated to the target database, the sensitive Rules set Rs to be hidden, the Thresholds Th, the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
  {
    for each rule in item.list do
    {
      N = compute_step(item, thresholds)
      For each AIS ∈ Asset do
      {
        node = recover_item(AIS, item)
        item.list.rule.cost = item.list.rule.cost + ComputeCost(node, N)
      }
    }
    item.maxcost = maxcost(item)
  }
  Sort Iml by max cost and rules supported
  While (Rs ≠ ∅) do
  {
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
    {
      select a transaction t from Tss
      Tss = Tss - t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      for all r ∈ Rss | (r.sup < minsup) Or (r.conf < minconf) do
      {
        Rss = Rss - r
        Rs = Rs - r
      }
    }
    Iml = Iml - Iml[1]
  }
End

Figure 3.15: Low Data Quality Impact Algorithm
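The ordering used by this phase (step 3 above) can be sketched in Python as follows, assuming the per-rule DQ costs have already been estimated and that ties in maximum impact are broken by the number of supported rules, as in the tie-breaking rule of Section 3.9; costs and supported are illustrative structures.

def order_for_low_impact(costs, supported):
    # costs[item][rule] -> estimated DQ cost of using `item` against `rule`
    # supported[item]   -> the sensitive rules containing `item`
    # lowest maximum impact first; ties broken by the highest number of rules
    return sorted(costs, key=lambda i: (max(costs[i].values()), -len(supported[i])))

costs = {"a": {"r1": 2.0}, "b": {"r1": 1.0, "r2": 1.5}, "c": {"r2": 1.5}}
supported = {"a": ["r1"], "b": ["r1", "r2"], "c": ["r2"]}
print(order_for_low_impact(costs, supported))   # ['b', 'c', 'a']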


INPUT: the sensitive Rules set Rs, the Database D, the Asset, the thresholds
OUTPUT: a Sanitized Database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
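Read as plain Python, the general algorithm is just a driver over the previous phases; in this sketch the four phases (Figures 3.12 to 3.15) are passed in as callables, since their concrete signatures above are pseudocode.

def general_sanitization(db, sensitive_rules, items, classify, zero_impact, zero_hide, low_hide):
    ranked = classify(sensitive_rules, items)           # Figure 3.12: occurrence rank
    zitems = zero_impact(sensitive_rules, ranked)       # Figure 3.13: zero-impact set
    if zitems:
        zero_hide(db, sensitive_rules, ranked, zitems)  # Figure 3.14: burst hiding
    if sensitive_rules:
        low_hide(db, sensitive_rules, ranked)           # Figure 3.15: low-impact hiding
    return db

Since the zero-impact phase removes the rules it manages to hide from the sensitive set, the low-impact phase only runs on what is left, mirroring the two if tests of Figure 3.16.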

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has been, to the best of our knowledge, never addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: in fact, DQ is strongly related to the use of the data, and then indirectly to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensible rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM Algorithm.

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 34: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

32 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

sense to evaluate the Accuracy Lack because every time an item is changedthe new value is meaningless For this reason in this case we consider the pair(Completeness Consistency) and not the pair (Accuracy Consistency) Figure311 reports the new Impact Item Rank Algorithm

Identification of a sensible13rule13

Identification of the set of13zero_impact items13

Is the set13empty13

Select an Item from the13zero_impact set13

For each item in the13selected rule inspect on13the Asset and evaluate13

the DQ impact13(Completeness-13

Consistency)13

Rank the items by quality13impact13

Select the best item13

Select a transaction fully13supporting the sensitive13

rule13

Convert its value to rdquo13

Rule13hidden13

END13

NO13

Yes13

NO13

Yes13

Figure 310 Flow Chart of BBDB Algorithm

The last part of the algorithm is quite similar to the one presented for theDBDQ Algorithm The only relevant difference is the insertion of the ldquordquo valueinstead of the 0 value From the complexity point of view the complexity is thesame of the complexity of the previous algorithm

39 The Data Quality Strategy (Second Part)

Preliminary tests performed on these algorithms show that the impact on theDQ of the database is reduced However we note an interesting phenomenonIn what follows we give a description of this phenomena an interpretation anda first solution

Consider a DataBase D to be sanitized D contains a set of four sensitiverules In order to protect these rules we normally execute the following steps

1 We select a rule Rh from the pool of sensitive rules

2 We apply our algorithm (DQDB or BBDQ)

3 We obtain a sanitized database in which the DQ is preserved (with respectto the rule Rh)

39 THE DATA QUALITY STRATEGY (SECOND PART) 33

INPUT the Asset Schema A associatedto the target databasethe rule Rh to be hiddenthe sup(Rh) the sup(ant(Rh)) the Threshold ThOUTPUT the set Qitem ordered by Quality Impact

1 Begin

2 Qitem=empty

3 Confa=Sup(Rh)

Sup(ant(Rh))

4 Nstepant=countstep ant(confaThresholds (MinMax))5 Nsteppost=countstep post(confaThresholds (MinMax))6 For each item in Rh do7 8 if (item isin ant(Rh)) then N=Nstepant9 else N=Nsteppost10 For each AIS isin Asset do11 12 node=recover item(AISitem)

24 Complet Cost=nodeComplet Weight lowastN 25 Constr Cost=Constr surf(Nnode)26 itemimpact=itemimpact+

+(Complet Cost lowastAISComplet weight)++(Constr Cost lowastAISConstr weight)

27 2829sort by impact(Items)30Return(Items)End

Figure 311 Item Impact Rank Algorithm for the Blocking Based Algorithm

4 We reapply the process for all the rules contained in the set of sensitiverules

What we observed in a large number of tests is that even if after everysanitization the DQ related to a single rule is preserved (or at least the damageis mitigated) at the end of the whole sanitization (every rule hidden) thedatabase had a lower level of DQ that the one we expected

Reflecting on the motivation of this behavior we argue that the sanitizationof a certain rule Ra may damage the DQ related to a previously sanitized ruleRp having for example some items in common

For this reason here we propose an improvement to our strategy to address(but not to completely solve) this problem It is important to notice that thisis a common problem for all the PPDM Algorithm and it is not related only tothe algorithm we proposed in the previous sections

The new strategy is based on the following intuition The Global DQ degra-dation is due to the fact that in some sanitization items are modified that wereinvolved in other already sanitized rules If we try to limit the number of dif-ferent items modified during the whole process (ie during the sanitization ofthe whole rules set) we obtain a better level of DQ An approach to limit thenumber of modified items is to identify the items supporting the biggest numberof sensitive rules as the most suitable to be modified and then performing in aburst the sanitization of all the rules The strategy we adopt can be summarizedas follows

1 We classify the items supporting the rules contained in the sensitive setanalyzing how many rules simultaneously contain the same item (we count

34 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

the occurrences of the items in the different rules)

2 We identify given the sensitive set the Global Set of Zero Impact Items

3 We order the Zero Impact Set by considering the previously counted oc-currences

4 The sanitization starts by considering the Zero Impact Items that supportthe maximum number of Rules

5 Simultaneously all the rules supported by the chosen item are sanitizedand removed from the sensitive set when they result hidden

6 If the Zero Impact Set is not sufficient to hide every sensitive rule for theremaining items their DQ Impact is computed

7 From the resulting classification if several items exist with the same im-pact the one supporting the highest number of rules is chosen

8 The process continues until all the rules are hidden

From a methodological point of view we can divide the complete strategy weproposed into three main blocks

bull Pre Hiding phase

bull Multi Zero Impact Hiding

bull Multi Low Impact Hiding

This strategy can be easily applied to the algorithms presented before For thisreason we present in the remainder of this report the algorithms giving at theend a more general algorithm

310 PRE-HIDING PHASE 35

310 Pre-Hiding Phase

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the items supporting the rules sorted by occurrences number

1Begin

2 Rsup=Rs3 while (Rsup 6= empty) do4 5 select r from Rsup6 for (i = 0 i + + i le transactionmaxsize) do7 if (item[i] isin r) then8 9 item[i]count++10 item[i]list=item[i]listcupr11 12 Rsup=Rsupminus r1314sort(item)15Return(item) End

Figure 312 The Item Occurrences classification algorithm

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the Zero Impact Items

1Begin

2 Rsup=Rs3 Isup=list of all the items4 Zitem = empty5 while (Rsup 6= empty) and (Isup 6= empty) do6 7 select r from Rsup8 for each (item isin r) do9 10 if (item isin Isup) then11 if (item isin Asset) then Zitem = Zitem + item12 Isup = Isupminus item13 14 Rsup = Rsupminus r1516return(Zitem)End

Figure 313 The Multi Rule Zero Impact Algorithm

36 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

This phase executes all the operations not specifically related to the hidingoperations More in details in this phase the items are classified according tothe number of supported rules Moreover the Zero Impact Items are classifiedaccording to the same criteria Figure 312 reports the algorithm used to discoverthe occurrences of the items This algorithm takes as input the set of sensitiverules and the list of all the different items defined for the target database Forevery rule contained in the set we check if the rule is supported by every itemcontained in the Item list If this is the case a counter associated with the itemis incremented and an identifier associated with the rule is inserted into the listof the rules supported by a certain item This information is used in one of thesubsequent steps in order to optimize the computation Once a rule has beenchecked with every item in the list it is removed from the sensitive set Froma computational point of view it is relevant to note that the entire process isexecuted a number of times equal to

|sensitive set| lowast |Item List|

Figure 313 reports the algorithm used to identify the Zero Impact Items forthe sensitive Rules set It is quite similar to the one presented in the previoussection the difference is just that it searches in a burst the Zero-Impact itemsfor all the sensitive rules

311 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two importantinformation

bull The items supporting more than a rule and more in details the exactnumber of rules supported by these Items

bull The items supporting a rule isin sensitive rules set that have the impor-tant property of being without impact on the DQ of the database

Using these relevant information we are then able to start the sanitizationprocess for those rules supported by Zero Impact Items The strategy we adoptis thus as follows

1 We extract from the list of Items in an ordered manner the items alreadyidentified as Zero Impact (the Izm set)

2 For each item contained in the new set we build the set of the sensitiverules supported (Rss) and the set of transactions supporting one of therules contained in Rss (Tss)

3 We then select a transaction in Tss and we perform the sanitization onthis transaction (by using either blocking or distortion)

311 MULTI RULES ZERO IMPACT HIDING 37

4 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

5 These operations are repeated for every item contained in the Izm set

The result of this approach can be

bull A sanitized Data Base with an optimal DQ this is the case inwhich all sensitive rules are supported by Zero Impact Items

bull A partially Sanitized Data Base with an optimal DQ this is thecase in which the sensitive set has a subset of rules not supported byZero Impact Items

Obviously in a real case the first result rarely happens For this reason in thiswe present the last step of our strategy that allows us to obtain in any case acompletely sanitized database

38 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the list of the database items (with the Rule support)the list of Zero Impact Items the Database DOUTPUT A partially Sanitized Database PS

1Begin

2 Izm=item isin Itemlist|item isin Zitem3 for (i = 0 i + + i = |Izm|) do4 5 Rss=r isin Rs|Izm[i] isin r6 Tss=t isin D|tfullysupportarule isin Rss7 while(Tss 6= empty)and(Rss 6= empty) do8 9 select a transaction from Tss10 Tss = Tssminus t11 Sanitize(tIzm[i])12 Recompute Sup Conf(Rss)13 forallr isin Rss|(rsup lt minsup)Or(rconf lt minconf) do14 15 Rss=Rss-r16 Rs=Rs-r17 End

Figure 314 The Multi Rules Zero Impact Hiding Algorithm

312 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a databasehaving a sensitive Rules set not supported by Zero Impact Items In this casewe have two possibilities to use the Algorithm presented in the data qualitysection that has the advantage of being very efficient or if we prefer to maintainas much as possible a good DQ to extend the concept of Multi-Rule approachto this particular case The strategy we propose here can be summarized asfollows

1 We extract from the ordered list of items only the items supporting therules contained in the sensitive Rules set Each item contains the list ofthe supported rules

2 For each item and for each rule contained in the list of the supportedrules the estimated DQ impact is computedThe Maximum Data QualityImpact estimated is associated to every item

3 The Items are sorted by Max Impact and number of supported rules

4. For each item contained in the new ordered set, we build the set of the supported sensitive rules (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and confidence values for the rules contained in the Rss set; if one of the rules results hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage related to the data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, the algorithm does not specify whether the sanitization is based on blocking or distortion: we consider this operation as a black box. The proposed strategy is completely independent from the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the item-occurrences rank. It then tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.
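The black-box remark can be made tangible: the two sanitization operators only need to share a signature, so the hiding loops never know which one is plugged in. A small illustrative Python sketch follows; the "?" marker and the function names are our assumptions, not the report's notation.

def distortion_sanitize(transaction, item):
    # Distortion: the item is actually removed from the transaction.
    transaction.discard(item)

def blocking_sanitize(transaction, item):
    # Blocking: the item is replaced by an unknown mark, so the transaction
    # no longer certainly supports the rules containing the item.
    if item in transaction:
        transaction.discard(item)
        transaction.add(("?", item))

def hide_step(sanitize, transaction, item):
    # Either operator can be passed as 'sanitize' without changing the loops
    # of Figures 3.14 and 3.15.
    sanitize(transaction, item)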

INPUT: the Asset Schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

Begin
  Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
  for each item ∈ Iml do
    for each rule supported by item do
      N = compute_step(item, thresholds)
      for each AIS ∈ Asset do
        node = recover_item(AIS, item)
        item.rulecost = item.rulecost + ComputeCost(node, N)
    item.maxcost = max_cost(item)
  Sort Iml by max cost and rules supported
  while (Rs ≠ ∅) do
    Rss = {r ∈ Rs | Iml[1] ∈ r}
    Tss = {t ∈ D | t fully supports a rule ∈ Rss}
    while (Tss ≠ ∅) and (Rss ≠ ∅) do
      select a transaction t from Tss
      Tss = Tss − t
      Sanitize(t, Iml[1])
      Recompute_Sup_Conf(Rss)
      forall r ∈ Rss with (r.sup < min_sup) or (r.conf < min_conf) do
        Rss = Rss − r
        Rs = Rs − r
    Iml = Iml − Iml[1]
End

Figure 3.15: Low Data Quality Impact Algorithm


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

Begin
  Item = Item_Occ_Class(Rs, Itemlist)
  Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
  if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
  if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
End

Figure 3.16: The General Algorithm
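Read as code, Figure 3.16 is a thin driver over the previous phases. The Python sketch below is indicative only: it reuses multi_rule_zero_impact_hiding from the sketch after Figure 3.14, takes the low impact procedure of Figure 3.15 as a parameter rather than re-implementing it, and reduces Item_Occ_Class to a plain occurrence count.

def general_sanitize(db, rs, zero_impact_items, low_impact_hiding,
                     min_sup, min_conf):
    # Phase 1 (Item_Occ_Class): rank items by number of sensitive rules supported.
    occurrences = {}
    for ant, cons in rs:
        for item in ant | cons:
            occurrences[item] = occurrences.get(item, 0) + 1
    ranked = sorted(occurrences, key=occurrences.get, reverse=True)

    # Phase 2 (Multi_Rule_Zero_Imp + MR_zero_Impact_H): Zero Impact Items first.
    izm = [i for i in ranked if i in zero_impact_items]
    remaining = multi_rule_zero_impact_hiding(db, list(rs), izm, min_sup, min_conf)

    # Phase 3 (MR_low_Impact_H): invoked only if some sensitive rules survive.
    if remaining:
        low_impact_hiding(db, remaining, ranked, min_sup, min_conf)
    return db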

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ. Moreover, we have identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database; this lack is the main cause of the DQ problem. In fact, DQ is strongly related to the use of the data and thus, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion and blocking based) with the aim of obtaining a sanitization which is better with respect to DQ. We have then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we have proposed a suite of algorithms allowing us to build a more sophisticated Data Quality Based PPDM Algorithm.



Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162 (1985).

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8 (6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16 (7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes, L. Zayatz (eds.), North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Reading, Massachusetts: Addison-Wesley Publishing Company, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19 (2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48 (3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2003. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. Cambridge, MA: MIT Press, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October), pp. 379-423 and pp. 623-656, 1948.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co. Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of statistical disclosure control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

Figure 313 reports the algorithm used to identify the Zero Impact Items forthe sensitive Rules set It is quite similar to the one presented in the previoussection the difference is just that it searches in a burst the Zero-Impact itemsfor all the sensitive rules

311 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two importantinformation

bull The items supporting more than a rule and more in details the exactnumber of rules supported by these Items

bull The items supporting a rule isin sensitive rules set that have the impor-tant property of being without impact on the DQ of the database

Using these relevant information we are then able to start the sanitizationprocess for those rules supported by Zero Impact Items The strategy we adoptis thus as follows

1 We extract from the list of Items in an ordered manner the items alreadyidentified as Zero Impact (the Izm set)

2 For each item contained in the new set we build the set of the sensitiverules supported (Rss) and the set of transactions supporting one of therules contained in Rss (Tss)

3 We then select a transaction in Tss and we perform the sanitization onthis transaction (by using either blocking or distortion)

311 MULTI RULES ZERO IMPACT HIDING 37

4 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

5 These operations are repeated for every item contained in the Izm set

The result of this approach can be

bull A sanitized Data Base with an optimal DQ this is the case inwhich all sensitive rules are supported by Zero Impact Items

bull A partially Sanitized Data Base with an optimal DQ this is thecase in which the sensitive set has a subset of rules not supported byZero Impact Items

Obviously in a real case the first result rarely happens For this reason in thiswe present the last step of our strategy that allows us to obtain in any case acompletely sanitized database

38 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the list of the database items (with the Rule support)the list of Zero Impact Items the Database DOUTPUT A partially Sanitized Database PS

1Begin

2 Izm=item isin Itemlist|item isin Zitem3 for (i = 0 i + + i = |Izm|) do4 5 Rss=r isin Rs|Izm[i] isin r6 Tss=t isin D|tfullysupportarule isin Rss7 while(Tss 6= empty)and(Rss 6= empty) do8 9 select a transaction from Tss10 Tss = Tssminus t11 Sanitize(tIzm[i])12 Recompute Sup Conf(Rss)13 forallr isin Rss|(rsup lt minsup)Or(rconf lt minconf) do14 15 Rss=Rss-r16 Rs=Rs-r17 End

Figure 314 The Multi Rules Zero Impact Hiding Algorithm

312 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a databasehaving a sensitive Rules set not supported by Zero Impact Items In this casewe have two possibilities to use the Algorithm presented in the data qualitysection that has the advantage of being very efficient or if we prefer to maintainas much as possible a good DQ to extend the concept of Multi-Rule approachto this particular case The strategy we propose here can be summarized asfollows

1 We extract from the ordered list of items only the items supporting therules contained in the sensitive Rules set Each item contains the list ofthe supported rules

2 For each item and for each rule contained in the list of the supportedrules the estimated DQ impact is computedThe Maximum Data QualityImpact estimated is associated to every item

3 The Items are sorted by Max Impact and number of supported rules

4 For each item contained in the new ordered set we build the set of thesensitive rules supported (Rss) and the set of transactions supporting oneof the rules contained in Rss (Tss)

312 MULTI RULE LOW IMPACT ALGORITHM 39

5 We then select a transaction in Tss and we perform the sanitization onthis transaction (blocking or distortion)

6 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

7 These operations are repeated for every item contained in the Izm setuntil all rules are sanitized

The resulting database will be Completely Sanitized and the damagerelated to the data quality will be mitigated Figure 315 reports the algorithmAs it is possible to note in the algorithm we do not specify if the sanitizationis based on blocking or distortion We consider this operation as a black boxThe proposed strategy is completely independent from the type of sanitizationused In fact it is possible to substitute the code for blocking algorithm withthe code for distortion algorithm without compromising the DQ properties wewant to preserve with our algorithm Figure 316 reports the global algorithmformalizing our strategy More in details taking as input to this algorithmthe Asset description the group of rules to be sanitized the database and thethresholds under which consider hidden a rule the algorithm computes theZero-Impact set and the Item-Occurrences rank Then it will try to hide therules modifying the Zero Impact Items and finally if there exist other rules tobe hidden the Low Impact Algorithm is invoked

INPUT the Asset Schema A associatedto the target databasethe sensitive Rules set Rsto be hidden the Thresholds Th the list of itemsOUTPUT the sanitized database SDB

1 Begin

2 Iml=item isin itemlist|itemsupportsaruler isin Rs3 for each item isin Iml do4 5 for each rule isin itemlist do6 7 N=computestepitemthresholds8 For each AIS isin Asset do9 10 node=recover item(AISitem)11 itemlistrulecost=Itemlistrulecost

+ComputeCost(nodeN)12 13 itemmaxcost=maxcost(item)14 15 Sort Iml by max cost and rule supported

16While(Rs 6= empty)do1718 Rss=r isin Rs|Iml[1] isin r19 Tss=t isin D|tfullysupportarule isin Rss20 while(Tss 6= empty)and(Rss 6= empty) do21 22 select a transaction from Tss23 Tss = Tssminus t24 Sanitize(tIml[1])25 Recompute Sup Conf(Rss)26 forallr isin Rss|(rsup lt minsup)Or

(rconf lt minconf) do27 28 Rss=Rss-r29 Rs=Rs-r30 31 32 Ilm=Ilm-Ilm[1]3334 End

Figure 315 Low Data Quality Impact Algorithm

40 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the Database D the Assetthe thresholdsOUTPUT A Sanitized Database PS

1Begin

2 Item=Item Occ Class(RsItemlist)3 Zitem=Multi Rule Zero Imp(RsItemlist)4 if (Zitem 6= empty) then MR zero Impact H(DRsItemZitem)5 if (Rs 6= empty) then MR low Impact H(DRsItem)End

Figure 316 The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has beento the best of our knowledge never addressed by previous approaches to thePPDM In this work we have explored the concepts related to DQ Moreover wehave identified the most suitable set of parameters that can be used to representthe DQ in the context of PPDM Previous approaches have addressed the PPDMproblem not considering the intrinsic meaning of the information stored in adatabase This lack is the main cause of the DQ problem In fact the DQ isstrongly related to the use of the data and then indirectly to the meaning of thedata Starting from this consideration we introduced a formal schema allowingus to represent the relevant information stored into a database This schema isable to magnify the relevance of each attribute contained into a set of high levelinformation the relationships among the attributes and the relevance of the DQparameters associated with the Information Schema Based on this schema weproposed a first strategy and two PPDM algorithms (distortion and blockingbased) with the aim of obtaining a sanitization which is better with respect toDQ We have then refined our strategy in order to hide simultaneously a set ofsensible rules On the basis of this new strategy we have proposed then a suiteof algorithms allowing us to build a most sophisticated PPDM Data QualityBased Algorithm

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

In order to hide a set of sensitive rules simultaneously, we adopt the following strategy:

1. For every item supporting a sensitive rule, we count the number of rules it supports (i.e. the occurrences of the items in the different rules).

2. We identify, given the sensitive set, the Global Set of Zero Impact Items.

3. We order the Zero Impact Set by the previously counted occurrences.

4. The sanitization starts by considering the Zero Impact Items that support the maximum number of rules.

5. Simultaneously, all the rules supported by the chosen item are sanitized and removed from the sensitive set when they result hidden.

6. If the Zero Impact Set is not sufficient to hide every sensitive rule, the DQ impact of the remaining items is computed.

7. From the resulting classification, if several items exist with the same impact, the one supporting the highest number of rules is chosen.

8. The process continues until all the rules are hidden.

From a methodological point of view, we can divide the complete strategy we propose into three main blocks:

• Pre-Hiding phase

• Multi Rules Zero Impact Hiding

• Multi Rule Low Impact Hiding

This strategy can be easily applied to the algorithms presented before. For this reason, in the remainder of this report we present the algorithms for each of these blocks, giving at the end a more general algorithm.


3.10 Pre-Hiding Phase

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the items supporting the rules, sorted by number of occurrences

1  Begin
2    Rsup = Rs
3    while (Rsup ≠ ∅) do
4    {
5      select r from Rsup
6      for (i = 0; i ≤ transaction.maxsize; i++) do
7        if (item[i] ∈ r) then
8        {
9          item[i].count++
10         item[i].list = item[i].list ∪ {r}
11       }
12     Rsup = Rsup − {r}
13   }
14   sort(item)
15   Return(item)
   End

Figure 3.12: The Item Occurrences classification algorithm

INPUT: the sensitive rules set Rs, the list of the database items
OUTPUT: the list of the Zero Impact Items

1  Begin
2    Rsup = Rs
3    Isup = list of all the items
4    Zitem = ∅
5    while (Rsup ≠ ∅) and (Isup ≠ ∅) do
6    {
7      select r from Rsup
8      for each (item ∈ r) do
9      {
10       if (item ∈ Isup) then
11         if (item ∉ Asset) then Zitem = Zitem ∪ {item}
12       Isup = Isup − {item}
13     }
14     Rsup = Rsup − {r}
15   }
16   return(Zitem)
   End

Figure 3.13: The Multi Rule Zero Impact Algorithm


This phase executes all the operations not specifically related to the hiding itself. More in detail, in this phase the items are classified according to the number of rules they support, and the Zero Impact Items are identified according to the same criteria. Figure 3.12 reports the algorithm used to count the occurrences of the items. This algorithm takes as input the set of sensitive rules and the list of all the different items defined for the target database. For every rule contained in the set, we check whether the rule is supported by each item contained in the item list. If this is the case, a counter associated with the item is incremented and an identifier of the rule is inserted into the list of the rules supported by that item. This information is used in one of the subsequent steps in order to optimize the computation. Once a rule has been checked against every item in the list, it is removed from the sensitive set. From a computational point of view, it is relevant to note that the entire process is executed a number of times equal to

|sensitive set| × |Item List|

Figure 3.13 reports the algorithm used to identify the Zero Impact Items for the sensitive rules set. It is quite similar to the one presented in the previous section; the difference is that it searches in a single pass for the Zero Impact Items of all the sensitive rules.
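To make the pre-hiding phase concrete, the following is a minimal Python sketch of the two classification steps; it is not the report's implementation. It assumes, for illustration only, that each rule is represented simply by the set of items it involves and that asset_items is the set of items appearing in the Asset schema.

```python
from collections import defaultdict

def item_occurrences(sensitive_rules, items):
    # Sketch of Figure 3.12: classify the items by the number of
    # sensitive rules they support, keeping the list of supported rules.
    count = defaultdict(int)       # item -> number of supported rules
    supported = defaultdict(list)  # item -> ids of the rules it supports
    for r_id, rule in enumerate(sensitive_rules):
        for item in items:
            if item in rule:
                count[item] += 1
                supported[item].append(r_id)
    # items sorted by occurrences, most frequent first
    ranking = sorted(count, key=count.get, reverse=True)
    return ranking, count, supported

def zero_impact_items(sensitive_rules, asset_items):
    # Sketch of Figure 3.13: an item is Zero Impact if it supports a
    # sensitive rule but does not appear in the Asset schema.
    zitems = set()
    for rule in sensitive_rules:
        for item in rule:
            if item not in asset_items:
                zitems.add(item)
    return zitems
```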

3.11 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two important pieces of information:

• the items supporting more than one rule and, more precisely, the exact number of rules supported by each of these items;

• the items supporting a rule of the sensitive rules set that have the important property of being without impact on the DQ of the database.

Using this relevant information, we are then able to start the sanitization process for those rules supported by Zero Impact Items. The strategy we adopt is as follows:

1. We extract from the list of items, in an ordered manner, the items already identified as Zero Impact (the Izm set).

2. For each item contained in this set, we build the set of the sensitive rules it supports (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).

3. We then select a transaction in Tss and we perform the sanitization on this transaction (by using either blocking or distortion).


4. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules turns out to be hidden, it is removed from the Rss set and from the Rs set.

5. These operations are repeated for every item contained in the Izm set.

The result of this approach can be:

• a sanitized database with an optimal DQ: this is the case in which all sensitive rules are supported by Zero Impact Items;

• a partially sanitized database with an optimal DQ: this is the case in which the sensitive set has a subset of rules not supported by Zero Impact Items.

Obviously, in a real case the first result rarely happens. For this reason, in the next section we present the last step of our strategy, which allows us to obtain in any case a completely sanitized database.


INPUT: the sensitive rules set Rs, the list of the database items (with the rule support), the list of Zero Impact Items, the database D
OUTPUT: a partially sanitized database PS

1  Begin
2    Izm = {item ∈ Itemlist | item ∈ Zitem}
3    for (i = 0; i ≤ |Izm|; i++) do
4    {
5      Rss = {r ∈ Rs | Izm[i] ∈ r}
6      Tss = {t ∈ D | t fully supports a rule ∈ Rss}
7      while (Tss ≠ ∅) and (Rss ≠ ∅) do
8      {
9        select a transaction t from Tss
10       Tss = Tss − {t}
11       Sanitize(t, Izm[i])
12       Recompute_Sup_Conf(Rss)
13       for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
14       {
15         Rss = Rss − {r}
16         Rs = Rs − {r}
17       }
     }
   }
   End

Figure 3.14: The Multi Rules Zero Impact Hiding Algorithm
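As an illustration, here is a minimal Python sketch of the loop of Figure 3.14. It is only a sketch under strong assumptions: rules are modelled as frozensets of items, transactions as mutable sets inside db, and sanitize, support and confidence are hypothetical helpers (a real implementation would distinguish the antecedent and consequent of a rule when computing confidence).

```python
def multi_rule_zero_impact_hiding(db, rs, zitems, min_sup, min_conf,
                                  sanitize, support, confidence):
    # Sketch of Figure 3.14: hide sensitive rules by touching only
    # Zero Impact Items, so the DQ of the database is unaffected.
    for item in zitems:
        rss = [r for r in rs if item in r]                 # rules supported by item
        tss = [t for t in db if any(r <= t for r in rss)]  # fully supporting transactions
        while tss and rss:
            t = tss.pop()
            sanitize(t, item)  # blocking or distortion, treated as a black box
            for r in list(rss):
                # a rule is hidden once it drops below the thresholds
                if support(db, r) < min_sup or confidence(db, r) < min_conf:
                    rss.remove(r)
                    rs.remove(r)
    return db  # possibly only partially sanitized
```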

3.12 Multi Rule Low Impact Algorithm

In this section we consider the case in which we need to sanitize a database whose sensitive rules set is not supported by Zero Impact Items. In this case we have two possibilities: to use the algorithm presented in the data quality section, which has the advantage of being very efficient, or, if we prefer to preserve as much as possible a good DQ, to extend the concept of the Multi-Rule approach to this particular case. The strategy we propose here can be summarized as follows:

1. We extract from the ordered list of items only the items supporting the rules contained in the sensitive rules set. Each item carries the list of the rules it supports.

2. For each item, and for each rule contained in its list of supported rules, the estimated DQ impact is computed. The maximum estimated Data Quality impact is then associated with the item.

3. The items are sorted by maximum impact and by number of supported rules.

4. For each item contained in the new ordered set, we build the set of the sensitive rules supported (Rss) and the set of transactions supporting one of the rules contained in Rss (Tss).


5. We then select a transaction in Tss and we perform the sanitization on this transaction (blocking or distortion).

6. We recompute the support and the confidence values for the rules contained in the Rss set and, if one of the rules turns out to be hidden, it is removed from the Rss set and from the Rs set.

7. These operations are repeated for every item contained in the ordered set, until all rules are sanitized.

The resulting database will be completely sanitized and the damage to data quality will be mitigated. Figure 3.15 reports the algorithm. As can be noted, in the algorithm we do not specify whether the sanitization is based on blocking or on distortion: we consider this operation as a black box. The proposed strategy is completely independent of the type of sanitization used; in fact, it is possible to substitute the code of the blocking algorithm with the code of the distortion algorithm without compromising the DQ properties we want to preserve. Figure 3.16 reports the global algorithm formalizing our strategy. More in detail, taking as input the Asset description, the group of rules to be sanitized, the database and the thresholds under which a rule is considered hidden, the algorithm computes the Zero Impact set and the Item Occurrences rank. It then tries to hide the rules by modifying the Zero Impact Items and finally, if there exist other rules to be hidden, the Low Impact Algorithm is invoked.

INPUT: the Asset schema A associated to the target database, the sensitive rules set Rs to be hidden, the thresholds Th, the list of items
OUTPUT: the sanitized database SDB

1  Begin
2    Iml = {item ∈ itemlist | item supports a rule r ∈ Rs}
3    for each item ∈ Iml do
4    {
5      for each rule ∈ item.list do
6      {
7        N = compute_step(item, thresholds)
8        for each AIS ∈ Asset do
9        {
10         node = recover_item(AIS, item)
11         itemlist.rule.cost = itemlist.rule.cost + ComputeCost(node, N)
12       }
13       item.maxcost = maxcost(item)
14     }
15     Sort Iml by max cost and rules supported
16   while (Rs ≠ ∅) do
17   {
18     Rss = {r ∈ Rs | Iml[1] ∈ r}
19     Tss = {t ∈ D | t fully supports a rule ∈ Rss}
20     while (Tss ≠ ∅) and (Rss ≠ ∅) do
21     {
22       select a transaction t from Tss
23       Tss = Tss − {t}
24       Sanitize(t, Iml[1])
25       Recompute_Sup_Conf(Rss)
26       for all r ∈ Rss | (r.sup < minsup) or (r.conf < minconf) do
27       {
28         Rss = Rss − {r}
29         Rs = Rs − {r}
30       }
31     }
32     Iml = Iml − Iml[1]
33   }
34   End

Figure 3.15: Low Data Quality Impact Algorithm
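The following Python sketch mirrors the structure of Figure 3.15 under the same assumptions as the previous sketch; dq_impact is a hypothetical helper estimating, from the Asset schema, the worst-case data quality cost of sanitizing an item.

```python
def multi_rule_low_impact_hiding(db, rs, items, asset, min_sup, min_conf,
                                 dq_impact, sanitize, support, confidence):
    # Sketch of Figure 3.15: rank the candidate items by their maximum
    # estimated DQ impact, then sanitize starting from the cheapest one.
    iml = [i for i in items if any(i in r for r in rs)]
    max_cost = {i: dq_impact(asset, i) for i in iml}
    n_rules = {i: sum(1 for r in rs if i in r) for i in iml}
    # lowest impact first; ties broken by number of supported rules
    iml.sort(key=lambda i: (max_cost[i], -n_rules[i]))
    while rs and iml:
        item = iml.pop(0)
        rss = [r for r in rs if item in r]
        tss = [t for t in db if any(r <= t for r in rss)]
        while tss and rss:
            t = tss.pop()
            sanitize(t, item)
            for r in list(rss):
                if support(db, r) < min_sup or confidence(db, r) < min_conf:
                    rss.remove(r)
                    rs.remove(r)
    return db  # completely sanitized, with mitigated DQ damage
```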


INPUT: the sensitive rules set Rs, the database D, the Asset, the thresholds
OUTPUT: a sanitized database PS

1  Begin
2    Item = Item_Occ_Class(Rs, Itemlist)
3    Zitem = Multi_Rule_Zero_Imp(Rs, Itemlist)
4    if (Zitem ≠ ∅) then MR_zero_Impact_H(D, Rs, Item, Zitem)
5    if (Rs ≠ ∅) then MR_low_Impact_H(D, Rs, Item)
   End

Figure 3.16: The General Algorithm
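Putting the pieces together, a possible driver corresponding to Figure 3.16 could look as follows; it simply composes the sketches given above and inherits all their assumptions (rule and transaction representations, and the sanitize, support, confidence and dq_impact helpers).

```python
def data_quality_based_sanitization(db, rs, items, asset, min_sup, min_conf,
                                    dq_impact, sanitize, support, confidence):
    # Sketch of Figure 3.16: first hide what can be hidden at zero DQ
    # cost, then fall back to the low impact algorithm for the rest.
    ranking, _, _ = item_occurrences(rs, items)
    zitems = zero_impact_items(rs, asset)
    if zitems:
        multi_rule_zero_impact_hiding(db, rs, zitems, min_sup, min_conf,
                                      sanitize, support, confidence)
    if rs:  # some sensitive rules are still visible
        multi_rule_low_impact_hiding(db, rs, ranking, asset, min_sup,
                                     min_conf, dq_impact, sanitize,
                                     support, confidence)
    return db
```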

Chapter 4

Conclusions

The problem of the impact of sanitization on the quality of the database has, to the best of our knowledge, never been addressed by previous approaches to PPDM. In this work we have explored the concepts related to DQ and identified the most suitable set of parameters that can be used to represent DQ in the context of PPDM. Previous approaches have addressed the PPDM problem without considering the intrinsic meaning of the information stored in a database. This lack is the main cause of the DQ problem: DQ is strongly related to the use of the data and thus, indirectly, to the meaning of the data. Starting from this consideration, we introduced a formal schema allowing us to represent the relevant information stored in a database. This schema is able to magnify the relevance of each attribute contained in a set of high level information, the relationships among the attributes, and the relevance of the DQ parameters associated with the Information Schema. Based on this schema, we proposed a first strategy and two PPDM algorithms (distortion based and blocking based) with the aim of obtaining a sanitization that behaves better with respect to DQ. We then refined our strategy in order to hide simultaneously a set of sensitive rules. On the basis of this new strategy, we proposed a suite of algorithms allowing us to build a more sophisticated, Data Quality based PPDM algorithm.


Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Volume 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input, Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, volume 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 8 (6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16 (7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Volume 23, Number 7, pp. 423-437(15), 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes and L. Zayatz eds., North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] P. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., vol. 4, number 2, pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, v. I. Addison-Wesley Publishing Company, Reading, Massachusetts, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19 (2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48 (3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions On Knowledge And Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision models for data disclosure limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf

[97] University of Milan - Computer Technology Institute - Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission

EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents; you can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 37: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

310 PRE-HIDING PHASE 35

310 Pre-Hiding Phase

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the items supporting the rules sorted by occurrences number

1Begin

2 Rsup=Rs3 while (Rsup 6= empty) do4 5 select r from Rsup6 for (i = 0 i + + i le transactionmaxsize) do7 if (item[i] isin r) then8 9 item[i]count++10 item[i]list=item[i]listcupr11 12 Rsup=Rsupminus r1314sort(item)15Return(item) End

Figure 312 The Item Occurrences classification algorithm

INPUT The sensitive Rules set Rs the list of the database itemsOUTPUT The list of the Zero Impact Items

1Begin

2 Rsup=Rs3 Isup=list of all the items4 Zitem = empty5 while (Rsup 6= empty) and (Isup 6= empty) do6 7 select r from Rsup8 for each (item isin r) do9 10 if (item isin Isup) then11 if (item isin Asset) then Zitem = Zitem + item12 Isup = Isupminus item13 14 Rsup = Rsupminus r1516return(Zitem)End

Figure 313 The Multi Rule Zero Impact Algorithm

36 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

This phase executes all the operations not specifically related to the hidingoperations More in details in this phase the items are classified according tothe number of supported rules Moreover the Zero Impact Items are classifiedaccording to the same criteria Figure 312 reports the algorithm used to discoverthe occurrences of the items This algorithm takes as input the set of sensitiverules and the list of all the different items defined for the target database Forevery rule contained in the set we check if the rule is supported by every itemcontained in the Item list If this is the case a counter associated with the itemis incremented and an identifier associated with the rule is inserted into the listof the rules supported by a certain item This information is used in one of thesubsequent steps in order to optimize the computation Once a rule has beenchecked with every item in the list it is removed from the sensitive set Froma computational point of view it is relevant to note that the entire process isexecuted a number of times equal to

|sensitive set| lowast |Item List|

Figure 313 reports the algorithm used to identify the Zero Impact Items forthe sensitive Rules set It is quite similar to the one presented in the previoussection the difference is just that it searches in a burst the Zero-Impact itemsfor all the sensitive rules

311 Multi Rules Zero Impact Hiding

The application of the two algorithms presented before returns two importantinformation

bull The items supporting more than a rule and more in details the exactnumber of rules supported by these Items

bull The items supporting a rule isin sensitive rules set that have the impor-tant property of being without impact on the DQ of the database

Using these relevant information we are then able to start the sanitizationprocess for those rules supported by Zero Impact Items The strategy we adoptis thus as follows

1 We extract from the list of Items in an ordered manner the items alreadyidentified as Zero Impact (the Izm set)

2 For each item contained in the new set we build the set of the sensitiverules supported (Rss) and the set of transactions supporting one of therules contained in Rss (Tss)

3 We then select a transaction in Tss and we perform the sanitization onthis transaction (by using either blocking or distortion)

311 MULTI RULES ZERO IMPACT HIDING 37

4 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

5 These operations are repeated for every item contained in the Izm set

The result of this approach can be

bull A sanitized Data Base with an optimal DQ this is the case inwhich all sensitive rules are supported by Zero Impact Items

bull A partially Sanitized Data Base with an optimal DQ this is thecase in which the sensitive set has a subset of rules not supported byZero Impact Items

Obviously in a real case the first result rarely happens For this reason in thiswe present the last step of our strategy that allows us to obtain in any case acompletely sanitized database

38 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the list of the database items (with the Rule support)the list of Zero Impact Items the Database DOUTPUT A partially Sanitized Database PS

1Begin

2 Izm=item isin Itemlist|item isin Zitem3 for (i = 0 i + + i = |Izm|) do4 5 Rss=r isin Rs|Izm[i] isin r6 Tss=t isin D|tfullysupportarule isin Rss7 while(Tss 6= empty)and(Rss 6= empty) do8 9 select a transaction from Tss10 Tss = Tssminus t11 Sanitize(tIzm[i])12 Recompute Sup Conf(Rss)13 forallr isin Rss|(rsup lt minsup)Or(rconf lt minconf) do14 15 Rss=Rss-r16 Rs=Rs-r17 End

Figure 314 The Multi Rules Zero Impact Hiding Algorithm

312 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a databasehaving a sensitive Rules set not supported by Zero Impact Items In this casewe have two possibilities to use the Algorithm presented in the data qualitysection that has the advantage of being very efficient or if we prefer to maintainas much as possible a good DQ to extend the concept of Multi-Rule approachto this particular case The strategy we propose here can be summarized asfollows

1 We extract from the ordered list of items only the items supporting therules contained in the sensitive Rules set Each item contains the list ofthe supported rules

2 For each item and for each rule contained in the list of the supportedrules the estimated DQ impact is computedThe Maximum Data QualityImpact estimated is associated to every item

3 The Items are sorted by Max Impact and number of supported rules

4 For each item contained in the new ordered set we build the set of thesensitive rules supported (Rss) and the set of transactions supporting oneof the rules contained in Rss (Tss)

312 MULTI RULE LOW IMPACT ALGORITHM 39

5 We then select a transaction in Tss and we perform the sanitization onthis transaction (blocking or distortion)

6 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

7 These operations are repeated for every item contained in the Izm setuntil all rules are sanitized

The resulting database will be Completely Sanitized and the damagerelated to the data quality will be mitigated Figure 315 reports the algorithmAs it is possible to note in the algorithm we do not specify if the sanitizationis based on blocking or distortion We consider this operation as a black boxThe proposed strategy is completely independent from the type of sanitizationused In fact it is possible to substitute the code for blocking algorithm withthe code for distortion algorithm without compromising the DQ properties wewant to preserve with our algorithm Figure 316 reports the global algorithmformalizing our strategy More in details taking as input to this algorithmthe Asset description the group of rules to be sanitized the database and thethresholds under which consider hidden a rule the algorithm computes theZero-Impact set and the Item-Occurrences rank Then it will try to hide therules modifying the Zero Impact Items and finally if there exist other rules tobe hidden the Low Impact Algorithm is invoked

INPUT the Asset Schema A associatedto the target databasethe sensitive Rules set Rsto be hidden the Thresholds Th the list of itemsOUTPUT the sanitized database SDB

1 Begin

2 Iml=item isin itemlist|itemsupportsaruler isin Rs3 for each item isin Iml do4 5 for each rule isin itemlist do6 7 N=computestepitemthresholds8 For each AIS isin Asset do9 10 node=recover item(AISitem)11 itemlistrulecost=Itemlistrulecost

+ComputeCost(nodeN)12 13 itemmaxcost=maxcost(item)14 15 Sort Iml by max cost and rule supported

16While(Rs 6= empty)do1718 Rss=r isin Rs|Iml[1] isin r19 Tss=t isin D|tfullysupportarule isin Rss20 while(Tss 6= empty)and(Rss 6= empty) do21 22 select a transaction from Tss23 Tss = Tssminus t24 Sanitize(tIml[1])25 Recompute Sup Conf(Rss)26 forallr isin Rss|(rsup lt minsup)Or

(rconf lt minconf) do27 28 Rss=Rss-r29 Rs=Rs-r30 31 32 Ilm=Ilm-Ilm[1]3334 End

Figure 315 Low Data Quality Impact Algorithm

40 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the Database D the Assetthe thresholdsOUTPUT A Sanitized Database PS

1Begin

2 Item=Item Occ Class(RsItemlist)3 Zitem=Multi Rule Zero Imp(RsItemlist)4 if (Zitem 6= empty) then MR zero Impact H(DRsItemZitem)5 if (Rs 6= empty) then MR low Impact H(DRsItem)End

Figure 316 The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has beento the best of our knowledge never addressed by previous approaches to thePPDM In this work we have explored the concepts related to DQ Moreover wehave identified the most suitable set of parameters that can be used to representthe DQ in the context of PPDM Previous approaches have addressed the PPDMproblem not considering the intrinsic meaning of the information stored in adatabase This lack is the main cause of the DQ problem In fact the DQ isstrongly related to the use of the data and then indirectly to the meaning of thedata Starting from this consideration we introduced a formal schema allowingus to represent the relevant information stored into a database This schema isable to magnify the relevance of each attribute contained into a set of high levelinformation the relationships among the attributes and the relevance of the DQparameters associated with the Information Schema Based on this schema weproposed a first strategy and two PPDM algorithms (distortion and blockingbased) with the aim of obtaining a sanitization which is better with respect toDQ We have then refined our strategy in order to hide simultaneously a set ofsensible rules On the basis of this new strategy we have proposed then a suiteof algorithms allowing us to build a most sophisticated PPDM Data QualityBased Algorithm

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer Verlag, New York, 2000.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. In Proceedings of the 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission

EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen

Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the sanitization they introduce may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

311 MULTI RULES ZERO IMPACT HIDING 37

4 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

5 These operations are repeated for every item contained in the Izm set

The result of this approach can be

bull A sanitized Data Base with an optimal DQ this is the case inwhich all sensitive rules are supported by Zero Impact Items

bull A partially Sanitized Data Base with an optimal DQ this is thecase in which the sensitive set has a subset of rules not supported byZero Impact Items

Obviously in a real case the first result rarely happens For this reason in thiswe present the last step of our strategy that allows us to obtain in any case acompletely sanitized database

38 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the list of the database items (with the Rule support)the list of Zero Impact Items the Database DOUTPUT A partially Sanitized Database PS

1Begin

2 Izm=item isin Itemlist|item isin Zitem3 for (i = 0 i + + i = |Izm|) do4 5 Rss=r isin Rs|Izm[i] isin r6 Tss=t isin D|tfullysupportarule isin Rss7 while(Tss 6= empty)and(Rss 6= empty) do8 9 select a transaction from Tss10 Tss = Tssminus t11 Sanitize(tIzm[i])12 Recompute Sup Conf(Rss)13 forallr isin Rss|(rsup lt minsup)Or(rconf lt minconf) do14 15 Rss=Rss-r16 Rs=Rs-r17 End

Figure 314 The Multi Rules Zero Impact Hiding Algorithm

312 Multi Rule Low Impact algorithm

In this section we consider the case in which we need to sanitize a databasehaving a sensitive Rules set not supported by Zero Impact Items In this casewe have two possibilities to use the Algorithm presented in the data qualitysection that has the advantage of being very efficient or if we prefer to maintainas much as possible a good DQ to extend the concept of Multi-Rule approachto this particular case The strategy we propose here can be summarized asfollows

1 We extract from the ordered list of items only the items supporting therules contained in the sensitive Rules set Each item contains the list ofthe supported rules

2 For each item and for each rule contained in the list of the supportedrules the estimated DQ impact is computedThe Maximum Data QualityImpact estimated is associated to every item

3 The Items are sorted by Max Impact and number of supported rules

4 For each item contained in the new ordered set we build the set of thesensitive rules supported (Rss) and the set of transactions supporting oneof the rules contained in Rss (Tss)

312 MULTI RULE LOW IMPACT ALGORITHM 39

5 We then select a transaction in Tss and we perform the sanitization onthis transaction (blocking or distortion)

6 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

7 These operations are repeated for every item contained in the Izm setuntil all rules are sanitized

The resulting database will be Completely Sanitized and the damagerelated to the data quality will be mitigated Figure 315 reports the algorithmAs it is possible to note in the algorithm we do not specify if the sanitizationis based on blocking or distortion We consider this operation as a black boxThe proposed strategy is completely independent from the type of sanitizationused In fact it is possible to substitute the code for blocking algorithm withthe code for distortion algorithm without compromising the DQ properties wewant to preserve with our algorithm Figure 316 reports the global algorithmformalizing our strategy More in details taking as input to this algorithmthe Asset description the group of rules to be sanitized the database and thethresholds under which consider hidden a rule the algorithm computes theZero-Impact set and the Item-Occurrences rank Then it will try to hide therules modifying the Zero Impact Items and finally if there exist other rules tobe hidden the Low Impact Algorithm is invoked

INPUT the Asset Schema A associatedto the target databasethe sensitive Rules set Rsto be hidden the Thresholds Th the list of itemsOUTPUT the sanitized database SDB

1 Begin

2 Iml=item isin itemlist|itemsupportsaruler isin Rs3 for each item isin Iml do4 5 for each rule isin itemlist do6 7 N=computestepitemthresholds8 For each AIS isin Asset do9 10 node=recover item(AISitem)11 itemlistrulecost=Itemlistrulecost

+ComputeCost(nodeN)12 13 itemmaxcost=maxcost(item)14 15 Sort Iml by max cost and rule supported

16While(Rs 6= empty)do1718 Rss=r isin Rs|Iml[1] isin r19 Tss=t isin D|tfullysupportarule isin Rss20 while(Tss 6= empty)and(Rss 6= empty) do21 22 select a transaction from Tss23 Tss = Tssminus t24 Sanitize(tIml[1])25 Recompute Sup Conf(Rss)26 forallr isin Rss|(rsup lt minsup)Or

(rconf lt minconf) do27 28 Rss=Rss-r29 Rs=Rs-r30 31 32 Ilm=Ilm-Ilm[1]3334 End

Figure 315 Low Data Quality Impact Algorithm

40 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the Database D the Assetthe thresholdsOUTPUT A Sanitized Database PS

1Begin

2 Item=Item Occ Class(RsItemlist)3 Zitem=Multi Rule Zero Imp(RsItemlist)4 if (Zitem 6= empty) then MR zero Impact H(DRsItemZitem)5 if (Rs 6= empty) then MR low Impact H(DRsItem)End

Figure 316 The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has beento the best of our knowledge never addressed by previous approaches to thePPDM In this work we have explored the concepts related to DQ Moreover wehave identified the most suitable set of parameters that can be used to representthe DQ in the context of PPDM Previous approaches have addressed the PPDMproblem not considering the intrinsic meaning of the information stored in adatabase This lack is the main cause of the DQ problem In fact the DQ isstrongly related to the use of the data and then indirectly to the meaning of thedata Starting from this consideration we introduced a formal schema allowingus to represent the relevant information stored into a database This schema isable to magnify the relevance of each attribute contained into a set of high levelinformation the relationships among the attributes and the relevance of the DQparameters associated with the Information Schema Based on this schema weproposed a first strategy and two PPDM algorithms (distortion and blockingbased) with the aim of obtaining a sanitization which is better with respect toDQ We have then refined our strategy in order to hide simultaneously a set ofsensible rules On the basis of this new strategy we have proposed then a suiteof algorithms allowing us to build a most sophisticated PPDM Data QualityBased Algorithm

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer Verlag, New York, 2001.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g., association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.



41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 41: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

312 MULTI RULE LOW IMPACT ALGORITHM 39

5 We then select a transaction in Tss and we perform the sanitization onthis transaction (blocking or distortion)

6 We recompute the support and the confidence value for the rules containedin the Rss set and if one of the rule results hidden it is removed from theRss set and from the Rs set

7 These operations are repeated for every item contained in the Izm setuntil all rules are sanitized

The resulting database will be Completely Sanitized and the damagerelated to the data quality will be mitigated Figure 315 reports the algorithmAs it is possible to note in the algorithm we do not specify if the sanitizationis based on blocking or distortion We consider this operation as a black boxThe proposed strategy is completely independent from the type of sanitizationused In fact it is possible to substitute the code for blocking algorithm withthe code for distortion algorithm without compromising the DQ properties wewant to preserve with our algorithm Figure 316 reports the global algorithmformalizing our strategy More in details taking as input to this algorithmthe Asset description the group of rules to be sanitized the database and thethresholds under which consider hidden a rule the algorithm computes theZero-Impact set and the Item-Occurrences rank Then it will try to hide therules modifying the Zero Impact Items and finally if there exist other rules tobe hidden the Low Impact Algorithm is invoked

INPUT the Asset Schema A associatedto the target databasethe sensitive Rules set Rsto be hidden the Thresholds Th the list of itemsOUTPUT the sanitized database SDB

1 Begin

2 Iml=item isin itemlist|itemsupportsaruler isin Rs3 for each item isin Iml do4 5 for each rule isin itemlist do6 7 N=computestepitemthresholds8 For each AIS isin Asset do9 10 node=recover item(AISitem)11 itemlistrulecost=Itemlistrulecost

+ComputeCost(nodeN)12 13 itemmaxcost=maxcost(item)14 15 Sort Iml by max cost and rule supported

16While(Rs 6= empty)do1718 Rss=r isin Rs|Iml[1] isin r19 Tss=t isin D|tfullysupportarule isin Rss20 while(Tss 6= empty)and(Rss 6= empty) do21 22 select a transaction from Tss23 Tss = Tssminus t24 Sanitize(tIml[1])25 Recompute Sup Conf(Rss)26 forallr isin Rss|(rsup lt minsup)Or

(rconf lt minconf) do27 28 Rss=Rss-r29 Rs=Rs-r30 31 32 Ilm=Ilm-Ilm[1]3334 End

Figure 315 Low Data Quality Impact Algorithm

40 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the Database D the Assetthe thresholdsOUTPUT A Sanitized Database PS

1Begin

2 Item=Item Occ Class(RsItemlist)3 Zitem=Multi Rule Zero Imp(RsItemlist)4 if (Zitem 6= empty) then MR zero Impact H(DRsItemZitem)5 if (Rs 6= empty) then MR low Impact H(DRsItem)End

Figure 316 The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has beento the best of our knowledge never addressed by previous approaches to thePPDM In this work we have explored the concepts related to DQ Moreover wehave identified the most suitable set of parameters that can be used to representthe DQ in the context of PPDM Previous approaches have addressed the PPDMproblem not considering the intrinsic meaning of the information stored in adatabase This lack is the main cause of the DQ problem In fact the DQ isstrongly related to the use of the data and then indirectly to the meaning of thedata Starting from this consideration we introduced a formal schema allowingus to represent the relevant information stored into a database This schema isable to magnify the relevance of each attribute contained into a set of high levelinformation the relationships among the attributes and the relevance of the DQparameters associated with the Information Schema Based on this schema weproposed a first strategy and two PPDM algorithms (distortion and blockingbased) with the aim of obtaining a sanitization which is better with respect toDQ We have then refined our strategy in order to hide simultaneously a set ofsensible rules On the basis of this new strategy we have proposed then a suiteof algorithms allowing us to build a most sophisticated PPDM Data QualityBased Algorithm

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 42: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

40 CHAPTER 3 DATA QUALITY BASED ALGORITHMS

INPUT The sensitive Rules set Rs the Database D the Assetthe thresholdsOUTPUT A Sanitized Database PS

1Begin

2 Item=Item Occ Class(RsItemlist)3 Zitem=Multi Rule Zero Imp(RsItemlist)4 if (Zitem 6= empty) then MR zero Impact H(DRsItemZitem)5 if (Rs 6= empty) then MR low Impact H(DRsItem)End

Figure 316 The General Algorithm

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has beento the best of our knowledge never addressed by previous approaches to thePPDM In this work we have explored the concepts related to DQ Moreover wehave identified the most suitable set of parameters that can be used to representthe DQ in the context of PPDM Previous approaches have addressed the PPDMproblem not considering the intrinsic meaning of the information stored in adatabase This lack is the main cause of the DQ problem In fact the DQ isstrongly related to the use of the data and then indirectly to the meaning of thedata Starting from this consideration we introduced a formal schema allowingus to represent the relevant information stored into a database This schema isable to magnify the relevance of each attribute contained into a set of high levelinformation the relationships among the attributes and the relevance of the DQparameters associated with the Information Schema Based on this schema weproposed a first strategy and two PPDM algorithms (distortion and blockingbased) with the aim of obtaining a sanitization which is better with respect toDQ We have then refined our strategy in order to hide simultaneously a set ofsensible rules On the basis of this new strategy we have proposed then a suiteof algorithms allowing us to build a most sophisticated PPDM Data QualityBased Algorithm

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, Vol. 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input, Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, Vol. 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Transactions on Information Theory, Vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8(6) (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6(1) (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75(370) (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5(3), pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Vol. 23, No. 7, pp. 423-437, 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes and L. Zayatz (eds.), North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In Crypto 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38(3) (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., Vol. 4(2), pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. ACM Press.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, Vol. I. Addison-Wesley Publishing Company, Reading, Massachusetts, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11(7), pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (Jim Gray, Series Editor). Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Vol. 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, Vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4) (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, Vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, Vol. 4(4), pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October 1948), pp. 379-423 and pp. 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An information theoretic Approach to Rule Induction from databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 3(4), pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, Vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. CODMINE, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN - Joint Research Centre - Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities
2008 - 56 pp. - 21 x 29.7 cm
EUR - Scientific and Technical Research series - ISSN 1018-5593

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications: Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 43: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

Chapter 4

Conclusions

The problem of the sanitization impact on the quality of the database has beento the best of our knowledge never addressed by previous approaches to thePPDM In this work we have explored the concepts related to DQ Moreover wehave identified the most suitable set of parameters that can be used to representthe DQ in the context of PPDM Previous approaches have addressed the PPDMproblem not considering the intrinsic meaning of the information stored in adatabase This lack is the main cause of the DQ problem In fact the DQ isstrongly related to the use of the data and then indirectly to the meaning of thedata Starting from this consideration we introduced a formal schema allowingus to represent the relevant information stored into a database This schema isable to magnify the relevance of each attribute contained into a set of high levelinformation the relationships among the attributes and the relevance of the DQparameters associated with the Information Schema Based on this schema weproposed a first strategy and two PPDM algorithms (distortion and blockingbased) with the aim of obtaining a sanitization which is better with respect toDQ We have then refined our strategy in order to hide simultaneously a set ofsensible rules On the basis of this new strategy we have proposed then a suiteof algorithms allowing us to build a most sophisticated PPDM Data QualityBased Algorithm

41

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 44: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

42 CHAPTER 4 CONCLUSIONS

Bibliography

[1] N Adam and J Worthmann Security-control methods for statistical data-bases a comparative study ACM Comput Surv Volume 21(4) pp515-556 year 1989 ACM Press

[2] D Agrawal and C C Aggarwal On the Design and Quantification ofPrivacy Preserving Data Mining Algorithms In Proceedings of the 20thACM Symposium on Principle of Database System pp 247-255 year2001 ACM Press

[3] R Agrawal T Imielinski and A Swami Mining Association Rules be-tween Sets of Items in Large Databases Proceedings of ACM SIGMODpp 207-216 May 1993 ACM Press

[4] R Agrawal and R Srikant Privacy Preserving Data Mining In Pro-ceedings of the ACM SIGMOD Conference of Management of Data pp439-450 year 2000 ACM Press

[5] R Agrawal and R Srikant Fast algorithms for mining association rulesIn Proceeding of the 20th International Conference on Very Large Data-bases Santiago Chile June 1994 Morgan Kaufmann

[6] G M AMDAHL Validity of the Single-Processor Approach to AchievingLarge Scale Computing Capabilities AFIPS Conference Proceedings(April1967)pp 483-485 Morgan Kaufmann Publishers Inc

[7] M J Atallah E Bertino A K Elmagarmid M Ibrahim and V SVerykios Disclosure Limitation of Sensitive Rules In Proceedings of theIEEE Knolwedge and Data Engineering Workshop pp 45-52 year 1999IEEE Computer Society

[8] D P Ballou H L Pazer Modelling Data and Process Quality in MultiInput Multi Output Information Systems Management science Vol 31Issue 2 pp 150-162 (1985)

[9] Y Bar-Hillel An examination of information theory Philosophy of Sci-ence volume 22 pp86-105 year 1955

43

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

Bibliography

[1] N. Adam and J. Worthmann. Security-control methods for statistical databases: a comparative study. ACM Comput. Surv., Vol. 21(4), pp. 515-556, 1989. ACM Press.

[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, pp. 247-255, 2001. ACM Press.

[3] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pp. 207-216, May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000. ACM Press.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994. Morgan Kaufmann.

[6] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (April 1967), pp. 483-485. Morgan Kaufmann Publishers Inc.

[7] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim and V. S. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, pp. 45-52, 1999. IEEE Computer Society.

[8] D. P. Ballou and H. L. Pazer. Modelling Data and Process Quality in Multi Input Multi Output Information Systems. Management Science, Vol. 31, Issue 2, pp. 150-162, 1985.

[9] Y. Bar-Hillel. An examination of information theory. Philosophy of Science, Vol. 22, pp. 86-105, 1955.

[10] E. Bertino, I. Nai Fovino and L. Parasiliti Provenza. A Framework for Evaluating Privacy Preserving Data Mining Algorithms. Data Mining and Knowledge Discovery Journal, 2005. Kluwer.

[11] E. Bertino and I. Nai Fovino. Information Driven Evaluation of Data Hiding Algorithms. 7th International Conference on Data Warehousing and Knowledge Discovery, Copenhagen, August 2005. Springer-Verlag.

[12] N. M. Blachman. The amount of information that y gives about X. IEEE Trans. Inform. Theory, Vol. IT-14, pp. 27-31, Jan. 1968. IEEE Press.

[13] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[14] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1997. ACM Press.

[15] L. Chang and I. S. Moskowitz. Parsimonious downgrading and decision trees applied to the inference problem. In Proceedings of the 1998 New Security Paradigms Workshop, pp. 82-89, 1998. ACM Press.

[16] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.

[17] M. S. Chen, J. Han and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8(6), pp. 866-883, 1996. IEEE Educational Activities Department.

[18] F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng., SE-8, 6 (Apr.), pp. 574-582, 1982. IEEE Press.

[19] F. Y. Chin and G. Ozsoyoglu. Statistical database design. ACM Trans. Database Syst., 6, 1 (Mar.), pp. 113-139, 1981. ACM Press.

[20] L. H. Cox. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc., 75, 370 (June), pp. 377-385, 1980.

[21] E. Dasseni, V. S. Verykios, A. K. Elmagarmid and E. Bertino. Hiding Association Rules by using Confidence and Support. In Proceedings of the 4th Information Hiding Workshop, pp. 369-383, 2001. Springer-Verlag.

[22] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20, pp. 364-366, 1977.

[23] D. E. Denning and J. Schlorer. Inference control for statistical databases. Computer, 16(7), pp. 69-82, July 1983. IEEE Press.

[24] D. Denning. Secure statistical databases with random sample queries. ACM TODS, 5, 3, pp. 291-315, 1980.

[25] D. E. Denning. Cryptography and Data Security. Addison-Wesley, Reading, Mass., 1982.

[26] V. Dhar. Data Mining in finance: using counterfactuals to generate knowledge from organizational information systems. Information Systems, Vol. 23, No. 7, pp. 423-437, 1998.

[27] J. Domingo-Ferrer and V. Torra. A Quantitative Comparison of Disclosure Control Methods for Microdata. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113-134, P. Doyle, J. Lane, J. Theeuwes and L. Zayatz (eds.), North-Holland, 2002.

[28] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the Thirteenth International Conference on Machine Learning, pp. 105-112, San Francisco, CA, 1996. Morgan Kaufmann.

[29] P. Drucker. Beyond the Information Revolution. The Atlantic Monthly, 1999.

[30] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[31] G. T. Duncan, S. A. Keller-McNulty and S. L. Stokes. Disclosure risks vs. data utility: The R-U confidentiality map. Tech. Rep. No. 121, National Institute of Statistical Sciences, 2001.

[32] C. Dwork and K. Nissim. Privacy preserving data mining in vertically partitioned databases. In CRYPTO 2004, Vol. 3152, pp. 528-544.

[33] D. L. Eager, J. Zahorjan and E. D. Lazowska. Speedup Versus Efficiency in Parallel Systems. IEEE Trans. on Computers, C-38, 3 (March 1989), pp. 408-423. IEEE Press.

[34] L. Ertoz, M. Steinbach and V. Kumar. Finding clusters of different sizes, shapes and densities in noisy high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[35] M. Ester, H. P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, pp. 226-231, Portland, Oregon, 1996. AAAI Press.

[36] A. Evfimievski. Randomization in Privacy Preserving Data Mining. SIGKDD Explor. Newsl., Vol. 4, No. 2, pp. 43-48, 2002. ACM Press.

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, Vol. I. Addison-Wesley Publishing Company, Reading, Massachusetts, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11, 7, pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system - combining neural networks and rule-based programming in an operational application. In Proceedings, Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Vol. 2200, pp. 110-124, Jan. 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as a resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, Vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, Feb. 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12, 4 (Dec.), pp. 593-608, 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, Vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, Vol. 4, No. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26, 3, pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October 1948), pp. 379-423 and 623-656.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, Vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University. Available at http://www.niss.org/dgii/TR/ThesisTrottini-final.pdf, 2003.

[97] University of Milan - Computer Technology Institute - Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: The Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, Chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, Nov. 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: what Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract: Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 46: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

44 BIBLIOGRAPHY

[10] E Bertino I Nai Fovino and L Parasiliti Provenza A Framework forEvaluating Privacy Preserving Data Mining Algorithms Data Mining andKnowledge Discovery Journal year 2005 Kluwert

[11] EBertino and INai Fovino Information Driven Evaluation of Data Hid-ing Algorithms 7th International Conference on Data Warehousing andKnowledge Discovery Copenaghen August 2005 Springer-Verlag

[12] N M Blachman The amount of information that y gives about X IEEETruns Inform Theon vol IT-14 pp 27-31 Jan 1968 IEEE Press

[13] L Breiman J Friedman R Olshen and C Stone Classification of Re-gression Trees Wadsworth International Group year 1984

[14] S Brin R Motwani J D Ullman and S Tsur Dynamic itemset count-ing and implication rules for market basket data In Proc of the ACMSIGMOD International Conference on Management of Data year 1997ACM Press

[15] L Chang and I S Moskowitz Parsimonious downgrading and decisiontrees applied to the inference problem In Proceedings of the 1998 NewSecurity Paradigms Workshop pp82-89 year 1998 ACM Press

[16] P Cheeseman and J Stutz Bayesian Classification (AutoClass) Theoryand Results Advances in Knowledge Discovery and Data Mining AAAIPressMIT Press year 1996

[17] M S Chen J Han and P S Yu Data Mining An Overview from aDatabase Perspective IEEE Transactions on Knowledge and Data Engi-neering vol 8 (6) pp 866-883 year 1996 IEEE Educational ActivitiesDepartment

[18] F Y Chin and G Ozsoyoglu Auditing and inference control in statisticaldatabases IEEE Trans Softw Eng SE-8 6 (Apr) pp 574-582 year1982 IEEE Press

[19] F Y Chin and G Ozsoyoglu Statistical database design ACM TransDatabase Syst 6 1 (Mar) pp 113-139 year 1981 ACM Press

[20] L H Cox Suppression methodology and statistical disclosure control JAm Stat Assoc 75 370 (June) pp 377-385 year 1980

[21] E Dasseni V S Verykios A K Elmagarmid and E Bertino HidingAssociation Rules by using Confidence and Support in proceedings of the4th Information Hiding Workshop pp 369383 year 2001 Springer-Verlag

[22] D Defays An efficient algorithm for a complete link method The Com-puter Journal 20 pp 364-366 1977

[23] D E Denning and J Schlorer Inference control for statistical databasesComputer 16 (7) pp 69-82 year 1983 (July) IEEE Press

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 47: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

BIBLIOGRAPHY 45

[24] D Denning Secure statistical databases with random sample queriesACM TODS 5 3 pp 291-315 year 1980

[25] D E Denning Cryptography and Data Security Addison-Wesley Read-ing Mass 1982

[26] V Dhar Data Mining in finance using counterfactuals to generate knowl-edge from organizational information systems Information Systems Vol-ume 23 Number 7 pp 423-437(15) year 1998

[27] J Domingo-Ferrer and V Torra A Quantitative Comparison of Disclo-sure Control Methods for Microdata Confidentiality Disclosure and DataAccess Theory and Practical Applications for Statistical Agencies pp113-134 P Doyle J Lane J Theeuwes L Zayatz ed North-Hollandyear 2002

[28] P Domingos and M Pazzani Beyond independence Conditions for theoptimality of the simple Bayesian classifier Proceedings of the Thir-teenth International Conference on Machine Learning pp 105-112 SanFrancisco CA year 1996 Morgan Kaufmann

[29] P Drucker Beyond the Information Revolution The Atlantic Monthly1999

[30] P O Duda and P E Hart Pattern Classification and Scene AnalysisWiley year 1973 New York

[31] G T Duncan S A Keller-McNulty and S L Stokes Disclosure risks vsdata utility The R-U confidentiality map Tech Rep No 121 NationalInstitute of Statistical Sciences 2001

[32] C Dwork and K Nissim Privacy preserving data mining in verticallypartitioned database In Crypto 2004 Vol 3152 pp 528-544

[33] D L EAGER J ZAHORJAN and E D LAZOWSKA Speedup Ver-sus Efficiency in Parallel Systems IEEE Trans on Computers C-38 3(March 1989) pp 408-423 IEEE Press

[34] L Ertoz M Steinbach and V Kumar Finding clusters of different sizesshapes and densities in noisy high dimensional data In Proceeding tothe SIAM International Conference on Data Mining year 2003

[35] M Ester H P Kriegel J Sander and X XU A density-based algorithmfor discovering clusters in large spatial databases with noise In Proceed-ings of the 2nd ACM SIGKDD pp 226-231 Portland Oregon year 1996AAAI Press

[36] A Evfimievski Randomization in Privacy Preserving Data MiningSIGKDD Explor Newsl vol 4 number 2 year 2002 pp 43-48 ACMPress

46 BIBLIOGRAPHY

[37] A Evfimievski R Srikant R Agrawal and J Gehrke Privacy PreservingMining of Association Rules In Proceedings of the 8th ACM SIGKDDDInternational Conference on Knowledge Discovery and Data Mining year2002 Elsevier Ltd

[38] S E Fahlman and C Lebiere The cascade-correlation learning architec-ture Advances in Neural Information Processing Systems 2 pp 524-532Morgan Kaufmann year 1990

[39] R P Feynman R B Leighton and M Sands The Feynman Lectureson Physics v I Reading Massachusetts Addison-Wesley PublishingCompany year 1963

[40] S Fortune and J Wyllie Parallelism in Random Access Machines ProcTenth ACM Symposium on Theory of Computing(1978) pp 114-118ACM Press

[41] W Frawley G Piatetsky-Shapiro and C Matheus Knowledge Discoveryin Databases An Overview AI Magazine pp 213-228 year 1992

[42] S P Ghosh An application of statistical databases in manufacturingtesting IEEE Trans Software Eng 1985 SE-11 7 pp 591-596 IEEEpress

[43] SPGhosh An application of statistical databases in manufacturing test-ing In Proceedings of IEEE COMPDEC Conference pp 96-103year1984 IEEE Press

[44] S Guha R Rastogi and K Shim CURE An efficient clustering algorithmfor large databases In Proceedings of the ACM SIGMOD Conference pp73-84 Seattle WA 1998 ACM Press

[45] S Guha R Rastogi and K Shim ROCK A robust clustering algorithmfor categorical attributes In Proceedings of the 15th ICDE pp 512-521Sydney Australia year 1999 IEEE Computer Society

[46] J Han and M Kamber Data Mining Concepts and Techniques TheMorgan Kaufmann Series in Data Management Systems Jim Gray SeriesEditor Morgan Kaufmann Publishers August 2000

[47] J Han J Pei and Y Yin Mining frequent patterns without candidategeneration In Proceeding of the 2000 ACM-SIGMOD International Con-ference on Management of Data Dallas Texas USA May 2000 ACMPress

[48] M A Hanson and R L Brekke Workload management expert system -combining neural networks and rule-based programming in an operationalapplication In Proceedings Instrument Society of America pp 1721-1726year 1988

BIBLIOGRAPHY 47

[49] J Hartigan and M Wong Algorithm AS136 A k-means clustering algo-rithm Applied Statistics 28 pp 100-108 year 1979

[50] A Hinneburg and D Keim An efficient approach to clustering large mul-timedia databases with noise In Proceedings of the 4th ACM SIGKDDpp 58-65 New York year 1998 AAAI Press

[51] T Hsu C Liau and DWang A Logical Model for Privacy ProtectionLecture Notes in Computer Science Volume 2200 Jan 2001 pp 110-124Springer-Verlag

[52] IBM Synthetic Data GeneratorhttpwwwalmadenibmcomsoftwarequestResourcesdatasetssyndatahtml

[53] M Kantarcioglu and C Clifton Privacy Preserving Distributed Miningof Association Rules on Horizontally Partitioned Data In Proceedingsof the ACM SIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery pp 24-31 year 2002 IEEE Educational ActivitiesDepartment

[54] G Karypis E Han and V Kumar CHAMELEON A hierarchical clus-tering algorithm using dynamic modeling COMPUTER 32 pp 68-75year 1999

[55] L Kaufman and P Rousseeuw Finding Groups in Data An Introductionto Cluster Analysis John Wiley and Sons New York year 1990

[56] W Kent Data and reality North Holland New York year 1978

[57] S L Lauritzen The em algorithm for graphical association models withmissing data Computational Statistics and Data Analysis 19 (2) pp191-201 year 1995 Elsevier Science Publishers B V

[58] W Lee and S Stolfo Data Mining Approaches for Intrusion DetectionIn Proceedings of the Seventh USENIX Security Symposium (SECURITYrsquo98) San Antonio TX January 1998

[59] A V Levitin and T C Redman Data as resource properties implica-tions and prescriptions Sloan Management review Cambridge Vol 40Issue 1 pp 89-101 year 1998

[60] Y Lindell and B Pinkas Privacy Preserving Data Mining Journal ofCryptology vol 15 pp 177-206 year 2002 Springer Verlag

[61] R M Losee A Discipline Independent Definition of Information Journalof the American Society for Information Science 48 (3) pp 254-269 year1997

48 BIBLIOGRAPHY

[62] M Masera I Nai Fovino R Sgnaolin A Framework for the SecurityAssessment of Remote Control Applications of Critical Infrastructure 29thESReDA Seminar ldquoSystems Analysis for a More Secure Worldrdquo year 2005

[63] G MClachlan and K Basford Mixture Models Inference and Applica-tions to Clustering Marcel Dekker New York year 1988

[64] M Mehta J Rissanen and R Agrawal MDL-based decision tree pruningIn Proc of KDD year 1995 AAAI Press

[65] G L Miller Resonance Information and the Primacy of ProcessAncient Light on Modern Information and Communication Theory andTechnology PhD thesis Library and Information Studies Rutgers NewBrunswick NJ May 1987

[66] I S Moskowitz and L Chang A decision theoretical based system forinformation downgrading In Proceedings of the 5th Joint Conference onInformation Sciences year 2000 ACM Press

[67] S R M Oliveira and O R Zaiane Toward Standardization in PrivacyPreserving Data Mining ACM SIGKDD 3rd Workshop on Data MiningStandards pp 7-17 year 2004 ACM Press

[68] S R M Oliveira and O R Zaiane Privacy Preserving Frequent ItemsetMining Proceedings of the IEEE international conference on Privacysecurity and data mining pp 43-54 year 2002 Australian ComputerSociety Inc

[69] S R M Oliveira and O R Zaiane Privacy Preserving Clustering byData Transformation In Proceedings of the 18th Brazilian Symposiumon Databases Manaus Amazonas Brazil pp 304-318 year 2003

[70] K Orr Data Quality and System Theory Comm of the ACM Vol 41Issue 2 pp 66-71 Feb 1998 ACM Press

[71] M A Palley and J S Simonoff The use of regression methodology forcompromise of confidential information in statistical databases ACMTrans Database Syst 124 (Dec) pp 593-608 year 1987

[72] J S Park M S Chen and P S Yu An Effective Hash Based Algorithmfor Mining Association Rules Proceedings of ACM SIGMOD pp 175-186 May 1995 ACM Press

[73] Z Pawlak Rough SetsTheoretical Aspects of Reasoning about DataKluwer Academic Publishers 1991

[74] G Piatetsky-Shapiro Discovery analysis and presentation of strongrules Knowledge Discovery in Databases pp 229-238 AAAIMIT Pressyear 1991

BIBLIOGRAPHY 49

[75] A D Pratt The Information of the Image Ablex Norwood NJ 1982

[76] J R Quinlan C45 Programs for Machine Learning Morgan Kaufmannyear 1993

[77] J R Quinlan Induction of decision trees Machine Learning vol 1 pp81-106 year 1986 Kluwer Academic Publishers

[78] R Rastogi and S Kyuseok PUBLIC A Decision Tree Classifier thatIntegrates Building and Pruning Data Mining and Knowledge Discoveryvol 4 n4 pp 315-344 year 2000

[79] R Rastogi and K Shim Mining Optimized Association Rules with Cat-egorical and Numeric Attributes Proc of International Conference onData Engineering pp 503-512 year 1998

[80] H L Resnikoff The Illusion of Reality Springer-Verlag New York 1989

[81] S J Rizvi and J R Haritsa Maintaining Data Privacy in AssociationRule Mining In Proceedings of the 28th International Conference on VeryLarge Databases year 2003 Morgan Kaufmann

[82] S J Russell J Binder D Koller and K Kanazawa Local learning inprobabilistic networks with hidden variables In International Joint Con-ference on Artificial Intelligence pp 1146-1152 year 1995 Morgan Kauf-mann

[83] D E Rumelhart G E Hinton and R J Williams Learning internalrepresentations by error propagation Parallel distributed processing Ex-plorations in the microstructure of cognition Vol 1 Foundations pp318ndash362 Cambridge MA MIT Press year 1986

[84] G Sande Automated cell suppression to reserve confidentiality of businessstatistics In Proceedings of the 2nd International Workshop on StatisticalDatabase Management pp 346-353 year 1983

[85] A Savasere E Omiecinski and S Navathe An efficient algorithm formining association rules in large databases In Proceeding of the Con-ference on Very Large Databases Zurich Switzerland September 1995Morgan Kaufmann

[86] J Schlorer Information loss in partitioned statistical databases ComputJ 26 3 pp 218-223 year 1983 British Computer Society

[87] C E Shannon A Mathematical Theory of Communication Bell SystemTechnical Journal vol 27(July and October)1948 pp379423 pp 623-656

[88] C E Shannon and W Weaver The Mathematical Theory of Communi-cation University of Illinois Press Urbana Ill 1949

50 BIBLIOGRAPHY

[89] A Shoshani Statistical databases characteristicsproblems and some so-lutions Proceedings of the Conference on Very Large Databases (VLDB)pp208-222 year 1982 Morgan Kaufmann Publishers Inc

[90] R SIBSON SLINK An optimally efficient algorithm for the single linkcluster method Computer Journal 16 pp 30-34 year 1973

[91] P Smyth and R M Goodman An information theoretic Approach toRule Induction from databases IEEE Transaction On Knowledge AndData Engineering vol 3 n4 August1992 pp 301-316 IEEE Press

[92] L Sweeney Achieving k-Anonymity Privacy Protection using General-ization and Suppression International Jurnal on Uncertainty Fuzzynessand Knowledge-based System pp 571-588 year 2002 World ScientificPublishing Co Inc

[93] R Srikant and R Agrawal Mining Generalized Association Rules Pro-ceedings of the 21th International Conference on Very Large Data Basespp 407-419 September 1995 Morgan Kaufmann

[94] G K Tayi D P Ballou Examining Data Quality Comm of the ACMVol 41 Issue 2 pp 54-58 year 1998 ACM Press

[95] M Trottini A Decision-Theoretic Approach to data Disclosure ProblemsResearch in Official Statistics vol 4 pp 722 year 2001

[96] M Trottini Decision models for data disclosure lim-itation Carnegie Mellon University Available athttpwwwnissorgdgiiTRThesisTrottini -finalpdf year2003

[97] University of Milan - Computer Technology Institute - Sabanci UniversityCodmine IST project 2002-2003

[98] J Vaidya and C Clifton Privacy Preserving Association Rule Miningin Vertically Partitioned Data In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining pp639-644 year 2002 ACM Press

[99] V S Verykios E Bertino I Nai Fovino L Parasiliti Y Saygin YTheodoridis State-of-the-art in Privacy Preserving Data Mining SIG-MOD Record 33(1) pp 50-57 year 2004 ACM Press

[100] V S Verykios A K Elmagarmid E Bertino Y Saygin and E DasseniAssociation Rule Hiding IEEE Transactions on Knowledge and DataEngineering year 2003 IEEE Educational Activities Department

[101] C Wallace and D Dowe Intrinsic classification by MML The Snobprogram In the Proceedings of the 7th Australian Joint Conference onArtificial Intelligence pp 37- 44 UNE World Scientific Publishing CoArmidale Australia 1994

BIBLIOGRAPHY 51

[102] G J Walters Philosophical Dimensions of Privacy An Anthology Cam-bridge University Press year 1984

[103] G J Walters Human Rights in an Information Age A PhilosophicalAnalysis chapter 5 University of Toronto Press year 2001

[104] Y Wand and R Y Wang Anchoring Data Quality Dimensions in On-tological Foundations Comm of the ACM Vol 39 Issue 11 pp 86-95Nov 1996 ACM Press

[105] R Y Wang and D M Strong Beyond Accuracy what Data QualityMeans to Data Consumers Journal of Management Information SystemsVol 12 Issue 4 pp 5-34 year 1996

[106] L Willenborg and T De Waal Elements of statistical disclosure controlLecture Notes in Statistics Vol155 Springer Verlag New York

[107] N Ye and X Li A Scalable Clustering Technique for Intrusion Signa-ture Recognition 2001 IEEE Man Systems and Cybernetics InformationAssurance Workshop West Point NY June 5-6 year 2001 IEEE Press

[108] M J Zaki S Parthasarathy M Ogihara and W Li New algorithms forfast discovery of association rules In Proceeding of the 8rd InternationalConference on KDD and Data Mining Newport Beach California August1997 AAAI Press

European Commission EUR 23070 EN ndash Joint Research Centre ndash Institute for the Protection and Security of the Citizen Title Privacy Preserving Data Mining a Data Quality Approach Authors Igor Nai Fovino and Marcelo Masera Luxembourg Office for Official Publications of the European Communities 2008 ndash 56 pp ndash 21 x 297 cm EUR ndash Scientific and Technical Research series ndash ISSN 1018-5593 Abstract Privacy is one of the most important properties an information system must satisfy A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when datamining techniques are used Privacy Preserving Data Mining (PPDM) algorithms have been recently introduced with the aim of sanitizing the database in such a way to prevent the discovery of sensible information (eg association rules) A drawback of such algorithms is that the introduced sanitization may disrupt the quality of data itself In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database

How to obtain EU publications Our priced publications are available from EU Bookshop (httpbookshopeuropaeu) where you can place an order with the sales agent of your choice The Publications Office has a worldwide network of sales agents You can obtain their contact details by sending a fax to (352) 29 29-42758

The mission of the JRC is to provide customer-driven scientific and technical support for the conception development implementation and monitoring of EU policies As a service of the European Commission the JRC functions as a reference centre of science and technology for the Union Close to the policy-making process it serves the common interest of the Member States while being independent of special interests whether private or national

  • Data_Quality_approach_coverpdf
  • PPDM_Data_Quality_Approach2pdf
  • Data_quality_approach_endpdf
Page 48: Privacy Preserving Data Mining, a Data Quality Approachpublications.jrc.ec.europa.eu/.../ppdm_data_quality... · data quality context, Orr notes that data quality and data privacy

46 BIBLIOGRAPHY

[37] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Elsevier Ltd.

[38] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, 1990.

[39] R. P. Feynman, R. B. Leighton and M. Sands. The Feynman Lectures on Physics, vol. I. Addison-Wesley Publishing Company, Reading, Massachusetts, 1963.

[40] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proc. Tenth ACM Symposium on Theory of Computing (1978), pp. 114-118. ACM Press.

[41] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, pp. 213-228, 1992.

[42] S. P. Ghosh. An application of statistical databases in manufacturing testing. IEEE Trans. Software Eng., SE-11(7), pp. 591-596, 1985. IEEE Press.

[43] S. P. Ghosh. An application of statistical databases in manufacturing testing. In Proceedings of the IEEE COMPDEC Conference, pp. 96-103, 1984. IEEE Press.

[44] S. Guha, R. Rastogi and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference, pp. 73-84, Seattle, WA, 1998. ACM Press.

[45] S. Guha, R. Rastogi and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, pp. 512-521, Sydney, Australia, 1999. IEEE Computer Society.

[46] J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (Jim Gray, Series Editor). Morgan Kaufmann Publishers, August 2000.

[47] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000. ACM Press.

[48] M. A. Hanson and R. L. Brekke. Workload management expert system – combining neural networks and rule-based programming in an operational application. In Proceedings of the Instrument Society of America, pp. 1721-1726, 1988.

[49] J. Hartigan and M. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, pp. 100-108, 1979.

[50] A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proceedings of the 4th ACM SIGKDD, pp. 58-65, New York, 1998. AAAI Press.

[51] T. Hsu, C. Liau and D. Wang. A Logical Model for Privacy Protection. Lecture Notes in Computer Science, Volume 2200, pp. 110-124, January 2001. Springer-Verlag.

[52] IBM. Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[53] M. Kantarcioglu and C. Clifton. Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24-31, 2002. IEEE Educational Activities Department.

[54] G. Karypis, E. Han and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32, pp. 68-75, 1999.

[55] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[56] W. Kent. Data and Reality. North Holland, New York, 1978.

[57] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19(2), pp. 191-201, 1995. Elsevier Science Publishers B.V.

[58] W. Lee and S. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.

[59] A. V. Levitin and T. C. Redman. Data as a resource: properties, implications and prescriptions. Sloan Management Review, Cambridge, Vol. 40, Issue 1, pp. 89-101, 1998.

[60] Y. Lindell and B. Pinkas. Privacy Preserving Data Mining. Journal of Cryptology, vol. 15, pp. 177-206, 2002. Springer-Verlag.

[61] R. M. Losee. A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), pp. 254-269, 1997.

[62] M. Masera, I. Nai Fovino and R. Sgnaolin. A Framework for the Security Assessment of Remote Control Applications of Critical Infrastructure. 29th ESReDA Seminar "Systems Analysis for a More Secure World", 2005.

[63] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[64] M. Mehta, J. Rissanen and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995. AAAI Press.

[65] G. L. Miller. Resonance, Information, and the Primacy of Process: Ancient Light on Modern Information and Communication Theory and Technology. PhD thesis, Library and Information Studies, Rutgers, New Brunswick, NJ, May 1987.

[66] I. S. Moskowitz and L. Chang. A decision theoretical based system for information downgrading. In Proceedings of the 5th Joint Conference on Information Sciences, 2000. ACM Press.

[67] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy Preserving Data Mining. In ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7-17, 2004. ACM Press.

[68] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Frequent Itemset Mining. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002. Australian Computer Society, Inc.

[69] S. R. M. Oliveira and O. R. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Amazonas, Brazil, pp. 304-318, 2003.

[70] K. Orr. Data Quality and System Theory. Comm. of the ACM, Vol. 41, Issue 2, pp. 66-71, February 1998. ACM Press.

[71] M. A. Palley and J. S. Simonoff. The use of regression methodology for compromise of confidential information in statistical databases. ACM Trans. Database Syst., 12(4), pp. 593-608, December 1987.

[72] J. S. Park, M. S. Chen and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proceedings of ACM SIGMOD, pp. 175-186, May 1995. ACM Press.

[73] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[74] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, pp. 229-238. AAAI/MIT Press, 1991.

[75] A. D. Pratt. The Information of the Image. Ablex, Norwood, NJ, 1982.

[76] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[77] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, pp. 81-106, 1986. Kluwer Academic Publishers.

[78] R. Rastogi and S. Kyuseok. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery, vol. 4, n. 4, pp. 315-344, 2000.

[79] R. Rastogi and K. Shim. Mining Optimized Association Rules with Categorical and Numeric Attributes. In Proc. of the International Conference on Data Engineering, pp. 503-512, 1998.

[80] H. L. Resnikoff. The Illusion of Reality. Springer-Verlag, New York, 1989.

[81] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. In Proceedings of the 28th International Conference on Very Large Databases, 2002. Morgan Kaufmann.

[82] S. J. Russell, J. Binder, D. Koller and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In International Joint Conference on Artificial Intelligence, pp. 1146-1152, 1995. Morgan Kaufmann.

[83] D. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA, 1986.

[84] G. Sande. Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346-353, 1983.

[85] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Conference on Very Large Databases, Zurich, Switzerland, September 1995. Morgan Kaufmann.

[86] J. Schlorer. Information loss in partitioned statistical databases. Comput. J., 26(3), pp. 218-223, 1983. British Computer Society.

[87] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, vol. 27, pp. 379-423 and pp. 623-656, July and October 1948.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949.

[89] A. Shoshani. Statistical databases: characteristics, problems and some solutions. In Proceedings of the Conference on Very Large Databases (VLDB), pp. 208-222, 1982. Morgan Kaufmann Publishers Inc.

[90] R. Sibson. SLINK: An optimally efficient algorithm for the single link cluster method. Computer Journal, 16, pp. 30-34, 1973.

[91] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, vol. 3, n. 4, pp. 301-316, August 1992. IEEE Press.

[92] L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002. World Scientific Publishing Co., Inc.

[93] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, September 1995. Morgan Kaufmann.

[94] G. K. Tayi and D. P. Ballou. Examining Data Quality. Comm. of the ACM, Vol. 41, Issue 2, pp. 54-58, 1998. ACM Press.

[95] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems. Research in Official Statistics, vol. 4, pp. 7-22, 2001.

[96] M. Trottini. Decision Models for Data Disclosure Limitation. Carnegie Mellon University, 2003. Available at http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf.

[97] University of Milan, Computer Technology Institute and Sabanci University. Codmine, IST project, 2002-2003.

[98] J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002. ACM Press.

[99] V. S. Verykios, E. Bertino, I. Nai Fovino, L. Parasiliti, Y. Saygin and Y. Theodoridis. State-of-the-art in Privacy Preserving Data Mining. SIGMOD Record, 33(1), pp. 50-57, 2004. ACM Press.

[100] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 2003. IEEE Educational Activities Department.

[101] C. Wallace and D. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pp. 37-44, UNE, Armidale, Australia, 1994. World Scientific Publishing Co.

[102] G. J. Walters. Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press, 1984.

[103] G. J. Walters. Human Rights in an Information Age: A Philosophical Analysis, chapter 5. University of Toronto Press, 2001.

[104] Y. Wand and R. Y. Wang. Anchoring Data Quality Dimensions in Ontological Foundations. Comm. of the ACM, Vol. 39, Issue 11, pp. 86-95, November 1996. ACM Press.

[105] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, Vol. 12, Issue 4, pp. 5-34, 1996.

[106] L. Willenborg and T. De Waal. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Springer-Verlag, New York.

[107] N. Ye and X. Li. A Scalable Clustering Technique for Intrusion Signature Recognition. In 2001 IEEE Man, Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001. IEEE Press.

[108] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining, Newport Beach, California, August 1997. AAAI Press.

European Commission
EUR 23070 EN – Joint Research Centre – Institute for the Protection and Security of the Citizen
Title: Privacy Preserving Data Mining, a Data Quality Approach
Authors: Igor Nai Fovino and Marcelo Masera
Luxembourg: Office for Official Publications of the European Communities, 2008 – 56 pp. – 21 x 29.7 cm
EUR – Scientific and Technical Research series – ISSN 1018-5593

Abstract
Privacy is one of the most important properties that an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of sanitizing the database in such a way as to prevent the discovery of sensitive information (e.g. association rules). A drawback of such algorithms is that the introduced sanitization may disrupt the quality of the data itself. In this report we introduce a new methodology and algorithms for performing useful PPDM operations while preserving the data quality of the underlying database.

How to obtain EU publications
Our priced publications are available from EU Bookshop (http://bookshop.europa.eu), where you can place an order with the sales agent of your choice. The Publications Office has a worldwide network of sales agents. You can obtain their contact details by sending a fax to (352) 29 29-42758.

The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.
