Transcript of Directions in Big Data Anonymisation · 2016-12-13

Directions in Big Data Anonymisation


Josep Domingo-Ferrer

Universitat Rovira i Virgili, Tarragona, Catalonia

[email protected]

Cambridge, the 5th of December, 2016


1 Introduction

2 Big data, law and ethics

3 Nihilists: no privacy possible with big data

4 Fundamentalists: privacy even if data become useless

5 Desiderata in big data anonymization

6 Big data protection under k-anonymity

7 Big data protection under differential privacy

8 Transparent, local and collaborative anonymization

9 Conclusions and further research


Introduction

Big data became a reality with the new millennium.

Any human activity leaves a digital track that someone collects and stores:

Sensors of the Internet of Things
Social media
Machine-to-machine communication
Mobile video, etc.


Example: Big data from the Internet


Example: Big data from smart cities and IoT


Distinguishing features of big data

Volume. The volume of data in the digital universe reached 9.5 billion petabytes in 2015 (9.5 × 10^24 bytes), an increase of 3 billion petabytes over 2014 (Meeker 2016).

Velocity. Most data are no longer static data sets, but dynamic data. On-line data can be harvested at millions of events per second (e.g. by sensors).

Variety. Data come from several sources and in different formats (numerical, categorical, unstructured text, audio, video, etc.).


New big data technologies

Storage. New technologies have arisen to replace traditional structured storage, such as Hadoop, NoSQL, MapReduce, etc.

Data science. Conventional statistics tries to infer population properties from (small) samples. Data science leverages practically all the data of the population of interest. Data variety and volume allow new and very sophisticated analyses.


Big data threat to privacy

While big data are very valuable in many fields, they increasingly threaten the privacy of the individuals about whom they are collected (who are often unaware of the collection).

E.g. a retail chain's prediction model guessed a teenager's pregnancy before her parents did (Duhigg 2012).


Statistical disclosure control

Statisticians and computer scientists have worried about disclosure risk since the time of small data.

Statistical disclosure control (SDC, Hundepool et al. (2012)) seeks to allow useful inferences on data while preserving the privacy of the subjects to whom the records correspond.

SDC techniques are available for microdata (data sets with records corresponding to individuals), tabular data and on-line queryable databases.

SDC-protected data are also called anonymized data.


Utility-first and privacy-first SDC

Utility-first anonymization (iteratively changing parameters until the empirical disclosure risk is low enough, as usual in official statistics) is slow and lacks formal privacy guarantees.

Privacy-first anonymization (based on enforcing a privacy model, like k-anonymity, t-closeness or ε-differential privacy) may lead to poor data utility/linkability depending on the parameter choice.


Big data, law and ethics

Big data involve collecting all possible data and extracting knowledge from them, possibly using innovative methods.

This conflicts with the privacy of individuals, especially as the data subject (consumer, citizen) is often unaware of providing her data.

The service provider obtains the data as a result of a transaction (e.g. an on-line purchase), in return for a free service (e.g. social media or e-mail), or as a natural requirement of a service (e.g. location when using GPS).


Personal data protection principles in the EU law

Personal data or, more precisely, personally identifiable information (PII), mean any information related to an identified or identifiable natural person.

Principles applicable to PII before big data (Art. 29 Data Protection Working Party, new General Data Protection Regulation; see D'Acquisto et al. (2015)):

Lawfulness (consent obtained, or processing needed for a contract, a legal obligation, the subject's vital interests, a public interest, or legitimate processor's interests compatible with the subject's rights)

Consent (simple, specific, informed and explicit)

Purpose limitation (legitimate and specified before collection)


Personal data protection principles in the EU law (II)

Necessity and data minimization (collect only what is needed and keep it only as long as needed)

Transparency and openness (subjects need to get information about collection and processing in a way they understand)

Individual rights (to access, rectify, erase/be forgotten)

Information security (collected data protected against unauthorized access and processing, manipulation, loss, destruction, etc.)

Accountability (ability to demonstrate compliance with the principles)

Data protection by design and by default (privacy built in from the start rather than added later)


Personal big data conflict with principles

Big data result from collecting and linking data from several sources, often in a continuous way.

Unless personal data are anonymized, there are potential conflicts with the above principles:

Purpose limitation. Big data are often used secondarily for purposes not even known at collection time.
Consent. If the purpose is not clear, consent cannot be obtained.
Lawfulness. Without purpose limitation and consent, lawfulness is dubious.
Necessity and data minimization. Big data result precisely from accumulating data for potential use.
Individual rights. Individuals do not even know which data are stored on them.
Accountability. Compliance does not hold and hence cannot be demonstrated.


Reactions to the big data vs privacy conflict

Minimize privacy. To avoid hindering technology development, privacy protection should be limited to preventing privacy-damaging data uses. Data collection should be free or self-regulated.

Minimize collection. Proponents regard data collection as the primary privacy problem, and they advocate minimizing it. Indeed, there are threats tied to data collection itself.


Threats tied to data collection

Data breaches. The more data are collected, the more attractive they are to potential attackers.

Misuse by employees (Chen 2010). The data controller's employees may misuse the data.

Undesired secondary use. Health data of an opponent of contraceptives may be used to develop new contraceptives.

Changes in corporate practice. A data collector's privacy pledge may change (e.g. WhatsApp recently decided unilaterally to share the phone numbers of its users with Facebook).

Government access without due legal guarantees (Solove 2011). For example, NSA access to the data of users of the big Internet companies.


The anonymization solution

Anonymization is a possible way to overcome the conflict between big data and privacy.

The conflict relates to PII, but one may consider that after anonymization there are no longer PII, so that no protection is needed.

Yet, anonymization of big data faces several challenges...


Challenges in big data anonymization

Too little anonymization, for example mere de-identification (suppression of identifiers), may be insufficient to prevent re-identification (Barbaro and Zeller 2006).

This is especially problematic with big data, whose volume and variety facilitate re-identifying subjects.

Too much anonymization may prevent data corresponding to the same or similar subjects from being linked, which hinders big data construction.


Nihilists: privacy must be sacrificed

Privacy to be sacrificed to security. Governments (anti-terrorist fight). Companies (biometric identification of employees or customers, which breaks privacy without always guaranteeing more security).

Privacy to be sacrificed to functionality. Free web applications and mobile apps (search engines; Google Calendar, Streetview, Latitude, etc.).

Privacy to be sacrificed to functionality and security. Data collected by Internet companies by means of free applications may be leaked to governments (Snowden on the NSA).


The pragmatic nihilists: data brokers

They give no arguments, but they collect all the personal data they can possibly find (web, social media, etc.) or buy (loyalty programs, on-line commerce, etc.).

They cluster all information corresponding to the same person to get personal profiles.

They sell those profiles to whoever buys them, typically personalized marketing companies.

Several data brokers operate in the U.S., among which Acxiom accumulates data on over 700 million people worldwide (FTC 2014).

Data brokers threaten privacy even more than Internet companies, because the former are unknown to the public.


Data broker activity


The extreme nihilists

E.g. Teradata’s CTO:

They claim that aspiring to any privacy in the big data society is delusional.

The best people can expect is for data collectors not to misuse their data (which they cannot verify).


Fundamentalists: privacy even if data become useless

Statistical disclosure control was inaugurated by Dalenius (1977). See Hundepool et al. (2012) for a state of the art.

Later, privacy-preserving data mining (PPDM) arose in computer science as a parallel to SDC (Agrawal and Srikant 2000).

Computer scientists contributed the notion of privacy models, which specify ex ante parameterizable privacy guarantees.

They are enforced by using one (or several) anonymization methods.

Privacy models with very stringent privacy parameters may render data useless for exploratory analysis.


Privacy models: k-anonymity

k-Anonymity (Samarati & Sweeney 1998)

A data set is said to satisfy k-anonymity if each combination of values of the quasi-identifier attributes in it is shared by at least k records (a k-anonymous class).

⇒ Usually enforced via generalization and suppression in quasi-identifiers, but also reachable via microaggregation (Domingo-Ferrer and Torra 2005)
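As an illustration (not from the slides; attribute names and values are made up), the definition can be checked directly by counting how many records share each quasi-identifier combination:

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """True iff every combination of quasi-identifier values is
    shared by at least k records (every class has size >= k)."""
    class_sizes = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(size >= k for size in class_sizes.values())

# Toy records in which Age was generalized to ranges and ZIP truncated.
data = [
    {"Age": "20-30", "ZIP": "082**", "Disease": "flu"},
    {"Age": "20-30", "ZIP": "082**", "Disease": "cold"},
    {"Age": "40-50", "ZIP": "170**", "Disease": "flu"},
    {"Age": "40-50", "ZIP": "170**", "Disease": "asthma"},
]
print(is_k_anonymous(data, ["Age", "ZIP"], 2))  # True: both classes hold 2 records
print(is_k_anonymous(data, ["Age", "ZIP"], 3))  # False
```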


Example: a 2-anonymous data set


Privacy models that extend k-anonymity

l-Diversity (Machanavajjhala et al. 2006)

A data set is said to satisfy l-diversity if, for each group of records sharing a combination of quasi-identifier attributes, there are at least l "well-represented" values for each confidential attribute.

t-Closeness (Li et al. 2007)

A data set is said to satisfy t-closeness if, for each group of records sharing a combination of quasi-identifier attributes, the distance between the distribution of the confidential attribute in the group and the distribution of the attribute in the whole data set is no more than a threshold t.
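For concreteness, the simplest reading of "well-represented" (distinct l-diversity) can be checked per quasi-identifier class; the data and attribute names below are hypothetical:

```python
from collections import defaultdict

def is_l_diverse(records, qi_attrs, sensitive_attr, l):
    """Distinct l-diversity: every quasi-identifier class must contain
    at least l distinct values of the sensitive attribute."""
    values_per_class = defaultdict(set)
    for r in records:
        values_per_class[tuple(r[a] for a in qi_attrs)].add(r[sensitive_attr])
    return all(len(vals) >= l for vals in values_per_class.values())

data = [
    {"Age": "20-30", "ZIP": "082**", "Disease": "flu"},
    {"Age": "20-30", "ZIP": "082**", "Disease": "cold"},
    {"Age": "40-50", "ZIP": "170**", "Disease": "flu"},
    {"Age": "40-50", "ZIP": "170**", "Disease": "flu"},
]
# The second class only contains "flu", so 2-diversity fails even
# though the data set is 2-anonymous.
print(is_l_diverse(data, ["Age", "ZIP"], "Disease", 2))  # False
```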


Privacy models: ε-differential privacy

ε-Differential privacy (Dwork 2006)

A randomized query function F gives ε-differential privacy if, for all data sets D1, D2 such that one can be obtained from the other by modifying a single record (neighbor data sets), and for all S ⊆ Range(F):

Pr(F(D1) ∈ S) ≤ exp(ε) · Pr(F(D2) ∈ S).

Usually enforced via Laplacian noise addition.

Later extended for data set publishing (Soria-Comas et al. 2014; Xiao et al. 2014; Xu et al. 2012; Zhang et al. 2014).
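A minimal sketch of the Laplace mechanism mentioned above, applied to a counting query (a count has sensitivity 1, so the noise scale is 1/ε); the data and query are made up:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) using a single uniform draw
    (the distribution is symmetric around 0)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_count(values, predicate, epsilon, rng=None):
    """epsilon-DP count query: adding or removing one record changes a
    count by at most 1, so Laplace noise of scale 1/epsilon suffices."""
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 41, 29, 52, 61, 33]
# Smaller epsilon -> larger noise scale -> more privacy, less accuracy.
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
```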


Privacy models: ε-differential privacy (II)


Relaxations of differential privacy

Strict differential privacy is problematic for the exploratory uses typical of big data.

Relaxations of it are being proposed:

Mohan et al. (2012) claim that less protection is needed for older data.
Machanavajjhala and Kifer (2015) restrict the definition of neighbor data sets: the differences between the differing records are bounded.
Dwork and Rothblum (2016) propose concentrated differential privacy, whereby the DP guarantee may be violated with a small probability.


Desiderata in big data anonymization

Anonymized big data that are published should yield results similar to those obtained on the original big data for a broad range of exploratory analyses.

They should not allow unequivocal reconstruction of any subject's profile.

A privacy model for big data should satisfy at least (Soria-Comas and Domingo-Ferrer 2015):

Composability
(Quasi-)linear computational cost
Linkability


Composability

A privacy model is composable if its privacy guarantee holds (perhaps in a limited way) after repeated application.

In other words, a privacy model is not composable if pooling independently released data sets, each of which satisfies the model separately, can lead to a violation of the model.

Composability can be evaluated between data sets satisfying the same privacy model, different privacy models, or between an anonymized data set and a non-anonymized data set (the latter is the most demanding case).

Composability is needed to cope with the velocity and variety features of big data.


(Quasi-)linear computational cost

Low cost is needed to cope with the volume feature of big data.

Normally, there are several SDC methods that can be used to satisfy a privacy model.

The computational cost depends on the selected method.

The desirable costs would be O(n) or at most O(n log n), for a data set of n records.

For methods with higher cost, blocking can be used, but it can damage the utility and/or privacy of the resulting data.
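A sketch of blocking (all names and the toy block method are made up): an expensive anonymization method is run on fixed-size blocks, so a quadratic method costs O(n · block_size) overall instead of O(n^2):

```python
def anonymize_with_blocking(records, block_size, anonymize_block):
    """Split the data into fixed-size blocks and anonymize each block
    independently with a (possibly superlinear) method."""
    out = []
    for i in range(0, len(records), block_size):
        out.extend(anonymize_block(records[i:i + block_size]))
    return out

def mean_block(block):
    """Toy block method: replace every value by the block mean
    (a crude, microaggregation-like masking)."""
    m = sum(block) / len(block)
    return [m] * len(block)

print(anonymize_with_blocking([1, 2, 3, 4, 5, 6], 3, mean_block))
# [2.0, 2.0, 2.0, 5.0, 5.0, 5.0]
```

Because blocks are anonymized in isolation, records that would have been grouped across block boundaries no longer are, which is exactly the utility/privacy damage the slide warns about.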


Linkability

In big data, the information on a particular subject is collected from several sources (the variety feature of big data).

Hence, the ability to link records corresponding to the same individual or to similar individuals is critical.

Thus, anonymizing data at the source should preserve linkability to some extent.

But... linking records corresponding to the same subject decreases the subject's privacy ⇒ the accuracy of linkage should be lower with anonymized data sets than with original data sets.


Big data protection under k-anonymity

In a context of big data, it is hard to determine the subset of QI attributes (attributes that can be used by an attacker to link with external identified databases).

The safest option is to consider that all attributes are QI attributes.


Composability of k-anonymity

k-Anonymity was designed to protect a single data set and is not composable in principle.

If several k-anonymous data sets have been published that share some subjects, the attacker can mount an intersection attack to discard some records in the k-anonymous classes as not corresponding to the target subject (based on the latter's confidential attributes).

To reach composability, the controllers ought to coordinate so that, for the subjects shared by two data sets, their k-anonymous classes contain the same k subjects.

If such coordination is infeasible, see Domingo-Ferrer and Soria-Comas (2016) for alternative strategies.


Intersection attack against k-anonymity

R1, ..., Rn ← n independent data releases
P ← set of subjects present in all of R1, ..., Rn
for each individual i in P do
    for j = 1 to n do
        e_ij ← equivalence class of Rj associated with i
        s_ij ← set of confidential values in e_ij
    end for
    S_i ← s_i1 ∩ s_i2 ∩ ... ∩ s_in
end for
return S_1, ..., S_|P|
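The attack can be sketched in Python by abstracting each equivalence class as the set of confidential values it contains (all data below are made up):

```python
def intersection_attack(releases, shared_subjects):
    """For each subject present in every release, intersect the sets of
    confidential values of her equivalence classes across releases.
    A singleton result means the confidential value is disclosed."""
    candidates = {}
    for subject in shared_subjects:
        sets = [release[subject] for release in releases]
        candidates[subject] = set.intersection(*sets)
    return candidates

# Two independent 2-anonymous releases: Alice's class holds {flu, cold}
# in the first release and {flu, asthma} in the second.
r1 = {"alice": {"flu", "cold"}, "bob": {"flu", "cold"}}
r2 = {"alice": {"flu", "asthma"}, "bob": {"cold", "asthma"}}
print(intersection_attack([r1, r2], ["alice", "bob"]))
```

Each release is 2-anonymous on its own, yet the intersection pins down both subjects' confidential values.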


Computational cost of k-anonymity

k-Anonymity is attained by modifying the values of QI attributes, either by combining generalization and suppression (Samarati and Sweeney 1998) or via microaggregation (Domingo-Ferrer and Torra 2005).

Optimal generalization/suppression and optimal microaggregation are NP-hard problems.

Using heuristics and blocking, one can reach O(n log n) complexity, where n is the number of records.


Linkability of k-anonymity

For a subject known to be in two k-anonymous data sets, we can determine and link the corresponding k-anonymous classes containing her.

If some of the confidential attributes are shared between the data sets, the linkage accuracy improves (one can link within k-anonymous classes).


Summary on k-anonymity for big data

For k-anonymity to be composable, the controllers sharing subjects must coordinate or follow suitable strategies.

There are quasi-linear heuristics for k-anonymity.

Linkability is possible at least at the k-anonymous class level.

With some coordination effort, k-anonymity is a reasonable option to anonymize big data.


Big data protection under differential privacy

ε-Differential privacy (DP) offers strong privacy guarantees.

The smaller ε, the more privacy.

DP can be reached via noise addition or by generating synthetic data from a differentially private model (e.g. a histogram).

A synthetic data set can be either partially or fully synthetic.

In partial synthesis, only values deemed too sensitive are replaced by synthetic data.


Composability of DP: sequential composition

Sequential composition refers to a sequence of computations, each of which provides differential privacy in isolation, also providing differential privacy as a sequence.

Theorem

Let κi(D), for i ∈ I, be computations over D providing εi-differential privacy. The sequence of computations (κi(D))i∈I provides (∑i∈I εi)-differential privacy.
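The theorem can be read as budget accounting: each released computation spends its εi against a total ε budget. A sketch (the class and its names are hypothetical):

```python
class PrivacyBudget:
    """Sequential-composition accountant: releasing computations with
    parameters eps_1, ..., eps_n against the same data set consumes
    eps_1 + ... + eps_n of a total epsilon budget."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Register one epsilon-DP computation; refuse if it would
        exceed the overall guarantee."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(1.0)
budget.charge(0.4)  # first query: 0.4-DP
budget.charge(0.5)  # second query: the sequence is 0.9-DP so far
# A further budget.charge(0.2) would exceed the 1.0 budget and raise.
```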


Composability of DP: parallel composition

Parallel composition refers to several ε-differentially private computations, each on data from a disjoint set of subjects, yielding an ε-differentially private output on the data from the pooled set of subjects.

Theorem

Let κi(Di), for i ∈ I, be computations over Di providing ε-differential privacy. If each Di contains data on a set of subjects disjoint from the sets of subjects of Dj for all j ≠ i, then (κi(Di))i∈I provides ε-differential privacy.


Composability of DP for data sets

Sequential composition. The release of εi-differentially private data sets Di, for i ∈ I, is (∑i∈I εi)-differentially private.

That is, by accumulating differentially private data about a set of individuals, differential privacy is not broken, but the level of privacy decreases.

Parallel composition. The release of ε-differentially private data sets Di referring to disjoint sets of individuals, for i ∈ I, is ε-differentially private.


Computational cost of DP

DP by noise addition has linear cost O(n).

It has been suggested to use other methods to attain DP with improved utility:

Data synthesis (Cormode et al. 2012; Zhang et al. 2014) has a higher computational complexity.
A microaggregation step prior to noise addition (Sanchez et al. 2014; Soria-Comas et al. 2014) has complexity O(n^2) or O(n log n), depending on whether blocking is used.


Linkability of DP

In general, there is no linkability between two DP data sets generated via noise addition or as fully synthetic data.

Partially synthetic data sets, although they do not satisfy strict DP, allow accurate linkage.


Summary on DP for big data

DP has good composability properties, which may make it suitable to anonymize dynamic data.

DP also has a low computational cost, which may make it suitable for very large data sets.

Linkability across differentially private data sets is only feasible if the data sets share unaltered attributes.

The main problem with DP is that it does not provide significant utility for exploratory analyses unless the ε parameter is quite large.


Transparency to subjects and users

In a big data context, there are potentially many controllers dealing with a subject's data, and it cannot be assumed that the data subjects or users trust all the controllers involved.

There is a need for anonymization to be transparent to both users and subjects.

The subject must be able to assess how much her data have been anonymized.

The user must be told the SDC methods and parameters used, except any random seeds, in order to maximize data utility.

See transparency proposals in Domingo-Ferrer and Muralidhar (2016).


Local anonymization

Despite, or because of, transparency, subjects may prefer to take care of anonymizing their own data.

In a big data context, this is also good to relieve the data controller of the computational burden.

Local anonymization is an alternative SDC paradigm in which the subject anonymizes her own data record before handing it to the controller (Warner 1965; Agrawal and Haritsa 2005; Song and Ge 2014).

However, in local anonymization each subject lacks a global view of the data set, which may lead to overdoing anonymization and wasting data utility.
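Warner's randomized response is the classic instance of local anonymization. The sketch below (illustrative names, single yes/no attribute) shows each subject perturbing her own answer locally, with the controller correcting the bias only in aggregate:

```python
import random

def randomized_response(true_answer, p=0.75):
    """Run locally by each subject (Warner 1965): report the true yes/no
    answer with probability p and the opposite with probability 1 - p."""
    return true_answer if random.random() < p else not true_answer

def estimate_yes_proportion(reports, p=0.75):
    """Controller-side unbiased estimate of the true proportion pi of
    'yes' answers: E[observed] = p*pi + (1 - p)*(1 - pi), solved for pi."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

No individual report reveals the subject's true answer with certainty, yet the estimator's error shrinks with the number of subjects, so aggregate analyses remain accurate even though each record was perturbed locally.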


Collaborative anonymization

Proposed by Soria-Comas and Domingo-Ferrer (2015b), it combines the low utility loss of centralized anonymization with the high subject privacy of local anonymization.

It is based on the notion of co-utility (Domingo-Ferrer et al. 2016).

Subjects collaborate to determine the disclosure risk associated with their data and then locally apply the right level of protection.


Conclusions and further research

Conclusions

There is a debate on whether big data are compatible with the privacy of citizens.

There are two extreme positions: nihilism and fundamentalism.

We have tried to break new ground by opening a midway path.

We have stated the desirable properties of privacy models for big data (composability, low computation, linkability).

We have examined how well the two main privacy models (k-anonymity and ε-differential privacy) satisfy those properties.

We have also highlighted the need for transparency and perhaps for local and collaborative anonymization.


Future research

This midway path is by no means ready.

Privacy models are needed that satisfy composability, low computation, linkability and utility preservation for exploratory analyses.

The variety of big data goes beyond data sets formed by records: it includes video, audio, unstructured text, etc., whose anonymization is quite challenging.

Privacy models and SDC methods must be able to cope with velocity and volume: anonymizing dynamic big data is a largely unexplored territory.

Finally, collaborative anonymization is also attractive to preserve the self-determination of subjects without losing utility.


References I

S. Agrawal and J.R. Haritsa (2005) A framework for high-accuracy privacy-preserving data mining, in ICDE'05, IEEE, pp. 193-204.

R. Agrawal and R. Srikant (2000) Privacy-preserving data mining, in ACM SIGMOD'00, pp. 439-450.

M. Barbaro and T. Zeller (2006) A face is exposed for AOL searcher no. 4417749, New York Times.

A. Chen (2010) Gcreep: Google engineer stalked teens, spied on chats, Gawker.

G. Cormode, C. Procopiuc, D. Srivastava, E. Shen and T. Yu (2012) Differentially private spatial decompositions, in Proceedings of the 2012 IEEE 28th International Conference on Data Engineering-ICDE'12, Washington, DC, USA, pp. 20-31. IEEE Computer Society.


References II

G. D'Acquisto, J. Domingo-Ferrer, P. Kikiras, V. Torra, Y.-A. de Montjoye and A. Bourka (2015) Privacy by Design in Big Data — An overview of privacy enhancing technologies in the era of big data analytics, European Union Agency for Network and Information Security (ENISA).

T. Dalenius (1977) Towards a methodology for statistical disclosure control. Statistik Tidskrift 15:429-444.

J. Domingo-Ferrer and K. Muralidhar (2016) New directions in anonymization: permutation paradigm, verifiability by subjects and intruders, transparency to users, Information Sciences 337-338:11-24.

J. Domingo-Ferrer, D. Sánchez and J. Soria-Comas (2016) Co-utility: self-enforcing collaborative protocols with mutual help, Progress in Artificial Intelligence 5(2):105-110.


References III

J. Domingo-Ferrer and J. Soria-Comas (2016) Anonymization in the time of big data, in Privacy in Statistical Databases-PSD 2016, Springer, pp. 225-236.

J. Domingo-Ferrer and V. Torra (2005) Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Data Mining and Knowledge Discovery 11(2):195-212.

C. Duhigg (2012) How companies learn your secrets, New York Times Magazine, Feb. 16.

C. Dwork (2006) Differential privacy, in ICALP'06, LNCS 4052, Springer, pp. 1-12.

C. Dwork and G. N. Rothblum (2016) Concentrated differential privacy (v2), March 16, arXiv:1603.01887v2.

M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page and T. Ristenpart (2014) Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing, in Proc. of the 23rd USENIX Security Symposium, San Diego, CA, USA, pp. 17-32.


References IV

FTC (2014) Data Brokers: A Call for Transparency and Accountability, US Federal Trade Commission.

A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, E. Schulte-Nordholt, K. Spicer and P.-P. de Wolf (2012) Statistical Disclosure Control, Wiley.

N. Li, T. Li and S. Venkatasubramanian (2007) t-Closeness: privacy beyond k-anonymity and l-diversity, in ICDE'07, pp. 106-115.

A. Machanavajjhala and D. Kifer (2015) Designing statistical privacy for your data, Communications of the ACM 58(3):58-67.

A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam (2007) l-Diversity: privacy beyond k-anonymity, ACM Trans. Knowl. Discov. Data 1(1):3.


References V

A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke and L. Vilhuber (2008) Privacy: theory meets practice on the map, in Proceedings of the 2008 IEEE 24th Intl. Conf. on Data Engineering-ICDE'08, Washington, DC, USA. IEEE Computer Society, pp. 277-286.

M. Meeker (2016) 2016 Internet Trends, http://www.kpcb.com/blog/2016-internet-trends-report

P. Mohan, A. Thakurta, E. Shi, D. Song and D. E. Culler (2012) GUPT: privacy preserving data analysis made easy, in Proc. of ACM SIGMOD'12, Scottsdale AZ.

P. Samarati and L. Sweeney (1998) Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression, Technical Report, SRI International.


References VI

D. Sánchez, J. Domingo-Ferrer and S. Martínez (2014) Improving the utility of differential privacy via univariate microaggregation, in Privacy in Statistical Databases-PSD 2014, pp. 130-142. Springer.

D. J. Solove (2011) Nothing to Hide: the False Tradeoff Between Privacy and Security, New Haven: Yale University Press.

C. Song and T. Ge (2014) Aroma: a new data protection method with differential privacy and accurate query answering, in CIKM'14, ACM, pp. 1569-1578.

J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez and S. Martínez (2014) Enhancing data utility in differential privacy via microaggregation-based k-anonymity, VLDB Journal 23(5):771-794.


References VII

J. Soria-Comas and J. Domingo-Ferrer (2015) Big data privacy: challenges to privacy principles and models, Data Science and Engineering 1(1):21-28.

J. Soria-Comas and J. Domingo-Ferrer (2015b) Co-utile collaborative anonymization of microdata, in MDAI 2015, LNCS 9321, Springer, pp. 192-206.

S. L. Warner (1965) Randomized response: a survey technique for eliminating evasive answer bias, J. Am. Stat. Assoc. 60:63-69.

X. Xiao and Y. Tao (2007) M-Invariance: towards privacy-preserving re-publication of dynamic datasets, in SIGMOD'07, ACM, pp. 689-700.

J. Xu, Z. Zhang, X. Xiao, Y. Yang and G. Yu (2012) Differentially private histogram publication, in Proceedings of the 2012 IEEE 28th Intl. Conf. on Data Engineering-ICDE'12, Washington, DC, USA. IEEE Computer Society, pp. 32-43.


References VIII

J. Zhang, G. Cormode, C.M. Procopiuc, D. Srivastava and X. Xiao (2014) Privbayes: private data release via Bayesian networks, in Proceedings of the 2014 ACM SIGMOD Intl. Conf. on Management of Data, SIGMOD'14, New York, NY, USA. ACM, pp. 1423-1434.
