
NSF-Japan CIIP

Data Mining and Privacy: Threat or Opportunity?

Chris Clifton


Public (mis)Perception of Data Mining: Attack on Privacy

• Fears of loss of privacy constrain data mining

– Protests over a National Registry

• In Japan

– Data Mining Moratorium Act

• Would stop all data mining R&D by DoD

• But data mining gives summary results

– Does this violate privacy?


Is Data Mining a Threat?

• Data Mining summarizes data

– Possible exception: Anomaly / Outlier detection

• Summaries aren’t private

– Or are they?

– Does generating them raise issues?

• We have to sell data mining as the privacy solution

– Data mining enables safe use of private data


Public Problems with Data Mining

The problem isn’t Data Mining, it is the infrastructure to support it!

• Japanese registry data already held by prefectures

– Protests arose over moving to a National registry

• Total Information Awareness program doesn’t generate new data

– Goal is to enable use of data from multiple agencies

• Loss of Separation of Control

– Increases potential for misuse


Privacy vs. Confidentiality

• Privacy: I want information about me to be used only for my benefit

• Confidentiality: I want information to go only to those authorized

• Cryptography community understands confidentiality

– Solid, vetted definitions

– Proof techniques

• Not sure if anybody really understands privacy

– But confidentiality often sufficient


Classes of Solutions

• Data Obfuscation

– Nobody sees the real data

• Summarization

– Only the needed facts are exposed

• Data Separation

– Data remains with trusted parties


Confidential Computation

• Idea: Many parties have components of the input to a function

– Want to get the function result

– But not reveal your input component

• Preserves confidentiality of the data

– Unless disclosure inherent in the result

• Example: Secure Sum

– Two parties: disclosure inherent in result
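
The secure sum example can be sketched as a ring protocol: the first party masks its value with a random number, every party adds its own value to the running total, and the first party removes the mask at the end. This is only an illustrative simulation in a single process; the function name `secure_sum`, the modulus, and the sample values are assumptions, not part of the slides.

```python
import random

def secure_sum(values, modulus=2**32, rng=None):
    """Simulate a ring-based secure sum (values must sum to less than modulus).

    Returns the true total and the masked messages passed between parties.
    """
    rng = rng or random.Random()
    mask = rng.randrange(modulus)      # random R known only to the first party
    running = mask
    messages = []
    for v in values:                   # each party adds its private value
        running = (running + v) % modulus
        messages.append(running)       # what the next party gets to see
    total = (running - mask) % modulus # first party removes the mask
    return total, messages

total, messages = secure_sum([12, 7, 30])
print(total)  # 49

# With two parties the result itself discloses the other input:
x, y = 12, 7
two_party_total, _ = secure_sum([x, y])
print(two_party_total - x)  # 7, the other party's value
```

Each intermediate message is uniformly distributed because of the mask, so a party learns nothing from it alone. The last two lines illustrate the "disclosure inherent in result" case: no protocol can hide `y` from a party that knows `x` and the sum.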


How is this related to Infrastructure Protection?

• Critical Infrastructure not monolithic

– Telecommunications / power interrelated

– Multiple ISPs

• Protecting the infrastructure requires sharing information

– Attack identification and isolation

• Competitors reluctant to share

– Need data analysis without data disclosure!


How is this related to Infrastructure Protection? (2)

• Critical Information Infrastructure

– Information is critical, not the infrastructure

• Confidentiality important

– Want to reveal which power line needs to be cut to take down the whole grid?

• Analysis without disclosure adds safety

– Adversary must break multiple systems to gain access to all data


Gold Standard: Trusted Third Party


Secure Multiparty Computation: Definitions

• Secure

– Nobody knows anything but their own input and the results

– Formally: ∃ polynomial time S such that {S(x, f(x,y))} ≡ {View(x,y)}

• This gives us a way to prove confidentiality

– Construct a simulator

• Key ideas

– Distribution of simulator and view the same

– ≡ is computational indistinguishability


Secure Multiparty Computation: It can be done!

• Goal: Compute function when each party has some of the inputs

• Yao’s Millionaire’s problem (Yao ’86)

– Secure computation possible if function can be represented as a circuit

– Idea: Securely compute gate

• Continue to evaluate circuit

• Works for multiple parties as well (Goldreich, Micali, and Wigderson ’87)

How does it work?

• Each side has input, knows circuit to compute function

• Add random value to your input, give to other side

– Each side has share of all inputs

• Compute share of output

– Add results at end

• XOR gate: just add locally

• AND gate: send your share encoded in truth table

– Oblivious transfer allows other side to get only correct value out of truth table

[Figure: circuit with input shares a1, b1 and a2, b2, where A=a1+a2 and B=b1+b2, and output shares c1, c2, where C=c1+c2]

value of (a2,b2)   (0,0)      (0,1)          (1,0)          (1,1)
OT-input           1          2              3              4
value of output    c1+a1b1    c1+a1(b1+1)    c1+(a1+1)b1    c1+(a1+1)(b1+1)
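
The gate evaluation described on this slide can be sketched over single bits, where "+" is addition mod 2 (XOR). The code below is a single-process illustration under stated assumptions: the function names are hypothetical, and the oblivious transfer is replaced by a plain table lookup, so it shows the arithmetic of the shares rather than a secure implementation.

```python
import random
from itertools import product

def share(bit, rng):
    """Split a bit into two additive shares mod 2: bit = s1 XOR s2."""
    s1 = rng.randrange(2)
    return s1, bit ^ s1

def xor_gate(a_sh, b_sh):
    """XOR gate: each party adds its own shares locally, no communication."""
    return a_sh[0] ^ b_sh[0], a_sh[1] ^ b_sh[1]

def and_gate(a_sh, b_sh, rng):
    """AND gate: party 1 builds a 4-entry table, party 2 picks one entry."""
    a1, a2 = a_sh
    b1, b2 = b_sh
    c1 = rng.randrange(2)  # party 1's random output share
    # One entry per possible (a2, b2): c1 + (a1 + a2)(b1 + b2) mod 2
    table = {(x, y): c1 ^ ((a1 ^ x) & (b1 ^ y))
             for x, y in product((0, 1), repeat=2)}
    c2 = table[(a2, b2)]   # stand-in for 1-out-of-4 oblivious transfer
    return c1, c2

def reconstruct(sh):
    """Add the two shares to recover the wire value."""
    return sh[0] ^ sh[1]

rng = random.Random()
a_sh, b_sh = share(1, rng), share(1, rng)
print(reconstruct(and_gate(a_sh, b_sh, rng)))  # 1, since 1 AND 1 = 1
print(reconstruct(xor_gate(a_sh, b_sh)))       # 0, since 1 XOR 1 = 0
```

The dictionary `table` corresponds to the "value of output" row of the table above. In a real protocol party 2 would obtain `table[(a2, b2)]` through oblivious transfer, so party 1 never learns (a2, b2) and party 2 never sees the other three entries.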


Oblivious Transfer

• What is it?

– A has inputs a_i

– B makes choice

– A doesn’t know choice, B only sees chosen value.

• How?

– A sends public key p to B

– B selects 4 random values b

• encrypts (only) b_choice with f_p, sends all to A

– A decrypts all with private key, sends to B: c_i = a_i ⊕ e(f_p^-1(b_i))

– B outputs c_choice ⊕ e(b_choice) = a_choice ⊕ e(f_p^-1(f_p(b_choice))) ⊕ e(b_choice) = a_choice
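
The oblivious transfer steps above can be traced with a toy example. Everything concrete here is an assumption made for illustration: the trapdoor permutation f_p is textbook RSA with a tiny modulus (completely insecure), the masking function e is one byte of SHA-256, and the function names are hypothetical.

```python
import hashlib
import random

# Toy RSA trapdoor permutation (N = 61 * 53); far too small for real use.
N, E, D = 3233, 17, 2753

def f(x):
    """Public direction f_p, computable by B from A's public key."""
    return pow(x, E, N)

def f_inv(y):
    """Private direction f_p^-1, computable only by A."""
    return pow(y, D, N)

def e(x):
    """Masking function e: one byte derived from x (an assumed choice)."""
    return hashlib.sha256(str(x).encode()).digest()[0]

def receiver_request(choice, n_inputs, rng):
    """B sends random values, but only position `choice` is f(b) for a known b."""
    b = rng.randrange(2, N)
    request = [rng.randrange(2, N) for _ in range(n_inputs)]
    request[choice] = f(b)
    return b, request

def sender_respond(inputs, request):
    """A inverts every value and masks each input: c_i = a_i XOR e(f_inv(b_i))."""
    return [a ^ e(f_inv(r)) for a, r in zip(inputs, request)]

def receiver_output(choice, b, response):
    """B unmasks only the chosen entry: c_choice XOR e(b) = a_choice."""
    return response[choice] ^ e(b)

inputs = [10, 20, 30, 40]          # A's four one-byte inputs
b, request = receiver_request(2, len(inputs), random.Random())
response = sender_respond(inputs, request)
print(receiver_output(2, b, response))  # 30
```

A sees four values that all look random, so it cannot tell which one B can invert. B can unmask only the chosen entry, because computing the mask for the others would require inverting f without the private key.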


Association Rules

• Association rules a common data mining task

– Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer)

• Fast algorithms for centralized and distributed computation

– Basic idea: For AB ⇒ C to be frequent, AB, AC, and BC must all be frequent

– Require sharing data

• Secure Multiparty Computation too expensive
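
As a small centralized illustration of the terms on this slide, the sketch below counts support, computes the confidence of a rule, and applies the pruning idea that a candidate can only be frequent if all of its smaller subsets are. The transactions and function names are made up for the example.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing lhs that also contain rhs."""
    both = support_count(set(lhs) | set(rhs), transactions)
    return both / support_count(lhs, transactions)

def may_be_frequent(candidate, frequent_smaller):
    """Pruning test: every subset one item smaller must already be frequent."""
    k = len(candidate) - 1
    return all(frozenset(sub) in frequent_smaller
               for sub in combinations(candidate, k))

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer"},
    {"diapers", "beer", "chips"},
]
print(support_count({"diapers", "beer"}, transactions))  # 3
print(confidence({"diapers"}, {"beer"}, transactions))   # 0.75

# ABC is pruned when BC is not among the frequent 2-itemsets.
print(may_be_frequent(("A", "B", "C"), {frozenset("AB"), frozenset("AC")}))  # False
```

All of this assumes one party sees every transaction, which is exactly the data sharing the following slides try to avoid.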


Association Rule Mining: Horizontal Partitioning

• Distributed Association Rule Mining: Easy without sharing the individual data [Cheung+’96] (Exchanging support counts is enough)

• What if we do not want to reveal which rule is supported at which site, the support count of each rule, or database sizes?

• Hospitals want to participate in a medical study

• But rules only occurring at one hospital may be a result of bad practices

• Is the potential public relations / liability cost worth it?


Overview of the Method (Kantarcioglu and Clifton ’02)

• Find the union of the locally large candidate itemsets securely

• After the local pruning, compute the globally supported large itemsets securely

• At the end check the confidence of the potential rules securely


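Computing Candidate Sets

[Figure: sites 1, 2, and 3 hold the locally large itemsets ABC, ABD, and ABC respectively. Messages shown between sites include E1(ABC), E3(E1(ABC)), E2(E3(E1(ABC))), E3(ABC), E2(E3(ABC)), E3(ABD), and E2(E3(ABD)). The resulting union is ABC, ABD.]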
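
The candidate-set step relies on encryption that commutes: once every site has applied its key, the same itemset yields the same ciphertext regardless of the order of encryption, so duplicates collapse without revealing which site contributed them. The sketch below assumes Pohlig-Hellman style exponentiation modulo a prime as the commutative cipher and simulates all sites in one process; the names, the prime, and the string encoding are illustrative assumptions.

```python
import math
import random

P = 2**127 - 1  # a Mersenne prime, used as the shared modulus

def keygen(rng):
    """Pick an exponent that is invertible mod P - 1."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def enc(k, x):
    return pow(x, k, P)

def dec(k, y):
    return pow(y, pow(k, -1, P - 1), P)

def encode(item):
    """Invertible encoding of a short itemset name as a group element."""
    x = int.from_bytes(item.encode(), "big")
    assert 1 < x < P, "itemset name too long for this toy encoding"
    return x

def decode(x):
    return x.to_bytes((x.bit_length() + 7) // 8, "big").decode()

def secure_union(site_itemsets, rng):
    keys = [keygen(rng) for _ in site_itemsets]
    fully_encrypted = set()
    for s, items in enumerate(site_itemsets):
        order = keys[s:] + keys[:s]     # each site's items start with its own key
        for item in items:
            c = encode(item)
            for k in order:
                c = enc(k, c)
            fully_encrypted.add(c)      # equal itemsets collide here
    union = set()
    for c in fully_encrypted:
        for k in keys:                  # every site removes its layer
            c = dec(k, c)
        union.add(decode(c))
    return union

print(sorted(secure_union([{"ABC"}, {"ABD"}, {"ABC"}], random.Random())))
# ['ABC', 'ABD']
```

The two copies of ABC are encrypted in different key orders yet produce the same ciphertext, which is the property the figure relies on. A real deployment would also shuffle the encrypted items between hops.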


Compute Which Candidates Are Globally Supported?

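• Goal: To check whether

$$X.sup \ge s \cdot \sum_{i=1}^{n} |DB_i| \qquad (1)$$

$$\sum_{i=1}^{n} X.sup_i \ge s \cdot \sum_{i=1}^{n} |DB_i| \qquad (2)$$

$$\sum_{i=1}^{n} \left( X.sup_i - s \cdot |DB_i| \right) \ge 0 \qquad (3)$$

Note that checking inequality (1) is equivalent to checking inequality (3)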


Which Candidates Are Globally Supported? (Continued)

• Now securely compute Sum ≥ 0:

– Site 0 generates random R, sends R + count_0 - frequency*dbsize_0 to site 1

– Site k adds count_k - frequency*dbsize_k, sends to site k+1

• Final result: Is sum at site n - R ≥ 0?

– Use Secure Two-Party Computation

• This protocol is secure in the semi-honest model


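Computing Frequent: Is ABC ≥ 5%?

[Figure: three sites in a ring. Site 1: ABC=18, DBSize=300. Site 2: ABC=9, DBSize=200. Site 3: ABC=5, DBSize=100.]

ABC: R+count-freq.*DBSize, with R=17

ABC: 17+9-.05*200 (= 16, computed at site 2)

ABC: 16+18-.05*300 (= 19, computed at site 1)

ABC: 19 ≥ R?

ABC: YES!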
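
The numbers in this example can be checked with a short simulation of the masked running sum. The sketch assumes site 3 is the one that picks R, which is the ordering consistent with the messages shown; the function name is hypothetical, and the final comparison is done in the clear here, whereas the protocol on the previous slide uses secure two-party computation for it.

```python
from fractions import Fraction

def masked_excess_messages(sites, threshold, R):
    """Return the running values passed around the ring.

    sites: list of (count, db_size) in ring order, starting with the
           site that generated the random mask R.
    """
    running = R
    messages = []
    for count, db_size in sites:
        running += count - threshold * db_size  # local excess support
        messages.append(running)
    return messages

R = 17
ring = [(5, 100), (9, 200), (18, 300)]          # site 3, site 2, site 1
messages = masked_excess_messages(ring, Fraction(5, 100), R)
print([int(m) for m in messages])  # [17, 16, 19]
print(messages[-1] >= R)           # True: ABC is globally supported
```

The unmasked excess is 19 - 17 = 2, matching a global support of 32 out of 600 transactions, just above the 5% threshold of 30. `Fraction` is used so that 0.05 is represented exactly.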


Horizontal Partitioning of Data

Bank of America

CC#    Active?   Delinquent?   Amount
123    Yes       Yes           <$300
324    No        No            $300-500
919    Yes       No            >$1000

Chase Manhattan

CC#    Active?   Delinquent?   Amount
3450   Yes       Yes           <$300
4127   No        No            $300-500
8772   Yes       No            >$1000


Vertical Partitioning of Data

Global Database View (columns): TID, Brain Tumor?, Diabetes?, Model, Battery

Medical Records

TID   Brain Tumor?   Diabetes?
RPJ   Yes            Diabetic
CAC   No Tumor       No
PTR   No Tumor       Diabetic

Cell Phone Data

TID   Model   Battery
RPJ   5210    Li/Ion
CAC   none    none
PTR   3650    NiCd


Association Rule Mining

• Find out if itemset {A1, B1} is frequent (i.e. if support of {A1, B1} ≥ k)

A              B
Key   A1       Key   B1
k1    1        k1    0
k2    0        k2    1
k3    0        k3    0
k4    1        k4    1
k5    1        k5    1

• Support of itemset is defined as number of transactions in which all attributes of the itemset are present

• For binary data, support = |A_i ∧ B_i|.
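
For the binary columns on this slide, the support of {A1, B1} is the number of keys where both attributes are 1, which is the dot product of the two columns. The sketch below computes it in the clear purely to show the target value; in the setting of the talk, A and B hold their columns separately and would compute the same number with a secure protocol. The function name is an assumption.

```python
# Columns as held by the two parties, keyed by transaction ID.
A1 = {"k1": 1, "k2": 0, "k3": 0, "k4": 1, "k5": 1}
B1 = {"k1": 0, "k2": 1, "k3": 0, "k4": 1, "k5": 1}

def binary_support(*columns):
    """Count keys for which every column has a 1 (a dot product for two columns)."""
    keys = columns[0].keys()
    return sum(all(col[k] for col in columns) for k in keys)

print(binary_support(A1, B1))  # 2, from rows k4 and k5
```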


Association Rule Mining

• Idea based on TID-list representation of data

– Represent attribute A as TID-list A_tid

– Support of ABC is |A_tid ∩ B_tid ∩ C_tid|

• Use a secure protocol to find size of set intersection to find candidate sets


Cardinality of Set Intersection

• Use a secure commutative hash function

• Pohlig-Hellman Encryption

• Each party generates own encryption key

• All parties encrypt all the input sets

∵ E_k(E_l(⋯(X)⋯)) = E_i(E_j(⋯(X)⋯)) for any ordering of the keys

∴ Result = # objects common to all sets


Cardinality of Set Intersection

• Hashing

– All parties hash all sets with their key

• Initial intersection

– Each party finds intersection of all sets (except its own)

• Final intersection

– Parties exchange the final intersection set, and compute the intersection of all sets
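
The idea behind the cardinality protocol can be sketched as follows: every set is hashed into a group and then encrypted under every party's key, after which equal items have equal ciphertexts and only the size of the intersection is read off. The sketch assumes exponentiation modulo a prime as the commutative function (in the spirit of Pohlig-Hellman) and SHA-256 for the initial hashing; it runs all parties in one process and skips the message routing, so it illustrates the arithmetic rather than the full protocol.

```python
import hashlib
import math
import random

P = 2**127 - 1  # shared prime modulus (a Mersenne prime)

def keygen(rng):
    """Secret exponent coprime to P - 1, so exponentiation is a permutation."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def to_group(item):
    """Hash an item to a group element in [2, P - 2]."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return h % (P - 3) + 2

def intersection_size(party_sets, rng):
    keys = [keygen(rng) for _ in party_sets]   # one secret key per party
    encrypted_sets = []
    for items in party_sets:
        cipher = {to_group(x) for x in items}
        for k in keys:                         # every party applies its key
            cipher = {pow(c, k, P) for c in cipher}
        encrypted_sets.append(cipher)
    # Only ciphertexts are compared, so the common items stay hidden.
    return len(set.intersection(*encrypted_sets))

A_tid = {"k1", "k4", "k5"}
B_tid = {"k2", "k4", "k5"}
C_tid = {"k3", "k4", "k5"}
print(intersection_size([A_tid, B_tid, C_tid], random.Random()))  # 2
```

Because no party holds all the keys, nobody can decrypt the common ciphertexts back to transaction IDs; the parties learn the count 2 but not that the shared entries are k4 and k5.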


Other Results

• ID3 Decision Tree learning

– Horizontal Partitioning: Lindell & Pinkas ’00

– Also vertical partitioning (Du, Vaidya)

• K-Means / EM Clustering

• K-Nearest Neighbor

• Naïve Bayes, Bayes network structure

• Outlier detection

http://www.cs.purdue.edu/people/clifton#ppdm


Open Challenge: Do Results Compromise Privacy?

• Example: Association Rule

– Professor ∧ U.S. ∧ Computer Science ⇒ Salary > $60k

– Doesn’t this violate privacy of salary?

• Idea: Think of data in three categories

– Sensitive: We don’t want an adversary to know it

– Public: We must assume the adversary may know it

– Unknown: We can assume the adversary doesn’t know it, but we don’t mind if they do

• Data mining model generates one from the other

– Can we analyze the impact on Sensitive data?

• Answer not obvious (see our KDD’04 paper)


Next Steps

• Technically meaningful privacy definitions

– Not all or nothing

– Cost of misuse vs. potential for misuse?

• Understand interplay between data mining, statistics, and privacy

• Establish (intellectual) standards for privacy

– E.g., what the cryptography community has done for confidentiality