Data Mining and Privacy: Threat or Opportunity?
NSF-Japan CIIP
Public (mis)Perception of Data Mining: Attack on Privacy
• Fears of loss of privacy constrain data mining
– Protests over a National Registry
• In Japan
– Data Mining Moratorium Act
• Would stop all data mining R&D by DoD
• But data mining gives summary results
– Does this violate privacy?
Is Data Mining a Threat?
• Data Mining summarizes data
– Possible exception: Anomaly / Outlier detection
• Summaries aren’t private
– Or are they?
– Does generating them raise issues?
• We have to sell data mining as the privacy solution
– Data mining enables safe use of private data
Public Problems with Data Mining
The problem isn’t Data Mining, it is the infrastructure to support it!
• Japanese registry data already held by prefectures
– Protests arose over moving to a National registry
• Total Information Awareness program doesn’t generate new data
– Goal is to enable use of data from multiple agencies
• Loss of Separation of Control
– Increases potential for misuse
Privacy vs. Confidentiality
• Privacy: I want information about me to be used only for my benefit
• Confidentiality: I want information to go only to those authorized
• Cryptography community understands confidentiality
– Solid, vetted definitions
– Proof techniques
• Not sure if anybody really understands privacy
– But confidentiality often sufficient
Classes of Solutions
• Data Obfuscation
– Nobody sees the real data
• Summarization
– Only the needed facts are exposed
• Data Separation
– Data remains with trusted parties
Confidential Computation
• Idea: Many parties have components of the input to a function
– Want to get the function result
– But not reveal your input component
• Preserves confidentiality of the data
– Unless disclosure inherent in the result
• Example: Secure Sum
– Two parties: disclosure inherent in result
How is this related to Infrastructure Protection?
• Critical Infrastructure not monolithic
– Telecommunications / power interrelated
– Multiple ISPs
• Protecting the infrastructure requires sharing information
– Attack identification and isolation
• Competitors reluctant to share
– Need data analysis without data disclosure!
How is this related to Infrastructure Protection? (2)
• Critical Information Infrastructure
– Information is critical, not the infrastructure
• Confidentiality important
– Want to reveal which power line needs to be cut to take down the whole grid?
• Analysis without disclosure adds safety
– Adversary must break multiple systems to gain access to all data
Secure Multiparty Computation: Definitions
• Secure
– Nobody knows anything but their own input and the results
– Formally: ∃ polynomial time S such that {S(x, f(x,y))} ≡ {View(x,y)}
• This gives us a way to prove confidentiality
– Construct a simulator
• Key ideas
– Distribution of simulator and view the same
– ≡ is computational indistinguishability
Secure Multiparty Computation: It can be done!
• Goal: Compute function when each party has some of the inputs
• Yao’s Millionaire’s problem (Yao ’86)
– Secure computation possible if function can be represented as a circuit
– Idea: Securely compute gate
• Continue to evaluate circuit
• Works for multiple parties as well (Goldreich, Micali, and Wigderson ’87)
How does it work?
• Each side has input, knows circuit to compute function
• Add random value to your input, give to other side
– Each side has share of all inputs
• Compute share of output
– Add results at end
• XOR gate: just add locally
• AND gate: send your share encoded in truth table
– Oblivious transfer allows other side to get only correct value out of truth table
[Figure: circuit with shared inputs A=a1+a2 and B=b1+b2 and shared output C=c1+c2]
value of (a2,b2):   (0,0)      (0,1)          (1,0)          (1,1)
OT-input:           1          2              3              4
value of output:    c1+a1b1    c1+a1(b1+1)    c1+(a1+1)b1    c1+(a1+1)(b1+1)
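The gate rules above can be sketched in a few lines of Python. This is an illustrative toy, not the construction from the cited papers: "+" on the slide is taken to be addition mod 2 (XOR), the function names (`share`, `xor_gate`, `and_gate`, `reconstruct`) are made up for this example, and the oblivious transfer is replaced by a plain table lookup, so the sketch shows only the arithmetic of the truth table and not the hiding of the choice.

```python
import secrets
from itertools import product


def share(bit):
    """Split a bit into two XOR shares: (party 1's share, party 2's share)."""
    r = secrets.randbelow(2)
    return r, bit ^ r


def xor_gate(a_shares, b_shares):
    """XOR gate: each party adds (XORs) its own shares locally."""
    (a1, a2), (b1, b2) = a_shares, b_shares
    return a1 ^ b1, a2 ^ b2


def and_gate(a_shares, b_shares):
    """AND gate: party 1 builds the 4-row truth table from the slide.

    For every possible (a2, b2), the row holds c1 + (a1+a2)(b1+b2) mod 2.
    Party 2 should fetch its row with 1-of-4 oblivious transfer; here the
    lookup is done in the clear (ASSUMPTION: stand-in for the OT step).
    """
    (a1, a2), (b1, b2) = a_shares, b_shares
    c1 = secrets.randbelow(2)  # party 1's random output share
    table = {
        (x, y): c1 ^ ((a1 ^ x) & (b1 ^ y))
        for x, y in product((0, 1), repeat=2)
    }
    c2 = table[(a2, b2)]  # party 2's output share
    return c1, c2


def reconstruct(shares):
    """Add the two shares at the end to recover the real value."""
    return shares[0] ^ shares[1]
```

Reconstructing the output of `and_gate(share(A), share(B))` gives `A & B` for every pair of input bits, which is the property the truth table is designed to guarantee.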
Oblivious Transfer
• What is it?
– A has inputs a_i
– B makes choice
– A doesn’t know choice, B only sees chosen value.
• How?
– A sends public key p to B
– B selects 4 random values b
• encrypts (only) b_choice with f_p, sends all to A
– A decrypts all with private key, sends to B: c_i = a_i ⊕ e(f_p^-1(b_i))
– B outputs c_choice ⊕ e(b_choice) = a_choice ⊕ e(f_p^-1(f_p(b_choice))) ⊕ e(b_choice)
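The message flow above can be traced with a toy example. The sketch below makes several assumptions that are not on the slide: f_p is textbook RSA with the tiny modulus 3233 = 61 × 53, the function e(·) is taken to be the identity, and the function names are invented for illustration. It is far too small to be secure and is meant only to show why B recovers exactly the chosen input.

```python
import secrets

# Toy RSA trapdoor permutation f_p (ASSUMPTION: illustrative parameters only).
N = 3233   # 61 * 53, public modulus
E = 17     # public exponent, sent by A to B
D = 2753   # A's private exponent: 17 * 2753 = 1 (mod 3120)


def receiver_request(choice, count=4):
    """B picks random values and encrypts only the one at `choice`."""
    blinds = [secrets.randbelow(N) for _ in range(count)]
    request = list(blinds)
    request[choice] = pow(blinds[choice], E, N)  # f_p(b_choice)
    return blinds[choice], request               # B keeps b_choice secret


def sender_respond(inputs, request):
    """A decrypts every value and masks its inputs: c_i = a_i XOR f_p^-1(b_i)."""
    return [a ^ pow(y, D, N) for a, y in zip(inputs, request)]


def receiver_output(choice, secret, response):
    """B unmasks only the chosen entry: c_choice XOR b_choice = a_choice."""
    return response[choice] ^ secret
```

For the other positions B does not know f_p^-1(b_i), so with a real trapdoor permutation those entries stay masked; A sees four values that all look random and so does not learn which one was encrypted.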
Association Rules
• Association rules are a common data mining task
– Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer)
• Fast algorithms for centralized and distributed computation
– Basic idea: For AB ⇒ C to be frequent, AB, AC, and BC must all be frequent
– Require sharing data
• Secure Multiparty Computation too expensive
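The "basic idea" bullet is the pruning rule used by Apriori-style algorithms: a larger itemset is only counted if all of its smaller subsets were already frequent. A minimal centralized sketch follows; the function names and the sample transactions are assumptions for illustration, and nothing here is privacy-preserving.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return all itemsets whose support is at least `min_count`."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {
        frozenset([i]) for i in items
        if support(frozenset([i]), transactions) >= min_count
    }
    frequent = set(current)
    size = 2
    while current:
        # Join step: combine frequent sets into candidates of the next size.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: for ABC to be frequent, AB, AC and BC must all be.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current
                   for s in combinations(c, size - 1))
        }
        current = {c for c in candidates
                   if support(c, transactions) >= min_count}
        frequent |= current
        size += 1
    return frequent


baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
```

With `baskets` and `min_count=2` the set ABC is reported as frequent; raising `min_count` to 3 keeps the pairs but drops ABC, since it appears in only two baskets.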
Association Rule Mining: Horizontal Partitioning
• Distributed Association Rule Mining: Easy without sharing the individual data [Cheung+’96] (Exchanging support counts is enough)
• What if we do not want to reveal which rule is supported at which site, the support count of each rule, or database sizes?
• Hospitals want to participate in a medical study
• But rules only occurring at one hospital may be a result of bad practices
• Is the potential public relations / liability cost worth it?
Overview of the Method (Kantarcioglu and Clifton ’02)
• Find the union of the locally large candidate itemsets securely
• After the local pruning, compute the globally supported large itemsets securely
• At the end check the confidence of the potential rules securely
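Computing Candidate Sets
[Figure: three sites in a ring. Site 1 holds ABC, site 2 holds ABD, site 3 holds ABC. Each itemset is encrypted by every site in turn, e.g. E1(ABC), then E3(E1(ABC)), then E2(E3(E1(ABC))). The remaining labels show the encryption layers being removed: E2(E3(ABC)), E2(E3(ABD)); then E3(ABC), E3(ABD); result: ABC, ABD.]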
Compute Which Candidates Are Globally Supported?
• Goal: To check whether
$X.sup \ge s \cdot \sum_{i=1}^{n} |DB_i|$  (1)
$\sum_{i=1}^{n} X.sup_i \ge s \cdot \sum_{i=1}^{n} |DB_i|$  (2)
$\sum_{i=1}^{n} \left( X.sup_i - s \cdot |DB_i| \right) \ge 0$  (3)
Note that checking inequality (1) is equivalent to checking inequality (3)
Which Candidates Are Globally Supported? (Continued)
• Now securely compute Sum ≥ 0:
– Site0 generates random R, sends R + count0 – frequency*dbsize0 to site1
– Sitek adds countk – frequency*dbsizek, sends to sitek+1
• Final result: Is sum at siten – R ≥ 0?
– Use Secure Two-Party Computation
• This protocol is secure in the semi-honest model
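Computing Frequent: Is ABC ≥ 5%?
Site 1: ABC=18, DBSize=300
Site 2: ABC=9, DBSize=200
Site 3: ABC=5, DBSize=100
Each site adds count-freq.*DBSize to the running value; the initiating site starts from R:
Site 3: ABC: R+count-freq.*DBSize, with R=17: 17+5-.05*100 = 17
Site 2: ABC: 17+9-.05*200 = 16
Site 1: ABC: 16+18-.05*300 = 19
ABC: 19 ≥ R?
ABC: YES!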
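The arithmetic of this example can be replayed with a short sketch. The function name `ring_secure_sum` is hypothetical, exact fractions are used to avoid floating-point rounding, and the final comparison against R is done in the clear here, whereas the protocol calls for a secure two-party comparison at that step.

```python
from fractions import Fraction


def ring_secure_sum(sites, min_support, r):
    """Pass a masked running total around the ring of sites.

    sites: list of (count, db_size) in ring order, initiator first.
    Returns the messages each site sends on and the final yes/no answer.
    """
    running = Fraction(r)  # the initiator hides its term behind random R
    messages = []
    for count, db_size in sites:
        running += count - min_support * db_size
        messages.append(running)
    # ASSUMPTION: plain comparison standing in for the secure
    # two-party computation of "final sum - R >= 0".
    return messages, running >= r


# Ring order from the slide: site 3 starts, then site 2, then site 1.
sites = [(5, 100), (9, 200), (18, 300)]
messages, supported = ring_secure_sum(sites, Fraction(5, 100), r=17)
```

Here `messages` holds 17, 16 and 19, matching the values on the slide, and `supported` is true because 19 ≥ 17 (32 supporting transactions out of 600 is just over 5%).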
Horizontal Partitioning of Data
Bank of America
CC#     Active?   Delinquent?   Amount
123     Yes       Yes           <$300
324     No        No            $300-500
919     Yes       No            >$1000
Chase Manhattan
CC#     Active?   Delinquent?   Amount
3450    Yes       Yes           <$300
4127    No        No            $300-500
8772    Yes       No            >$1000
Vertical Partitioning of Data
Medical Records
TID    Brain Tumor?   Diabetes?
RPJ    Yes            Diabetic
CAC    No Tumor       No
PTR    No Tumor       Diabetic
Cell Phone Data
TID    Model   Battery
RPJ    5210    Li/Ion
CAC    none    none
PTR    3650    NiCd
Global Database View
TID    Brain Tumor?   Diabetes?   Model   Battery
Association Rule Mining
• Find out if itemset {A1, B1} is frequent (i.e., if support of {A1, B1} ≥ k)
• Support of itemset is defined as number of transactions in which all attributes of the itemset are present
• For binary data, support = |Ai ∧ Bi|.
Party A
Key    A1
k1     1
k2     0
k3     0
k4     1
k5     1
Party B
Key    B1
k1     0
k2     1
k3     0
k4     1
k5     1
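For the two columns shown on this slide, the support of {A1, B1} is the number of keys where both bits are 1. The sketch below computes it in the clear, which is exactly what the two parties want to avoid, so it only pins down the quantity a secure protocol would have to produce. The dictionary layout and the function name are assumptions for illustration.

```python
# Party A's column and party B's column, keyed by transaction id.
A1 = {"k1": 1, "k2": 0, "k3": 0, "k4": 1, "k5": 1}
B1 = {"k1": 0, "k2": 1, "k3": 0, "k4": 1, "k5": 1}


def binary_support(a, b):
    """Count the keys where both attributes are present (bitwise AND, summed)."""
    return sum(a[key] & b[key] for key in a)


support_a1_b1 = binary_support(A1, B1)
```

Only k4 and k5 have both bits set, so `support_a1_b1` is 2, and {A1, B1} is frequent for any threshold k ≤ 2.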
Association Rule Mining
• Idea based on TID-list representation of data
– Represent attribute A as TID-list Atid
– Support of ABC is |Atid ∩ Btid ∩ Ctid|
• Use a secure protocol to find size of set intersection to find candidate sets
Cardinality of Set Intersection
• Use a secure commutative hash function
• Pohlig-Hellman Encryption
• Each party generates own encryption key
• All parties encrypt all the input sets
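$\because\; E_i(\cdots E_j(X) \cdots) = E_k(\cdots E_l(X) \cdots)$ (the order in which the keys are applied does not matter)
$\therefore\; \text{Result} = \#\,\text{objects common in all sets}$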
Cardinality of Set Intersection
• Hashing
– All parties hash all sets with their key
• Initial intersection
– Each party finds intersection of all sets (except its own)
• Final intersection
– Parties exchange the final intersection set, and compute the intersection of all sets
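The effect of commutative encryption can be illustrated with Pohlig-Hellman style exponentiation, E_k(x) = x^k mod p. The sketch below is a simplified single-process simulation under several assumptions: the modulus is the Mersenne prime 2^127 − 1, items are mapped into the group with SHA-256, keys come from a seeded non-cryptographic generator, and all names are invented for the example. It does not model the message passing between parties, only the fact that fully encrypted values match exactly when the underlying items match.

```python
import hashlib
import math
import random

P = 2**127 - 1  # ASSUMPTION: toy prime modulus, chosen for illustration


def to_group(item):
    """Map an item to an integer in [1, P-1] via SHA-256."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def make_key(rng):
    """Pick an exponent coprime to P-1, so x -> x^k mod P is a bijection."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(values, key):
    """Apply one party's key to every element of a set."""
    return {pow(v, key, P) for v in values}


def intersection_size(sets, seed=0):
    """Encrypt every set under all keys, then count the common values."""
    rng = random.Random(seed)  # not cryptographically secure; demo only
    keys = [make_key(rng) for _ in sets]
    fully_encrypted = []
    for i, items in enumerate(sets):
        values = {to_group(x) for x in items}
        # Each set meets the keys in a different order; commutativity
        # makes the final ciphertexts comparable anyway.
        for key in keys[i:] + keys[:i]:
            values = encrypt(values, key)
        fully_encrypted.append(values)
    return len(set.intersection(*fully_encrypted))


tid_lists = [{"k1", "k4", "k5"}, {"k2", "k4", "k5"}, {"k3", "k4", "k5"}]
```

Calling `intersection_size(tid_lists)` counts the two transaction ids shared by all three lists without any comparison ever being made on the plaintext ids.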
Other Results
• ID3 Decision Tree learning
– Horizontal Partitioning: Lindell&Pinkas ’00
– Also vertical partitioning (Du, Vaidya)
• K-Means / EM Clustering
• K-Nearest Neighbor
• Naïve Bayes, Bayes network structure
• Outlier detection
http://www.cs.purdue.edu/people/clifton#ppdm
Open Challenge: Do Results Compromise Privacy?
• Example: Association Rule
– Professor ∧ U.S. ∧ Computer Science ⇒ Salary > $60k
– Doesn’t this violate privacy of salary?
• Idea: Think of data in three categories
– Sensitive: We don’t want an adversary to know it
– Public: We must assume the adversary may know it
– Unknown: We can assume the adversary doesn’t know it, but we don’t mind if they do
• Data mining model generates one from the other
– Can we analyze the impact on Sensitive data?
• Answer not obvious (see our KDD’04 paper)
Next Steps
• Technically meaningful privacy definitions
– Not all or nothing
– Cost of misuse vs. potential for misuse?
• Understand interplay between data mining, statistics, and privacy
• Establish (intellectual) standards for privacy
– E.g., what the cryptography community has done for confidentiality