Privacy-oriented Data Mining by Proof Checking


Page 1: Privacy-oriented Data Mining  by Proof Checking

Privacy-oriented Data Mining by Proof Checking

Stan Matwin (joint work with Amy Felty)

SITE

University of Ottawa, Canada [email protected]

Page 2: Privacy-oriented Data Mining  by Proof Checking

UBC 29/7/03 2

The TAMALE Group

• 4 profs
• Some 30 graduate students
• Areas: machine learning, data mining, text mining, NLP, data warehousing
• Research in
  – Inductive Logic Programming
  – Text mining
  – Learning in the presence of knowledge
  – Applications of ML/DM (e.g. in SE: tools for maintenance personnel)

Page 3: Privacy-oriented Data Mining  by Proof Checking

• Why did I get into this research?
• What is already being done… and why it's not enough
• The main idea
• Its operation
• Discussion: 'correctness'
• Prototype: Coq and CIC
• Example
• Some technical challenges
• Acceptance?

Page 4: Privacy-oriented Data Mining  by Proof Checking

Some useful concepts...

• opting out vs opting in

• Use Limitation Principle: data should be used only for the explicit purpose for which it has been collected

Page 5: Privacy-oriented Data Mining  by Proof Checking

…and existing technical proposals

On the web: P3P Platform for Privacy Preferences

• W3C standard
• XML specifications, on websites and in browsers, of what can be collected and for what purpose (a ULP?)
• Handles cookies
• A data exchange protocol more than a privacy protocol: no provisions for opting out after an initial opt-in
• the ULP part is in natural language… not verifiable
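To make the idea concrete, the sketch below shows how a browser-side agent might read a policy's declared purposes and compare them with a user's preferences. The XML tags and field names here are hypothetical and heavily simplified; real P3P uses a much richer vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified policy document (not the real P3P schema).
POLICY = """
<policy>
  <statement>
    <data>email</data>
    <purpose>order-fulfilment</purpose>
  </statement>
  <statement>
    <data>email</data>
    <purpose>marketing</purpose>
  </statement>
</policy>
"""

def declared_purposes(policy_xml, field):
    """Collect every purpose the policy declares for a given data field."""
    root = ET.fromstring(policy_xml)
    return {s.findtext("purpose")
            for s in root.iter("statement")
            if s.findtext("data") == field}

# A user agent rejects the policy if it declares a disallowed purpose.
allowed = {"order-fulfilment"}
print(declared_purposes(POLICY, "email") <= allowed)  # False: marketing is also declared
```

Note that this only *compares declarations*; nothing in the protocol verifies that the site's software actually honours them, which is exactly the gap the talk addresses.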

Page 6: Privacy-oriented Data Mining  by Proof Checking

Agrawal's data perturbation transformations

• data is perturbed by random distortion: xi → xi + r, with r uniform or Gaussian
• a procedure to reconstruct a PAC-estimate of the original distribution (but not the individual values)
• a procedure to build an accurate decision tree on the perturbed distribution
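A minimal numeric sketch of the additive-noise idea (illustrative data; this is not Agrawal & Srikant's actual reconstruction procedure): each released value is heavily distorted, but because the noise has zero mean, aggregates over many records can still be estimated well.

```python
import random

random.seed(0)

# Original sensitive values, e.g. salaries (illustrative data).
original = [random.gauss(50_000, 10_000) for _ in range(10_000)]

# Each value is released only after adding zero-mean noise r.
perturbed = [x + random.gauss(0, 25_000) for x in original]

mean = lambda xs: sum(xs) / len(xs)

# Individual values are distorted by noise with a large standard deviation,
# yet aggregates survive: the noise averages out over 10,000 records.
print(abs(mean(original) - mean(perturbed)) < 2_000)  # True
```

The full scheme goes further: it reconstructs an estimate of the whole original *distribution* from the perturbed values, which is what the decision-tree learner then consumes.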

Page 7: Privacy-oriented Data Mining  by Proof Checking

Agrawal's transformations cont'd

• proposes a measure to quantify privacy: estimate intervals and their size

• lately extended to non-numerical attributes, and to association rules

• does not address the ULP

• how do we know it is applied?

Page 8: Privacy-oriented Data Mining  by Proof Checking

the main idea: towards a verifiable ULP

• User sets permissions: what can and cannot be done with her data
• Any claim that software respects these permissions is a theorem about the software
• Verifying the claim is then checking a proof of that theorem against the software

Page 9: Privacy-oriented Data Mining  by Proof Checking

Who are the players?

• User C

• Data miner Org

• Data mining software developer Dev

• Independent verifier Veri

• …BUT no one owns the data D

Page 10: Privacy-oriented Data Mining  by Proof Checking

D: database schema
A: a given set of database and data mining operations
S: source code for A
B: binary code of S
PC(D,A): C's permissions
T(PC,S): theorem that S respects PC
R(PC,S): proof of T(PC,S)

Page 11: Privacy-oriented Data Mining  by Proof Checking

Discussion - properties

• It can be proven that C's permissions are respected (or not): PC is in fact a verifiable ULP

• PC can be negative (out) or positive (in)

• proof construction needs to be done only once for a given PC, D and A

• Scheme is robust against cheating by Dev or Org

Page 12: Privacy-oriented Data Mining  by Proof Checking

Acceptance issues

• No Org will give Veri access to S

• Too much overhead to check R(PC,S) for each task, and each user

• Too cumbersome for C

• Based on all Orgs buying in

Page 13: Privacy-oriented Data Mining  by Proof Checking

Acceptance 1: Veri's operation (access)

• Veri needs
  – PC from C
  – R(PC, S) from Dev
  – S from Dev
  – B from Org
• Veri could check R(PC, S) at Dev's premises
• Veri needs to verify that S (normally belonging to Dev) corresponds to the B that Org runs

Page 14: Privacy-oriented Data Mining  by Proof Checking

Acceptance 2: overhead

• Veri runs proof checking on a spot-check (control) basis

• Org's execution overhead?

Page 15: Privacy-oriented Data Mining  by Proof Checking

Issues

• Naming the fields: XML or disclosure

• restricted class of theorems for a given PC; automating proof techniques for this class

Page 16: Privacy-oriented Data Mining  by Proof Checking

Acceptance 3: C's perspective

• Building PCs must be easy for C, based on D and processing schema; initially a closed set?

• permissions could be encoded on a credit card, smart card, in the electronic wallet

• or in the CA; they can then be dynamically modified and revoked

Page 17: Privacy-oriented Data Mining  by Proof Checking

« Political » aspects: who is Veri?

• generally trusted:
  – « consumer association »?
  – « Ralph Nader »?
  – « transparency international »?
• an IT expert at the level of instrumenting and running the proof checker; connection to the Open Software Foundation?
• theorem proving can be cast as « better testing »

Page 18: Privacy-oriented Data Mining  by Proof Checking

how to make Orgs buy in?

• A first Org needs to volunteer

• a Green Data Mining logo will be granted and administered (verified) by Veri

• other Orgs will have an incentive to join

Page 19: Privacy-oriented Data Mining  by Proof Checking

Future work

• Build the tools

• expand the prototype

• extend from Weka to commercial data mining packages

• Integrate with P3P?

• find a willing Org

Page 21: Privacy-oriented Data Mining  by Proof Checking

Link between S and B

• compilation not an option

• watermarking solution: B is watermarked by a slightly modified compiler with MD5(tar(S)) = 128 bits (16 bytes)

• marks are inserted by a trusted makefile-and-compiler in locations in B given by Veri and unknown to Org
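A minimal sketch of the scheme, with the "binary" modelled as a byte array and the watermark locations derived from a secret seed. The function names and the seed-based location scheme are illustrative assumptions, not the talk's actual compiler modification.

```python
import hashlib
import random

def watermark(binary: bytearray, source_tar: bytes, seed: int) -> bytearray:
    """Embed MD5(source_tar) -- 16 bytes -- at locations derived from a
    secret seed known to Veri but not to Org (illustrative scheme only)."""
    digest = hashlib.md5(source_tar).digest()   # 128 bits = 16 bytes
    rng = random.Random(seed)
    locations = rng.sample(range(len(binary)), len(digest))
    marked = bytearray(binary)
    for loc, byte in zip(locations, digest):
        marked[loc] = byte
    return marked

def verify(binary: bytearray, source_tar: bytes, seed: int) -> bool:
    """Veri recomputes the digest from S and checks it appears in B."""
    digest = hashlib.md5(source_tar).digest()
    rng = random.Random(seed)
    locations = rng.sample(range(len(binary)), len(digest))
    return all(binary[loc] == byte for loc, byte in zip(locations, digest))

source_tar = b"...contents of tar(S)..."
binary = bytearray(1024)                 # stand-in for compiled B
marked = watermark(binary, source_tar, seed=42)
print(verify(marked, source_tar, seed=42))           # True
print(verify(marked, b"tampered source", seed=42))   # False: digest differs
```

The security argument mirrors the next slide: without the seed (the watermark locations), Org cannot forge a B that passes verification for a different S.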

Page 22: Privacy-oriented Data Mining  by Proof Checking

Link…

• Veri, given access to S, can verify that B corresponds to S

• An attack by an intruder requires hacking the compiler

• An attack by Org requires knowing the locations of watermarks

Page 23: Privacy-oriented Data Mining  by Proof Checking

Example

C restricts her Employee data from participating in a join with her Payroll data

Record Payroll : Set :=
  mkPay {PID : nat; JoinInd : bool; Position : string; Salary : nat}.

Record Employee : Set :=
  mkEmp {Name : string; EID : nat; …}.

Record Combined : Set :=
  mkComb {CID : nat; CName : string; Csalary : nat; …}.

Page 24: Privacy-oriented Data Mining  by Proof Checking

Fixpoint Join [Ps : list Payroll] : (list Employee) -> (list Combined) :=
  [Es : list Employee]
    Cases Ps of
      nil => (nil Combined)
    | (cons p ps) =>
        (app (check_JoinInd_and_find_employee_record p Es) (Join ps Es))
    end.

(check_JoinInd_and_find_employee_record p Es): if a record is found in Es whose EID matches p's PID and JoinInd permits the Join, a list of length 1 with the joined record is returned; otherwise the empty list.
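In conventional terms, the permission-respecting Join above behaves like the following Python sketch (a loose analogue for illustration; field names mirror the Coq records, and the sample data is invented):

```python
from dataclasses import dataclass

@dataclass
class Payroll:
    pid: int
    join_ind: bool   # C's permission bit: may this record be joined?
    position: str
    salary: int

@dataclass
class Employee:
    name: str
    eid: int

def join(payrolls, employees):
    """Join Payroll with Employee, but only where JoinInd permits it."""
    combined = []
    for p in payrolls:
        if not p.join_ind:
            continue                     # permission denied: skip record
        for e in employees:
            if e.eid == p.pid:
                combined.append((p.pid, e.name, p.salary))
    return combined

payrolls = [Payroll(1, True, "dev", 70_000), Payroll(2, False, "mgr", 90_000)]
employees = [Employee("Ann", 1), Employee("Bob", 2)]
result = join(payrolls, employees)
print(result)   # [(1, 'Ann', 70000)] -- Bob's record is excluded
```

The theorem PC on the next slide states exactly this property: no Combined record exists for a Payroll record whose JoinInd is false.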

Page 25: Privacy-oriented Data Mining  by Proof Checking

Definition PC : ((list Payroll) -> (list Employee) -> (list Combined)) -> Prop :=
  [S : (list Payroll) -> (list Employee) -> (list Combined)]
    (Ps : list Payroll) (Es : list Employee)
      (UniqueJoinInd Ps) ->
      (P : Payroll) (In P Ps) -> ((JoinInd P) = false) ->
        ~(EXT C : Combined | (In C (S Ps Es)) /\ ((CID C) = (PID P))).

• PC(S) is written as (PC Join): Coq expands the definition of PC and produces the theorem

• a request to Coq's proof-checking operator will check this proof, i.e. it will check that the user's permissions are respected by the given Join program

• whole proof: 300 lines of Coq code; proof checking: 1 sec on a 600 MHz machine