HIDE: Privacy Preserving Medical Data Publishinglxiong/cs573_s12/share/slides/0126_hide... · HIPAA...

HIDE: Privacy Preserving Medical Data Publishing James Gardner Department of Mathematics and Computer Science Emory University [email protected]



HIDE: Privacy Preserving Medical Data Publishing

James Gardner
Department of Mathematics and Computer Science

Emory University
[email protected]


Motivation

• De-identification is critical in any health informatics system

• Research

• Sharing

• Need an easy-to-use interface and framework for data custodians and publishers

• Understanding data is necessary to de-identify data


HIPAA

1. Names;

2. All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.

3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;

4. Phone numbers;

5. Fax numbers;

6. Electronic mail addresses;

7. Social Security numbers;

8. Medical record numbers;

9. Health plan beneficiary numbers;

10. Account numbers;

11. Certificate/license numbers;

12. Vehicle identifiers and serial numbers, including license plate numbers;

13. Device identifiers and serial numbers;

14. Web Universal Resource Locators (URLs);

15. Internet Protocol (IP) address numbers;

16. Biometric identifiers, including finger and voice prints;

17. Full face photographic images and any comparable images; and

18. Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data)
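Items 2 and 3 are the two rules in this list that generalize a value rather than remove it. A minimal sketch of both follows; the function names and the population table are hypothetical, and the counts are illustrative values, not Census data.

```python
def generalize_zip(zip_code: str, zip3_population: dict[str, int]) -> str:
    """Keep the first three digits only if that area has more than 20,000 people."""
    prefix = zip_code[:3]
    # Unknown prefixes are treated as small areas, which is the cautious choice.
    if zip3_population.get(prefix, 0) > 20_000:
        return prefix
    return "000"


def generalize_age(age: int) -> str:
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)


# Hypothetical population counts per 3-digit prefix (assumption, not real data).
ZIP3_POPULATION = {"537": 250_000, "036": 12_000}

print(generalize_zip("53711", ZIP3_POPULATION))
print(generalize_zip("03601", ZIP3_POPULATION))
print(generalize_age(93))
```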


PHI Summary

• Protected Health Information (PHI) is defined by HIPAA as individually identifiable health information

• Direct identifiers include name, SSN, etc.

• Indirect identifiers include gender, age, address information, etc.


Research Challenges

• Detect PHI in heterogeneous medical data

• Apply structured anonymization principles on heterogeneous medical data (micro-privacy)

• Release differentially private aggregated statistics (macro-privacy)


HIDE

• Health Information DE-identification

• Uses techniques from

• Information Extraction

• Data linking

• Structured Anonymization

• Differential Privacy

• Data Mining


HIDE


Outline

• Background and related work

• Existing de-identification approaches

• Named entity recognition

• Privacy preserving data publishing

• Proposed Work

• HIDE framework

• Identifying and sensitive information extraction

• Micro-data publishing

• Macro-data publishing

• Software


Alternative Systems

• Scrub System - rules and dictionaries are used to detect PHI

• Semantic Lexicon System - rules and dictionaries are used to detect PHI

• DE-ID - rules and dictionaries, developed at Pittsburgh and approved by IRB

• Concept-Match Scrubber - removes every word not in an approved list of non-identifying terms

• Carafe - uses a CRF to detect PHI


Limitations of Most Systems

• Lack portability

• Don't give formal privacy guarantees

• Don't utilize the latest work from structured data anonymization

• Focus only on removing PHI


Named Entity Recognition

• Locate and classify atomic elements in text into predefined categories such as person, organization, location, expressions of time, quantities, etc.

• NER systems can be classified into either:

• Rule-based

• Machine Learning-based


NER Examples

• Part-of-speech (POS) Tagging

• I/PRP think/VBP it/PRP 's/BES a/DT pretty/RB good/JJ idea/NN ./.

• Personal Health Identifier Detection

• <age>77</age> year old <gender>female</gender> with history of <disease>B-cell lymphoma</disease> (Marginal zone, <mrn>SH-04-4444</mrn>)


NER Metrics

• Precision

• TP / (TP + FP)

• Recall

• TP / (TP + FN)
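The two formulas above can be computed directly from true-positive, false-positive, and false-negative counts. The sketch below also includes the F-score reported on later slides, assuming the usual harmonic mean of precision and recall (the slides do not spell out the formula); the counts in the example are made up.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted PHI tokens that really are PHI."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of real PHI tokens that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up counts for illustration.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
print(p, r, f_score(p, r))
```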


Rule-based

• Rely on hand-coded rules and dictionaries

• Dictionaries can be used for terms in a closed class with an exhaustive list, e.g. geographic locations

• Regular expressions are used to detect terms that follow certain syntactic patterns, e.g. phone numbers
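As a rough illustration of the two rule types, the sketch below pairs a tiny dictionary lookup with one regular expression. The dictionary contents, the phone pattern, and the function name are assumptions made for this example and are not taken from any of the systems discussed.

```python
import re

# Tiny illustrative dictionary; a real system would load an exhaustive list.
STATE_NAMES = {"georgia", "ohio", "texas"}

# Simplified US-style phone pattern (assumption): 3-3-4 digits with separators.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def find_phi(text: str) -> list[tuple[str, str]]:
    """Return (matched text, category) pairs found by the two rules."""
    hits = [(m.group(), "phone") for m in PHONE_RE.finditer(text)]
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in STATE_NAMES:
            hits.append((token, "location"))
    return hits


print(find_phi("Patient from Georgia, call 404-555-0100."))
```

Adding a new category here means editing code or dictionaries by hand, which is the portability cost the later comparison slide attributes to rule-based systems.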


Machine learning-based

• Model the NER as a sequence labeling task where each token is assigned a label

• Train classifiers to label each token

• Classifiers use a list of features (or attributes) for training and classification of the sequence

• Frequently applied classifiers are HMM, MEMM, SVM, and CRF


Conditional Random Field

• A Conditional Random Field (CRF) provides a probabilistic framework for labeling and segmenting sequential data

• A CRF defines a conditional probability of a label sequence given an observation sequence
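For reference, the conditional probability is commonly written as follows for a linear-chain CRF. This is the standard textbook form, given here as an assumption; the slides do not state which parameterization HIDE uses. Here $\mathbf{x}$ is the observation (token) sequence, $\mathbf{y}$ the label sequence, $f_k$ are feature functions, and $\lambda_k$ their learned weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

The normalizer $Z(\mathbf{x})$ sums over all candidate label sequences, so the model scores whole sequences rather than individual tokens.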


Comparison

• Rule-based

• Accurate

• Require experts to modify

• Not portable

• Machine learning-based

• Accurate

• Modification of models is done through training rather than “coding”

• Portable


Privacy Preserving Data Publishing

• Weak privacy (Micro)

• release a modified version of each record according to a given anonymization principle

• assumes level of background knowledge

• Differential privacy (Macro)

• release perturbed statistics that satisfy the differential privacy principle

• no assumptions of background knowledge


Micro-data publishing

• Prevent linking of records in separate databases

• k-anonymization

• Prevent discovery of sensitive values

• l-diversity

• Prevent discovery of presence or absence in a database

• delta-presence


Micro-data publishing

decisions [6, 16, 10] in our future research agenda.

3.5. Anonymization

Once the identifier view is generated after attribute extraction and linking, we can perform attribute removal (suppression) to allow full de-identification (as possible) and partial de-identification. We also allow statistical de-identification through anonymization techniques through attribute generalization that guarantees privacy based on a privacy principle while maintaining maximum data utility. Among the many privacy principles or criteria, k-anonymity [34] and its extension l-diversity [23] are the two most widely accepted and serve as the basis for many others, and hence, are used in our initial work. Below we illustrate the basic ideas behind these principles and present the anonymization approach we used.

Table 1: Illustration of Anonymization

Name   Age  Gender  Zipcode  Diagnosis
Henry  25   Male    53710    Influenza
Irene  28   Female  53712    Lymphoma
Dan    28   Male    53711    Bronchitis
Erica  26   Female  53712    Influenza

Original Data

Name  Age      Gender  Zipcode        Disease
*     [25-28]  Male    [53710-53711]  Influenza
*     [25-28]  Female  53712          Lymphoma
*     [25-28]  Male    [53710-53711]  Bronchitis
*     [25-28]  Female  53712          Influenza

Anonymized Data

In defining anonymization given a relational table T, the attributes are characterized into three types. Unique identifiers are attributes that identify individuals. A quasi-identifier set is a minimal set of attributes that can be joined with external information to re-identify individual records. We assume that a quasi-identifier is recognized based on the domain knowledge. Sensitive attributes are those attributes that an adversary should not be permitted to uniquely associate their values with a unique identifier. Table 1 illustrates an original relational table of personal information where Name is considered as an identifier, (Age, Gender, Zipcode) a quasi-identifier set, and Diagnosis a sensitive attribute.




k-anonymization

• Quasi identifier set

• Sensitive attributes

• Table is k-anonymous if every record has k-1 other records with the same quasi-identifier set

• The probability of linking a victim to a specific record through QID is at most 1/k
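The definition can be checked mechanically by grouping records on their quasi-identifier values and comparing the smallest group size with k. The sketch below applies this to the anonymized rows of Table 1; the function name and record layout are illustrative choices.

```python
from collections import Counter


def is_k_anonymous(records, qid, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return all(size >= k for size in groups.values())


QID = ("Age", "Gender", "Zipcode")
ANONYMIZED = [
    {"Age": "[25-28]", "Gender": "Male", "Zipcode": "[53710-53711]", "Diagnosis": "Influenza"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712", "Diagnosis": "Lymphoma"},
    {"Age": "[25-28]", "Gender": "Male", "Zipcode": "[53710-53711]", "Diagnosis": "Bronchitis"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712", "Diagnosis": "Influenza"},
]

print(is_k_anonymous(ANONYMIZED, QID, 2))
print(is_k_anonymous(ANONYMIZED, QID, 3))
```

Each quasi-identifier group in Table 1 holds two records, so the table is 2-anonymous but not 3-anonymous, and the linking probability bound is 1/2.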


l-diversity

• Extension of k-anonymization

• Also ensures that each group has at least l distinct sensitive values

• Prevents disclosure of sensitive values
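A matching check for the distinct-values form of l-diversity described above counts the different sensitive values inside each quasi-identifier group. The rows are again the anonymized rows of Table 1; the names are illustrative.

```python
from collections import defaultdict


def is_l_diverse(records, qid, sensitive, l):
    """True if every quasi-identifier group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qid)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())


COLUMNS = ("Age", "Gender", "Zipcode", "Diagnosis")
ROWS = [
    ("[25-28]", "Male", "[53710-53711]", "Influenza"),
    ("[25-28]", "Female", "53712", "Lymphoma"),
    ("[25-28]", "Male", "[53710-53711]", "Bronchitis"),
    ("[25-28]", "Female", "53712", "Influenza"),
]
RECORDS = [dict(zip(COLUMNS, row)) for row in ROWS]
QID = ("Age", "Gender", "Zipcode")

print(is_l_diverse(RECORDS, QID, "Diagnosis", 2))
print(is_l_diverse(RECORDS, QID, "Diagnosis", 3))
```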


Macro-data publishing

• Differential Privacy is a strong privacy notion

• Requires that a randomized computation yields nearly identical output when performed on nearly identical input

• Interactive model

• limited to a specific number of queries

• Non-interactive model

• need query strategies to build noisy data cubes that maximize utility for a random query workload
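One common way to meet this requirement for counts is the Laplace mechanism, sketched below for a histogram. The slides do not name the mechanism HIDE uses, so this is a swapped-in standard technique, and the function names are hypothetical. It assumes neighboring databases differ by one record, so each bin count has sensitivity 1 and noise of scale 1/epsilon suffices.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_histogram(counts: dict, epsilon: float, rng: random.Random = None) -> dict:
    """Add independent Laplace noise of scale 1/epsilon to every bin count."""
    rng = rng or random.Random()
    scale = 1.0 / epsilon  # sensitivity 1: one record changes one bin by 1
    return {bin_: c + laplace_noise(scale, rng) for bin_, c in counts.items()}


# Made-up counts for illustration.
print(private_histogram({"Influenza": 2, "Lymphoma": 1, "Bronchitis": 1}, epsilon=0.5))
```

Smaller epsilon means larger noise and stronger privacy, which is the utility trade-off the non-interactive query strategies try to manage.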


Differentially Private Histogram Release

[Diagram labels: Original Data, Differentially Private Interface, Query Strategy, Pre-designed Queries, Diff. Private Answers, Diff. Private Histogram, Workload, User, Queries, Answers]

• Differentially private histogram release for random predicate queries


HIDE Framework

• Identifying and Sensitive Information Extraction

• uses state-of-the-art CRF model to extract PHI and sensitive information

• Data linking

• provides structured patient-centric view of the data

• De-identification and Anonymization

• Micro-data publication - uses data suppression and generalization to provide a k-anonymized view of the data

• Macro-data publication - release perturbed aggregated statistics from the patient-centric view


HIDE


Identifying and sensitive information extraction

• Use CRF classifier to extract information

• Studied impact of features including:

• regular expressions

• affixes

• dictionaries

• context

• Sampling techniques to adjust classifier for higher precision or recall


Example

Token     Label
77        B-age
year      O
old       O
female    B-gender
with      O
history   O
of        O
B         B-disease
-         I-disease
cell      I-disease
lymphoma  I-disease
(         O
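The labels in this example follow the B-/I-/O convention: the first token of an entity gets a B- prefix, continuation tokens get I-, and everything else is O. A small sketch that derives these labels from annotated token spans is shown below; the span representation (start index, exclusive end index, type) is an assumption made for the example.

```python
def bio_labels(tokens, spans):
    """Convert (start, end_exclusive, type) token spans into B-/I-/O labels."""
    labels = ["O"] * len(tokens)
    for start, end, kind in spans:
        labels[start] = f"B-{kind}"
        for i in range(start + 1, end):
            labels[i] = f"I-{kind}"
    return labels


TOKENS = ["77", "year", "old", "female", "with", "history",
          "of", "B", "-", "cell", "lymphoma", "("]
SPANS = [(0, 1, "age"), (3, 4, "gender"), (7, 11, "disease")]

for token, label in zip(TOKENS, bio_labels(TOKENS, SPANS)):
    print(token, label)
```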


Regular Expressions

Regular Expression                              Name
^[A-Za-z]$                                      ALPHA
^[A-Z].*$                                       INITCAPS
^[A-Z][a-z].*$                                  UPPER-LOWER
^[A-Z]+$                                        ALLCAPS
^[A-Z][a-z]+[A-Z][A-Za-z]*$                     MIXEDCAPS
^[A-Za-z]$                                      SINGLECHAR
^[0-9]$                                         SINGLEDIGIT
^[0-9][0-9]$                                    DOUBLEDIGIT
^[0-9][0-9][0-9]$                               TRIPLEDIGIT
^[0-9][0-9][0-9][0-9]$                          QUADDIGIT
^[0-9,]+$                                       NUMBER
[0-9]                                           HASDIGIT
^.*[0-9].*[A-Za-z].*$                           ALPHANUMERIC
^.*[A-Za-z].*[0-9].*$                           ALPHANUMERIC
^[0-9]+[A-Za-z]$                                NUMBERS LETTERS
^[A-Za-z]+[0-9]+$                               LETTERS NUMBERS
-                                               HASDASH
'                                               HASQUOTE
/                                               HASSLASH
^[`~!@#$%\^&*()\-=_+\[\]{}|;':\",./<>?]+$       ISPUNCT
(-|\+)?[0-9,]+(\.[0-9]*)?%?$                    REALNUMBER
^-.*                                            STARTMINUS
^\+.*$                                          STARTPLUS
^.*%$                                           ENDPERCENT
^[IVXDLCM]+$                                    ROMAN
^\s+$                                           ISSPACE

Table 1: List of regular expression features used in HIDE
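To show how such a table turns into token features, the sketch below compiles a handful of the patterns above and returns the names of those that match a token. Only a subset of the table is included, and the function name is a hypothetical choice; how HIDE actually wires these into its CRF is not shown in the slides.

```python
import re

# A subset of the patterns from the table, copied as listed there.
REGEX_FEATURES = [
    (r"^[A-Z].*$", "INITCAPS"),
    (r"^[A-Z]+$", "ALLCAPS"),
    (r"^[0-9][0-9]$", "DOUBLEDIGIT"),
    (r"^[0-9,]+$", "NUMBER"),
    (r"[0-9]", "HASDIGIT"),
    (r"-", "HASDASH"),
    (r"^[IVXDLCM]+$", "ROMAN"),
]
COMPILED = [(re.compile(pattern), name) for pattern, name in REGEX_FEATURES]


def regex_features(token: str) -> list[str]:
    """Return the names of all patterns that match the token."""
    return [name for pattern, name in COMPILED if pattern.search(token)]


for token in ["77", "SH-04-4444", "B"]:
    print(token, regex_features(token))
```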


Affixes

• Prefixes

• Suffixes

• All affixes up to size 3
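A minimal sketch of affix features, assuming "up to size 3" means every prefix and suffix of length 1 to 3 (shorter tokens contribute only the lengths they have). The feature naming scheme is hypothetical.

```python
def affix_features(token: str, max_len: int = 3) -> list[str]:
    """Return prefix and suffix features of length 1..max_len for a token."""
    feats = []
    for n in range(1, min(max_len, len(token)) + 1):
        feats.append(f"PRE{n}={token[:n]}")
        feats.append(f"SUF{n}={token[-n:]}")
    return feats


print(affix_features("lymphoma"))
print(affix_features("of"))
```

Suffixes such as "oma" are the kind of signal these features can give a classifier for disease names.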


Dictionaries

• Company Names

• Male First Names

• Female First Names

• Last Names

• State Names

• State Abbreviations

• Hospital Names


Context

• Previous 4 words

• Next 4 words

• Occurrence counts
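The context features could be gathered roughly as follows. The padding marker, the feature names, and the reading of "occurrence counts" as the number of times the token appears in the same document are all assumptions for illustration.

```python
from collections import Counter


def context_features(tokens: list[str], i: int, window: int = 4) -> dict:
    """Collect the previous/next `window` tokens and an occurrence count for tokens[i]."""
    feats = {}
    for d in range(1, window + 1):
        feats[f"PREV{d}"] = tokens[i - d] if i - d >= 0 else "<PAD>"
        feats[f"NEXT{d}"] = tokens[i + d] if i + d < len(tokens) else "<PAD>"
    # Assumed meaning of "occurrence counts": frequency of this token in the document.
    feats["COUNT"] = Counter(tokens)[tokens[i]]
    return feats


print(context_features(["77", "year", "old", "female"], 1))
```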


Feature Vectors

Token     CAPS?  SPECIAL?  PREVIOUS  NEXT      LABEL
77        N      Y         ?         year      B-age
year      N      N         77        old       O
old       N      N         year      female    O
female    N      N         old       with      B-gender
with      N      N         female    history   O
history   N      N         with      of        O
of        N      N         history   B         O
B         Y      N         of        -         B-disease
-         N      Y         B         cell      I-disease
cell      N      N         -         lymphoma  I-disease
lymphoma  N      N         cell      (         I-disease


Features Set Results

• 220 re-identified pathology reports for i2b2 task

• 10-fold cross-validation


Features Set Results

Feature set  Precision  Recall  F-Score
d            0.562      0.623   0.591
r            0.745      0.832   0.786
rd           0.749      0.839   0.792
a            0.788      0.847   0.816
ad           0.792      0.853   0.821
ra           0.811      0.868   0.838
rad          0.81       0.868   0.838
c            0.944      0.967   0.955
cd           0.948      0.969   0.958
ac           0.956      0.975   0.965
acd          0.958      0.977   0.967
rc           0.961      0.982   0.971
rac          0.962      0.982   0.972
racd         0.962      0.982   0.972
rcd          0.963      0.984   0.973

(Feature-set codes appear to combine r = regular expressions, a = affixes, d = dictionaries, c = context. The original chart's vertical axis ran from 0.5 to 1.)


Sampling

• Honest brokers are concerned more about recall than precision

• Cost proportionate rejection sampling is often used for boosting

• Training examples are selected based on the associated cost of missing that label


Random O-Sampling

• Keep all non-"O" labels

• Select "O" labels with given probability

• Biases the classifier to select a label other than "O"
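Random O-sampling, as described on this slide, can be sketched as a filter over labeled training tokens. The token-level granularity, the function name, and the parameter name `keep_prob` are assumptions; the slides do not say at which level HIDE samples.

```python
import random


def random_o_sample(labeled_tokens, keep_prob, rng=None):
    """Keep every non-"O" token; keep each "O" token with probability keep_prob."""
    rng = rng or random.Random()
    return [(tok, lab) for tok, lab in labeled_tokens
            if lab != "O" or rng.random() < keep_prob]


DATA = [("77", "B-age"), ("year", "O"), ("old", "O"), ("female", "B-gender")]
print(random_o_sample(DATA, keep_prob=0.5, rng=random.Random(0)))
```

Lower values of `keep_prob` remove more "O" examples from training, which pushes the classifier toward PHI labels.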


Random O-Sampling

[Figure: two panels, i2b2 and PhysioNet, plotting prec, recall, and f-score against Sample Probability (0 to 1).]


Window Sampling

• Select all non-"O" labels

• Select all terms within given window of any non-"O" label
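A sketch of window sampling under the assumption that the window is symmetric, covering `window` tokens on each side of every non-"O" token. The slides do not specify the window shape, and the function name is hypothetical.

```python
def window_sample(labeled_tokens, window):
    """Keep non-"O" tokens plus all tokens within `window` positions of one."""
    keep = set()
    n = len(labeled_tokens)
    for i, (_, label) in enumerate(labeled_tokens):
        if label != "O":
            keep.update(range(max(0, i - window), min(n, i + window + 1)))
    return [labeled_tokens[i] for i in sorted(keep)]


DATA = [("a", "O"), ("b", "O"), ("77", "B-age"), ("c", "O"), ("d", "O"), ("e", "O")]
print(window_sample(DATA, window=1))
```

Unlike random O-sampling, the retained "O" tokens are the ones that sit next to PHI, so the local context around each entity is preserved.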


Window Sampling

[Figure: two panels, i2b2 and PhysioNet, plotting Precision, Recall, and F-Score against History Size (0 to 80).]


Information Extraction Conclusion

• HIDE has a fast and accurate CRF for detecting PHI

• Feature engineering has been explored in great detail

• Window Sampling can be used to adjust recall with minimal impact on precision

• Impact of training data size


Micro-data publishing

• Release patient-centric view or original data with suppressed or generalized values

• Apply k-anonymization and l-diversity principles to unstructured data

• Evaluate query accuracy on real medical data


Micro-data publishing

• Full de-identification

• Remove all identifiers

• Partial de-identification

• Remove direct identifiers

• Statistical de-identification

• Statistical anonymization


Statistical Anonymization

• Partition the original data points into groups that will all share the same values with respect to QID

• Use the multi-dimensional Mondrian algorithm to release a k-anonymized and l-diverse version of the structured patient-centric view


Partitioning

[Figure 4. Spatial representation of Patients and partitionings (quasi-identifiers Zipcode and Age). Panels: (a) Patients, (b) Single-Dimensional, (c) Strict Multidimensional. Axes: Age (25 to 28) and Zipcode (53710 to 53712).]

that $C_{AVG} \le 1$? To prove this claim, we show that there is a solution to the k-anonymous multidimensional partitioning problem for P if and only if there is a solution to the partition problem for A.

Suppose there exists a k-anonymous multidimensional partitioning for P. This partitioning must define two multidimensional regions, $R_1$ and $R_2$, such that $\sum_{p \in R_1} count(p) = \sum_{p \in R_2} count(p) = k = \frac{\sum a_i}{2}$, and possibly some number of empty regions. By the strictness property, these regions must not overlap. Thus, the sum of counts for the two non-empty regions constitute the sum of integers in two disjoint complementary subsets of A, and we have an equal partitioning of A.

In the other direction, suppose there is a solution to the partition problem for A. For each binary partitioning of A into disjoint complementary subsets $A_1$ and $A_2$, there is a multidimensional partitioning of P into regions $R_1, ..., R_n$ such that $\sum_{p \in R_1} count(p) = \sum_{a_i \in A_1} a_i$, $\sum_{p \in R_2} count(p) = \sum_{a_i \in A_2} a_i$, and all other $R_i$ are empty: $R_1$ is defined by two points, the origin and the point p having ith coordinate 1 when $a_i \in A_1$ and 0 otherwise. The bounding box for $R_1$ is closed at all edges and vertices. $R_2$ is defined by the origin and the point p having ith coordinate = 1 when $a_i \in A_2$, and 0 otherwise. The bounding box for $R_2$ is open at the origin, but closed on all other edges and vertices. $C_{AVG}$ is the average sum of counts for the non-empty regions, divided by k. In this construction, $C_{AVG} = 1$, and $R_1, ..., R_n$ is a k-anonymous multidimensional partitioning of P.

Finally, a given solution to the decisional k-anonymous multidimensional partitioning problem can be verified in polynomial time by scanning the input set of (point, count) pairs, and maintaining a sum for each region. ∎

2.3. Bounds on Partition Size

It is also interesting to consider worst-case upper bounds on the size of partitions resulting from single-dimensional and multidimensional partitioning. This section presents two results, the first of which indicates that for a constant-sized quasi-identifier, this upper bound depends only on k and the maximum number of duplicate copies of a single point (Theorem 2). This is in contrast to the second result (Theorem 3), which indicates that for single-dimensional partitioning, this bound may grow linearly with the total number of points.

In order to state these results, we first define some terminology. A multidimensional cut for a multiset of points is an axis-parallel binary cut producing two disjoint multisets of points. Intuitively, such a cut is allowable if it does not cause a violation of k-anonymity.

Allowable Multidimensional Cut: Consider multiset P of points in d-dimensional space. A cut perpendicular to axis $X_i$ at $x_i$ is allowable if and only if $Count(P.X_i > x_i) \ge k$ and $Count(P.X_i \le x_i) \ge k$.

A single-dimensional cut is also axis-parallel, but considers all regions in the space to determine allowability.

Allowable Single-Dimensional Cut: Consider a multiset P of points in d-dimensional space, and suppose we have already made S single-dimensional cuts, thereby separating the space into disjoint regions $R_1, ..., R_m$. A single-dimensional cut perpendicular to $X_i$ at $x_i$ is allowable, given S, if $\forall R_j$ overlapping line $X_i = x_i$, $Count(R_j.X_i > x_i) \ge k$ and $Count(R_j.X_i \le x_i) \ge k$.

Notice that recursive allowable multidimensional cuts will result in a k-anonymous strict multidimensional partitioning for P (although not all strict multidimensional partitionings can be obtained in this way), and a k-anonymous single-dimensional partitioning for P is obtained through successive allowable single-dimensional cuts.

For example, in Figures 4(b) and (c), the first cut occurs on the Zipcode dimension at 53711. In the multidimensional case, the left-hand side is cut again on the Age dimension, which is allowable because it does not produce a region containing fewer than k points. In the single-dimensional case, however, once the first cut is made, there are no remaining allowable single-dimensional cuts. (Any cut perpendicular to the Age axis would result in a region on the right containing fewer than k points.)

Intuitively, a partitioning is considered minimal when there are no remaining allowable cuts.



Mondrian algorithm

• Greedy top-down partitioning approach

• Choose dimension with maximum range

• Split at median if each newly created partition still satisfies k-anonymization and l-diversity
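The three steps above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not HIDE's implementation: it handles numeric quasi-identifiers only, enforces k-anonymity but omits the l-diversity check, uses unnormalized ranges, and falls back to the next-widest dimension when the median cut on the widest one is not allowable.

```python
def mondrian(records, qid, k):
    """Greedy top-down partitioning; returns a list of partitions of size >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def span(attr):
        values = [r[attr] for r in records]
        return max(values) - min(values)

    # Try dimensions from widest to narrowest range.
    for attr in sorted(qid, key=span, reverse=True):
        values = sorted(r[attr] for r in records)
        median = values[(len(values) - 1) // 2]
        left = [r for r in records if r[attr] <= median]
        right = [r for r in records if r[attr] > median]
        # The cut is allowable only if both sides keep at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, qid, k) + mondrian(right, qid, k)
    return [records]  # no allowable cut remains


# Age and Zipcode values of the four patients from the anonymization example.
PATIENTS = [
    {"Name": "Henry", "Age": 25, "Zipcode": 53710},
    {"Name": "Irene", "Age": 28, "Zipcode": 53712},
    {"Name": "Dan", "Age": 28, "Zipcode": 53711},
    {"Name": "Erica", "Age": 26, "Zipcode": 53712},
]

for part in mondrian(PATIENTS, ("Age", "Zipcode"), k=2):
    print([r["Name"] for r in part])
```

Each returned partition would then be generalized to the range of values it covers on every quasi-identifier attribute.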


Example

[Figure 11. Anonymizations for two attributes with a discrete normal distribution (µ = 25, σ = .2). Panels: (a) Optimal single-dimensional partitioning and (b) Greedy strict multidimensional partitioning, each shown for k = 50, k = 25, and k = 10, with both axes running from 0 to 50.]

Predicate on X

k   Model   Mean Error  Std. Dev.
10  Single  7.73        5.94
10  Multi   4.66        3.26
25  Single  12.68       7.17
25  Multi   5.69        3.86
50  Single  7.73        5.94
50  Multi   7.94        5.87

Predicate on Y

k   Model   Mean Error  Std. Dev.
10  Single  3.18        2.56
10  Multi   4.03        3.44
25  Single  5.06        4.17
25  Multi   5.67        3.80
50  Single  8.25        6.15
50  Multi   8.06        5.58

Figure 12. Error for count queries with single-attribute selection predicates

ally produced the best results.

6.3. Workload-Based Quality

We also compared the optimal single-dimensional and greedy multidimensional partitioning algorithms with respect to a simple query workload, using a synthetic data set containing 1000 tuples, with two quasi-identifier attributes (discrete normal, each with cardinality 50, µ = 25, σ = .2). Visual representations of the resulting partitionings are given in Figures 11(b) and 11(a).

Multidimensional partitioning does an excellent job at capturing the underlying multivariate distribution. In contrast, we observed that for non-uniform data and small k, single-dimensional partitioning tends to reflect the distribution of just one attribute. However, the optimal single-dimensional anonymization is quite sensitive to the underlying data, and a small change to the synthetic data set often dramatically changes the resulting anonymization.

This tendency to "linearize" attributes has an impact on query processing over the anonymized data. Consider a simple workload for this two-attribute data set, consisting of queries of the form "SELECT COUNT(*) WHERE {X, Y} = value", where X and Y are the quasi-identifier attributes, and value is an integer between 0 and 49. (In Figures 11(a) and 11(b), X and Y are displayed on the horizontal and vertical axes.) We evaluated the set of queries of this form over each anonymization and the original data set. When a predicate did not match any partition, we assumed a uniform distribution within each partition.

For each anonymization, we computed the mean and standard deviation of the absolute error over the set of queries in the workload. These results are presented in Figure 12. As is apparent from Figures 11(a) and 11(b), and from the error measurements, queries with predicates on Y are more accurately answered from the single-dimensional anonymization than are queries with predicates on X. The observed error is more consistent across queries using the multidimensional anonymization.

7. Related Work

Many recoding models have been proposed in the literature for guaranteeing k-anonymity. The majority of these


• More precise partitions are possible with smaller k


Query Accuracy

• 100 pathology reports

• 10,000 random queries

• age > n

• age < n
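One way such a range query could be answered from generalized data is to assume ages are spread uniformly within each released range. The slides do not say how HIDE evaluates these queries, so the sketch below is only an illustration of that assumption; the function name and the (low, high, size) group representation are hypothetical.

```python
def estimate_count_age_gt(groups, n):
    """Estimate COUNT(age > n) from generalized groups of integer ages.

    groups: list of (low, high, size), meaning `size` records released
    with the age range [low, high]. Ages are assumed uniform in each range.
    """
    total = 0.0
    for low, high, size in groups:
        width = high - low + 1
        # Number of integer ages in [low, high] that exceed n.
        matching = max(0, high - max(low - 1, n))
        total += size * matching / width
    return total


# The single age range [25-28] covering the four patients in the earlier example.
print(estimate_count_age_gt([(25, 28, 4)], 26))
```

Wider ranges, which come with larger k, make the uniform assumption coarser and the estimates less accurate.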


Query Accuracy

[Figure: Query Precision (%) (vertical axis, 50 to 100) versus k (horizontal axis, 10 to 100) for Statistical De-identification, Partial De-identification, and Full De-identification.]


Macro-data publishing

• Differentially private data publishing (DPDP) module in HIDE

• Create differentially private data cube where each dimension represents a statistic over the patient-centric view

• Partitioning algorithm based on information gain to maximize level of utility of differentially private data cube

• Consistency algorithm to enhance utility


Differentially Private Interface

[Figure: Differentially Private Histogram Release. Diagram components: original data, differentially private interface, query strategy, pre-designed queries, differentially private answers, differentially private histogram, user, queries (workload), answers.]

• Differentially private histogram release for random predicate queries


DPDP

• DPDP considers:

• Access to original data

• Partitioning of the original database that best satisfies the workload of queries

• Level of differential privacy of data cube

• Level of utility (or noise) in the released data cube


Access to original database

• Every time the original database is accessed we use some of the privacy budget

• Access the original database in a differentially private manner

• Minimize the number of times the original data is queried, which minimizes the amount of noise we must add to the results
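The budget accounting described on this slide can be illustrated with a small sketch. It assumes the standard Laplace mechanism for count queries (sensitivity 1, noise scale 1/epsilon) and sequential composition of the per-query budgets; the class name `BudgetedDatabase` and its methods are hypothetical and are not taken from HIDE.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class BudgetedDatabase:
    """Hypothetical wrapper that answers count queries and tracks the privacy budget."""

    def __init__(self, records, total_budget, seed=None):
        self._records = list(records)
        self.remaining = total_budget
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        """Answer a count query with Laplace noise and charge `epsilon` to the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # A count has sensitivity 1, so the noise scale is 1 / epsilon.
        return true_count + laplace_noise(1.0 / epsilon, self._rng)

ages = [23, 35, 41, 58, 62, 67, 71, 80]   # hypothetical patient ages
db = BudgetedDatabase(ages, total_budget=1.0, seed=0)
print(db.noisy_count(lambda age: age > 60, epsilon=0.5), db.remaining)
```

Because every access is charged, spreading a fixed budget over more queries leaves a smaller epsilon for each one and therefore more noise per answer.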


Query Strategy

• Develop a query strategy that will allow the most utility given random queries from the user

• This query strategy is accomplished by partitioning the data according to information gain


Partitioning of the original database

• Release two data cubes

• One using a cell-based algorithm that partitions the database into its individual cells and releases a perturbed count for each cell

• One using top-down multi-dimensional partitioning strategy, where each split value selection maximizes the information gain and ensures the uniformity of the data points in the partition

• A consistency algorithm is then applied to the two data cubes to increase the accuracy of the released data cubes
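The slides do not spell out the consistency algorithm. As one illustration of why two noisy cubes can be more accurate together than apart, the sketch below combines two independent noisy estimates of the same count by inverse-variance weighting. This is a generic statistical technique chosen for illustration, not necessarily what HIDE does; the function names and the example numbers are hypothetical.

```python
def laplace_variance(epsilon):
    """Variance of Laplace noise with scale 1 / epsilon."""
    return 2.0 / epsilon ** 2

def combine(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted average of two independent unbiased estimates."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    estimate = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    variance = 1.0 / (w_a + w_b)
    return estimate, variance

alpha = 1.0
# Estimate A: two noisy cells (each released with alpha / 2) summed together.
cell_sum, cell_var = 97.0, 2 * laplace_variance(alpha / 2)
# Estimate B: one noisy partition count (released with alpha / 2) covering the same cells.
part_count, part_var = 103.0, laplace_variance(alpha / 2)
print(combine(cell_sum, cell_var, part_count, part_var))
```

The combined variance is always smaller than either input variance, which is the sense in which reconciling the two cubes can increase accuracy.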


Cell partitioning

[Figure: Query Strategies: Cell Partitioning. Two grids over Age (20, 30) and Income (40K, 50K): the original cell counts (90, 50, 10, 50) and the perturbed cell counts (90', 50', 10', 50') obtained from one count query per cell (Q1: count() where Age = 20, Income = 40K; Q2: count() where Age = 20, Income = 50K; …), with privacy parameter alpha.]

• Q: select count() where Age = [20,30] and Income = 40K

• If a query predicate consists of multiple cells or partitions, it will have aggregated perturbation error

• Select count where age > 20 and age < 30

• alpha is the differential privacy parameter
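A minimal sketch of the cell-based strategy, assuming Laplace noise with scale 1/alpha is added independently to each cell count. The assignment of the counts 90, 50, 10, 50 to specific cells is a guess made for illustration, and all function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_cells(cell_counts, alpha, rng):
    """Release one noisy count per cell; cells are disjoint, so each can use the full alpha."""
    return {cell: count + laplace_noise(1.0 / alpha, rng)
            for cell, count in cell_counts.items()}

def range_count(cells, age_lo, age_hi, income):
    """Answer count() where Age in [age_lo, age_hi] and Income = income by summing cells."""
    return sum(count for (age, inc), count in cells.items()
               if age_lo <= age <= age_hi and inc == income)

def aggregated_noise_variance(num_cells, alpha):
    """Variance of the summed noise when a query covers `num_cells` independent cells."""
    return num_cells * 2.0 / alpha ** 2

# Hypothetical placement of the slide's counts in the Age x Income grid.
cell_counts = {(20, "40K"): 10, (20, "50K"): 50, (30, "40K"): 90, (30, "50K"): 50}
noisy_cells = perturb_cells(cell_counts, alpha=1.0, rng=random.Random(0))
print(range_count(noisy_cells, 20, 30, "40K"))
print(aggregated_noise_variance(2, alpha=1.0))
```

The variance of the answer grows linearly with the number of cells the predicate covers, which is the aggregated perturbation error the slide refers to.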


Multi-dimensional partitioning

[Figure: Query Strategies: Multidimensional Partitioning. Two grids over Age (20, 30) and Income (40K, 50K): the original cell counts (90, 50, 10, 50) and the published partitions with perturbed counts 90', 10', and 100'.]

• If a query predicate is contained in a published partition, the answer has to be estimated, typically based on a uniform distribution assumption. This introduces an approximation error.

• Select count where age > 20 and age < 30

• Noise is divided
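The uniform-distribution estimate can be sketched as follows, using a one-dimensional Age axis for brevity. The partition boundaries, the noisy counts, and the function names are hypothetical. A query that covers only part of a partition receives the matching fraction of the partition's noisy count, so it also receives only that fraction of the partition's noise.

```python
def overlap_len(lo1, hi1, lo2, hi2):
    """Number of integer values shared by the inclusive ranges [lo1, hi1] and [lo2, hi2]."""
    return max(0, min(hi1, hi2) - max(lo1, lo2) + 1)

def estimate_count(partitions, age_lo, age_hi):
    """Estimate a range count from noisy partition counts under a uniformity assumption.

    partitions: list of (p_lo, p_hi, noisy_count) with inclusive integer ranges.
    """
    total = 0.0
    for p_lo, p_hi, noisy_count in partitions:
        fraction = overlap_len(p_lo, p_hi, age_lo, age_hi) / (p_hi - p_lo + 1)
        total += fraction * noisy_count
    return total

# Hypothetical published partitions over Age with already-perturbed counts.
partitions = [(20, 29, 100.0), (30, 39, 40.0)]
# "age > 20 and age < 30" on integer ages is the inclusive range [21, 29].
print(estimate_count(partitions, 21, 29))
```

If the records in a partition are not actually uniform, the scaled answer is off even without any noise; that gap is the approximation error.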


Goals of partitioning strategy

• Large partitions to minimize aggregated perturbation error

• Uniform partitions to minimize approximation error

• Minimize the number of times we access the original data


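Proposed Approach

[Figure: Proposed Approach. Grids over Age (20, 30) and Income (40K, 50K) showing the original counts (90, 50, 10, 50), the perturbed cell counts (90', 50', 10', 50'), the two 50' cells merged into one partition (50'+50'), and the final partition counts (90', 10', 100').]

1. Cell partitioning queries (alpha/2)

2. Multi-dim partitioning

3. Multi-dim partitioning queries (alpha/2)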
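The three steps can be sketched end to end on a one-dimensional array of cells. This is an illustration under stated assumptions only: the midpoint split with a tolerance test is a simple stand-in for the information-gain criterion mentioned in the slides, and every name here (`split_uniform`, `release`, `tol`) is hypothetical. Step 2 works only on the noisy cell counts, so it consumes no additional budget, and the two query phases together spend alpha/2 + alpha/2 = alpha.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def split_uniform(noisy, lo, hi, tol):
    """Split the half-open range [lo, hi) at its midpoint until counts differ by at most tol."""
    segment = noisy[lo:hi]
    if hi - lo <= 1 or max(segment) - min(segment) <= tol:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return split_uniform(noisy, lo, mid, tol) + split_uniform(noisy, mid, hi, tol)

def release(counts, alpha, tol, seed=None):
    """Return (cell cube, partition cube) using half of the budget for each."""
    rng = random.Random(seed)
    half = alpha / 2.0
    # Step 1: cell partitioning queries with budget alpha / 2.
    cell_cube = [c + laplace_noise(1.0 / half, rng) for c in counts]
    # Step 2: partition using the noisy cells only (post-processing, no budget spent).
    parts = split_uniform(cell_cube, 0, len(counts), tol)
    # Step 3: query each partition on the original data with budget alpha / 2.
    part_cube = [(lo, hi, sum(counts[lo:hi]) + laplace_noise(1.0 / half, rng))
                 for lo, hi in parts]
    return cell_cube, part_cube

cell_cube, part_cube = release([90, 10, 50, 50], alpha=1.0, tol=15.0, seed=0)
print(part_cube)
```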


Utility of release

• The level of utility is measured by comparing the results of a query workload on the released differentially private data cubes with the results on a non-perturbed data cube generated from the original data

• We empirically evaluate the level of error as a function of the given privacy budget
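An empirical utility measurement of this kind might look like the sketch below. It assumes a cell-based release with Laplace noise and uses mean absolute error as the error measure; the counts, the workload, and the function name `workload_error` are hypothetical, and the slides do not state which error metric HIDE reports.

```python
import random
from statistics import mean

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def workload_error(counts, workload, alpha, trials, seed=0):
    """Mean absolute error of range-count queries answered from a perturbed cube."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        noisy = [c + laplace_noise(1.0 / alpha, rng) for c in counts]
        for lo, hi in workload:  # inclusive range of cell indexes
            truth = sum(counts[lo:hi + 1])
            answer = sum(noisy[lo:hi + 1])
            errors.append(abs(answer - truth))
    return mean(errors)

counts = [90, 50, 10, 50]                      # non-perturbed cube (toy data)
workload = [(0, 0), (0, 1), (1, 3), (0, 3)]    # hypothetical range queries
for alpha in (0.1, 0.5, 1.0, 2.0):
    print(alpha, round(workload_error(counts, workload, alpha, trials=200), 2))
```

Running the loop over several budgets gives the error-versus-budget curve: a larger alpha means less noise and a smaller error.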


Software

• Web application

• Python, Django, and CouchDB

• Interface

• Iterative labeling of documents and training underlying classifier

• Analyze accuracy of classifier on validation sets

• Classifier is super-fast CRF provided by CRFSuite


Publications

• Y. Xiao, J. Gardner, L. Xiong. DPCube: Releasing Differentially Private Data Cubes for Health Information (demo paper). In 28th IEEE International Conference on Data Engineering (ICDE), 2012

• James Gardner, Li Xiong, Fusheng Wang, Andrew Post and Joel Saltz. An evaluation of feature sets and sampling techniques for statistical de-identification of medical records. In 1st ACM International Health Informatics Symposium, 2010 (to appear).

• Li Xiong, James Gardner, Pawel Jurczyk and James J. Lu. Privacy Preserving Information Discovery on EHRs. In Information Discovery on Electronic Health Records, Ed. Vagelis Hristidis. Chapman and Hall/CRC, pp. 197–225, 2009.

• James Gardner and Li Xiong. An integrated framework for de-identifying unstructured medical data. Data and Knowledge Engineering, 68(12), pp. 1441–1451, 2009, doi:10.1016/j.datak.2009.07.006.

• James Gardner, Kanwei Li, Li Xiong and James J. Lu. HIDE: Heterogeneous Information DE-identification (demo track). 12th International Conference on Extending Database Technology (EDBT), March, 2009.

• James Gardner and Li Xiong. HIDE: A Health Information DE-identification System. In 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS), June, 2008