Towards Achieving Anonymity
Transcript of Towards Achieving Anonymity
An Zhu
Introduction
Collect and analyze personal data to infer trends and patterns.
Making the personal data "public": joining multiple sources, third-party involvement, privacy concerns.
Q: How to share such data?
Example: Medical Records
(Identifiers: SSN, Name; Sensitive Info: Disease)
SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
616  Rob    46   Hisp   94042    Arthritis
De-identified Records
Age  Race  Zipcode  Disease (Sensitive Info)
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Not Sufficient! [Sweeney '00]
(The same de-identified table, joined with a public database: the remaining columns become unique identifiers!)
Not Sufficient! [Sweeney '00]
Quasi-Identifiers (Age, Race, Zipcode); Sensitive Info (Disease).
(Joined with a public database, the quasi-identifiers act as unique identifiers!)
Anonymize the Quasi-Identifiers!
Quasi-Identifiers (Age, Race, Zipcode)  Sensitive Info (Disease)
***  ***  ***  Flu
***  ***  ***  Cold
***  ***  ***  Diabetes
***  ***  ***  Flu
***  ***  ***  Arthritis
***  ***  ***  Heart problem
***  ***  ***  Arthritis
(Joining with the public database no longer identifies anyone.)
Q: How to share such data?
Approach 1: anonymize the quasi-identifiers by suppressing information.
Privacy guarantee: anonymity. Quality: the amount of suppressed information.
Approach 2: clustering.
Privacy guarantee: cluster size. Quality: various clustering measures.
k-anonymized Table [Samarati '01]
Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Each row is identical to at least k-1 other rows.
k-anonymized Table [Samarati '01]
Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
* Cauc * Flu
* Cauc * Cold
* Cauc * Diabetes
41 Afr-A * Flu
41 Afr-A * Arthritis
* Hisp 94042 Heart problem
* Hisp 94042 Arthritis
Definition: k-anonymity
Input: a table consisting of n rows, each with m attributes (quasi-identifiers).
Output: suppress some entries such that each row is identical to at least k-1 other rows.
Objective: minimize the number of suppressed entries.
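The definition above can be checked mechanically. A minimal sketch (mine, not from the talk): a table is k-anonymous when every row's quasi-identifier tuple, with "*" marking suppressed entries, occurs at least k times.

```python
# Sketch (not from the talk): check k-anonymity of a suppressed table.
from collections import Counter

def is_k_anonymous(rows, k):
    """rows: list of quasi-identifier tuples, '*' marking suppressed entries."""
    counts = Counter(rows)
    return all(c >= k for c in counts.values())

# The suppressed medical-records table from the slides:
table = [
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("41", "Afr-A", "*"),
    ("41", "Afr-A", "*"),
    ("*", "Hisp", "94042"),
    ("*", "Hisp", "94042"),
]
print(is_k_anonymous(table, 2))  # True: every row matches at least 1 other row
print(is_k_anonymous(table, 3))  # False: the Afr-A and Hisp groups have only 2 rows
```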
Past Work and New Results
[MW '04]: NP-hardness for a large alphabet; O(k log k)-approximation.
[AFKMPTZ '05]: NP-hardness even for a ternary alphabet; O(k)-approximation; 1.5-approximation for 2-anonymity; 2-approximation for 3-anonymity.
Graph Representation
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
W(e) = Hamming distance between the two rows.
[Figure: the complete graph on vertices A-F, edge weights given by W(e)]
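The graph can be built directly from the table. A small sketch (my illustration, using the six binary rows shown above): vertices are rows, and each edge weight is the Hamming distance between its endpoints.

```python
# Sketch: the graph of the slides, where W(e) is the Hamming distance
# between the two binary rows joined by edge e.
rows = {
    "A": "001000", "B": "100101", "C": "010101",
    "D": "001000", "E": "110111", "F": "011011",
}

def hamming(u, v):
    # number of positions where the two strings differ
    return sum(a != b for a, b in zip(u, v))

edges = {(u, v): hamming(rows[u], rows[v])
         for u in rows for v in rows if u < v}
print(edges[("A", "D")])  # 0: A and D are identical rows
print(edges[("B", "C")])  # 2
```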
Edge Selection I
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
Each node selects its lightest-weight incident edge (k = 3).
[Figure: the selected edges highlighted on the graph]
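The first selection step can be sketched as follows (my illustration, reusing the six rows above; the tie-breaking rule is an assumption): every vertex picks its lightest incident edge, and the union of the picks forms a low-weight forest.

```python
# Sketch: every vertex selects its lightest incident edge
# (ties broken by vertex name); the union of picks forms a forest.
rows = {"A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011"}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

selected = set()
for u in rows:
    v = min((w for w in rows if w != u),
            key=lambda w: (hamming(rows[u], rows[w]), w))
    selected.add(tuple(sorted((u, v))))
print(sorted(selected))  # [('A', 'D'), ('A', 'F'), ('B', 'C'), ('B', 'E')]
```

The duplicate picks collapse (A and D pick each other), so at most one selected edge is charged per vertex, as the lemma below uses.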
Edge Selection II
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
For components with < k vertices, add more edges (k = 3).
[Figure: the forest after adding edges so every component has at least k vertices]
Lemma
The total weight of the selected edges is no more than OPT:
In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge.
The selection is a forest, so at most one edge is charged per vertex.
By construction, each charged edge's weight is no more than that vertex's (k-1)st lightest incident edge.
Grouping
Ideally, each connected component forms a group; anonymize the vertices within a group.
Total cost of a group is at most (total edge weight) × (number of nodes), e.g. (2+2+3+3) × 6 = 60 here.
[Figure: the component on A-F with edge weights 0, 2, 2, 3, 3]
Small groups keep the blowup at O(k).
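Anonymizing within a group amounts to suppressing every attribute on which the group's rows disagree. A small sketch (my illustration, using the group {A, D, F} from the example that follows):

```python
# Sketch: anonymize one group by suppressing every column on which
# the group's rows disagree; cost = (#suppressed columns) x (#rows).
def anonymize_group(rows):
    cols = len(rows[0])
    keep = [len({r[j] for r in rows}) == 1 for j in range(cols)]
    out = ["".join(r[j] if keep[j] else "*" for j in range(cols))
           for r in rows]
    cost = sum(1 for kept in keep if not kept) * len(rows)
    return out, cost

group = ["001000", "001000", "011011"]  # rows A, D, F
print(anonymize_group(group))  # (['0*10**', '0*10**', '0*10**'], 9)
```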
Dividing a Component
Root the tree arbitrarily. Divide whenever a sub-tree and the rest both have at least k vertices.
Aim: all remaining sub-trees have < k vertices.
Dividing a Component
Rotate (re-root) the tree if necessary.
Dividing a Component
Termination condition: each final group has at most max(2k-1, 3k-5) vertices.
An Example
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: the selected forest on A-F, with edge weights 0, 2, 2, 3, 3]
An Example
[Figure: the same component drawn as a tree over A, B, C, D, E, F, with edge weights 0, 2, 2, 3, 3]
An Example
[Figure: the tree split into two groups, {B, C, E} and {A, D, F}]
Estimated cost (total edge weight × group size): 4·3 + 3·3 = 21.
A: 0 * 1 0 * *
B: * * 0 1 * 1
C: * * 0 1 * 1
D: 0 * 1 0 * *
E: * * 0 1 * 1
F: 0 * 1 0 * *
Optimal cost: 3·3 + 3·3 = 18.
Past Work and New Results
[MW '04]: NP-hardness for a large alphabet; O(k log k)-approximation.
[AFKMPTZ '05]: NP-hardness even for a ternary alphabet; O(k)-approximation; 1.5-approximation for 2-anonymity; 2-approximation for 3-anonymity.
1.5-approximation
A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1
W(e) = Hamming distance between the two rows.
[Figure: the graph on vertices A-F with these edge weights]
Minimum {1,2}-matching
A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1
Each vertex is matched to 1 or 2 other vertices.
[Figure: a minimum {1,2}-matching, with components {A, B, D} and {C, E, F} using edges of weight 0 and 1]
Properties
Each component has 2 or 3 nodes: components with > 3 nodes are either not optimal or not possible (every vertex has degree at most 2).
Cost ≤ 2·OPT; for a binary alphabet: ≤ 1.5·OPT.
Qualities
Matched pair at distance a: OPT pays 2a; we pay 2a.
Matched triple with pairwise distances p, q, r (r the largest): OPT pays p+q+r; we pay at most 3(p+q) ≤ 2(p+q+r).
Past Work and New Results
[MW '04]: NP-hardness for a large alphabet; O(k log k)-approximation.
[AFKMPTZ '05]: NP-hardness even for a ternary alphabet; O(k)-approximation; 1.5-approximation for 2-anonymity; 2-approximation for 3-anonymity.
Open Problems
Can we improve O(k)? There is an Ω(k) gap for the graph representation.
11111111000000000000000000000000000000000000000011111111000000000000000000000000000000000000000011111111000000000000000000000000000000000000000011111111000000000000000000000000000000000000000011111111
k = 5, d = 16, c = k·d/2
1010101010101010101010101010101011001100110011001100110011001100111100001111000011110000111100001111111100000000111111110000000011111111111111110000000000000000
k = 5, d = 16, c = 2·d
Q: How to share such data?
Approach 1: anonymize the quasi-identifiers by suppressing information.
Privacy guarantee: anonymity. Quality: the amount of suppressed information.
Approach 2: clustering.
Privacy guarantee: cluster size. Quality: various clustering measures.
Clustering Approach [AFKKPTZ '06]
Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Transform into a Metric…
Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Clusters and Centers
Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Clusters and Centers
Quasi-Identifiers (Age, Race, Zipcode)  Sensitive Info (Disease)
31   Cauc   94305    Flu
                     Cold
                     Diabetes
                     Flu
41   Afr-A  94059    Arthritis
                     Heart problem
46   Hisp   94042    Arthritis
(Blank quasi-identifier cells take the value of the cluster center above.)
Measure
How good are the clusters? "Tight" clusters are better.
Minimize the max radius: Gather-k. Minimize the max distortion error (radius × num_nodes): Cellular-k.
[Figure: an example clustering with its cost under Gather-k and under Cellular-k]
Measure
Handle outliers. Constant approximations!
Comparison
k = 5 (5-anonymity): suppression must blank all entries, giving more distortion.
Clustering: can pick R5 as the center, giving less distortion; distortion is directly related to the pairwise distances.
R1: 0 1 1 1
R2: 1 0 1 1
R3: 1 1 0 1
R4: 1 1 1 0
R5: 1 1 1 1
Results [AFKKPTZ '06]
Gather-k: tight 2-approximation; extension to outliers: 4-approximation.
Cellular-k: primal-dual constant approximation; extensions as well.
2-approximation
Assume an optimal value R. Make sure each node has at least k-1 neighbors within distance 2R.
[Figure: a node A with its radius-R and radius-2R balls]
2-approximation
Assume an optimal value R.
Make sure each node has at least k-1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
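The center-selection step above can be sketched as follows (a toy illustration, not the authors' exact pseudocode), here on a 1-D point set with |x - y| as the metric:

```python
# Sketch: repeatedly pick an arbitrary remaining node as a center
# and remove every node within distance 2R of it.
def select_centers(points, dist, R):
    remaining = list(points)
    centers = []
    while remaining:
        c = remaining[0]          # arbitrary choice
        centers.append(c)
        remaining = [p for p in remaining if dist(c, p) > 2 * R]
    return centers

# 1-D toy example with |x - y| as the metric and R = 1:
pts = [0, 1, 2, 5, 6, 10]
print(select_centers(pts, lambda x, y: abs(x - y), 1))  # [0, 5, 10]
```

Any two selected centers are more than 2R apart, so each optimal cluster (diameter at most 2R) contains at most one center; that is the heart of the 2-approximation.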
Example: k = 5
Optimal Solution
[Figure: two optimal clusters, 1 and 2, each of radius R]
Center Selection
[Figure (animation): centers 1 and 2 are picked in turn; each removes all remaining nodes within distance 2R]
Reassignment
[Figure: the remaining nodes reassigned to centers 1 and 2]
Degree Constrained Matching
Each selected center must be matched to at least k-1 nodes; each remaining node is matched to exactly 1 center.
[Figure: bipartite graph between centers 1, 2 and the nodes, with degree bounds ≥ k-1 at the centers and = 1 at the nodes]
Actual Clustering
[Figure: the clusters produced around centers 1 and 2]
Optimal Clustering
[Figure: the optimal clusters 1 and 2, for comparison]
Our Guarantees
We return clusters of radius no more than 2R.
If R is guessed correctly, then reassignment is possible and each cluster has at least k nodes.
A binary search on the value of R suffices.
Binary Search on R
Assume an optimal value R.
Make sure each node has at least k-1 neighbors within distance 2R (not necessary, but useful for quick pruning).
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers: if successful, R could be smaller; otherwise, R should be larger.
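The outer loop can be sketched as follows (my illustration; `feasible` stands in for the full selection-and-reassignment test, which I only stub out here). Since the optimal R is realized by some pairwise distance, it suffices to binary-search the sorted list of candidate distances.

```python
# Sketch: the optimal R is one of the O(n^2) pairwise distances, so
# binary-search the sorted candidates, keeping the smallest R for
# which the (monotone) feasibility test succeeds.
def smallest_feasible_radius(points, dist, feasible):
    """feasible(R) -> bool: does selection/reassignment succeed at R?"""
    candidates = sorted({dist(p, q) for p in points for q in points})
    lo, hi = 0, len(candidates) - 1
    best = candidates[hi]
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            best = candidates[mid]
            hi = mid - 1      # R could be smaller
        else:
            lo = mid + 1      # R should be larger
    return best

pts = [0, 1, 2, 5, 6, 10]
d = lambda x, y: abs(x - y)
# toy feasibility stub: every point needs a neighbor within 2R (k = 2)
feas = lambda R: all(any(0 < d(p, q) <= 2 * R for q in pts) for p in pts)
print(smallest_feasible_radius(pts, d, feas))  # 2
```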
Results [AFKKPTZ '06]
Gather-k: tight 2-approximation; extension to outliers: 4-approximation.
Cellular-k: primal-dual constant approximation; extensions.
Ignore Cluster Size Constraint
Similar to Facility Location: radius × num_nodes vs. individual distance to center.
Caveat: assigning one distant node to an existing cluster increases the cost in proportion to the number of nodes in that cluster.
Each cluster is a (center, radius) pair.
Intermediate Step I
Primal-dual constant approximation for radius × num_nodes with no cluster size constraint and arbitrary cluster setup costs.
We want radius × num_nodes with a cluster size constraint and no cluster setup cost.
Enforce Cluster Size
Introduce an extra cluster setup cost: the setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k·r.
This at most doubles the actual cost of any size-constrained cluster solution, since each cluster's total cost is at least k·r.
Intermediate Step II
Shared solution! For each cluster with fewer than k nodes, additional nodes can join the cluster at no additional cost, paid for by the cluster setup cost.
Now nodes could be shared among multiple clusters.
Key: convert a "shared" solution to a disjoint solution.
Separation
Start from the smallest-radius clusters and "open" a cluster as long as there are enough nodes.
The leftover points in the remaining clusters "attach" to an intersecting smaller-radius (open) cluster.
Regroup (k = 5)
An open cluster has ≥ k nodes; an attached cluster has < k nodes.
Group clusters to create bigger ones, choosing the "fat" cluster's center as the new center.
[Figure: attached clusters of sizes 3, 2, and 4 merged with an open cluster of size 6]
What About Cluster Cost?
These clusters intersect with the open cluster, so the routing cost is only a constant blowup w.r.t. the fat radius.
We also need to make sure the merged cluster is of reasonable size.
Recap
Approach 1: anonymize the quasi-identifiers by suppressing information.
Privacy guarantee: anonymity. Quality: the amount of suppressed information.
Approach 2: clustering.
Privacy guarantee: cluster size. Quality: various clustering measures.
Thanks!