A Secure Clustering Algorithm for Distributed Data Streams
![Page 1: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/1.jpg)
A Secure Clustering Algorithm for Distributed Data Streams
Geetha Jagannathan
Rutgers University
Joint work with Krishnan Pillaipakkamnatt and D. Umano
![Page 2: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/2.jpg)
Outline
The problem
Prior results
Clustering data streams
Experimental results and comparison
A privacy-preserving protocol
Conclusion
![Page 3: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/3.jpg)
The problem
Alice and Bob each have a data stream, defined on the same attributes.
(horizontal partition)
They wish to compute a clustering on the combined data.
![Page 4: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/4.jpg)
Alice's input: data stream $D_1$
Bob's input: data stream $D_2$
Output: $k$-clustering of $D_1^m \cup D_2^n$
$D_1^m$: the first $m$ elements of $D_1$
$D_2^n$: the first $n$ elements of $D_2$
![Page 5: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/5.jpg)
Clustering on joint data
Alice’s Data
k = 4
![Page 6: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/6.jpg)
Clustering on joint data
Bob’s Data
k = 4
![Page 7: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/7.jpg)
Clustering on joint data
Combined Data
k = 4
![Page 8: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/8.jpg)
Trusted third party
[Figure: Alice sends $D_1^m$ and Bob sends $D_2^n$ to the trusted third party, which returns the k-clustering to each of them.]
![Page 9: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/9.jpg)
Privacy requirements
Parties are semi-honest
Same as trusted third party
Reveals nothing but the final output
In this case – the k cluster centers
![Page 10: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/10.jpg)
Prior results
PPDM protocols convert distributed DM algorithms into private ones
The k-means algorithm is the basis for many clustering protocols [VC03, JKM05, JW05, BO07]
“Leak” intermediate information
[JPW05] presents a leak-free clustering protocol based on a new clustering algorithm.
![Page 11: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/11.jpg)
Our Contributions
A leak-free privacy-preserving protocol for distributed data streams.
A data stream clustering algorithm
Better than k-means (on average)
Comparable performance with BIRCH on many data sets, but with lower memory needs.
![Page 12: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/12.jpg)
Data Stream Algorithms
Data arrives in “stream” fashion: d1, d2, …, dn, … (the “end” of the stream is not known ahead of time).
Data is too large to fit entirely in memory.
Data can be accessed only in the order that it arrives.
Each data item can only be “read” once.
![Page 13: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/13.jpg)
The clustering algorithm
“Incrementally agglomerative”: It merges intermediate clusters without waiting for all the data to be available.
Runs in time linear in n.
![Page 14: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/14.jpg)
Overview of clustering algorithm
K = 5
Level 0 clustering
Level 1 clustering
Level 2 clustering
Output
Output expected after n = 25 data points
![Page 15: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/15.jpg)
Clustering Algorithm Outline
The algorithm maintains a list of k-clusterings (each clustering is on some partial data).
In each iteration:
Input the next k data points as a level-0 clustering.
If two clusterings at level i are in the list, “merge” them into a level-(i + 1) k-clustering.
![Page 16: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/16.jpg)
Clustering algorithm outline
If output is needed after some n points have been read, all k-clusterings are “merged” into a single k-clustering.
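The outline above can be sketched as a list that holds at most one k-clustering per level, much like the digits of a binary counter. This is only an illustration under stated assumptions: the names `stream_cluster`, `output_clustering`, and `naive_merge` are hypothetical, a cluster is represented as a `(weight, center)` pair, and `naive_merge` is a deliberately crude stand-in for the talk's merge step.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Cluster = Tuple[float, Tuple[float, ...]]   # (weight, center); assumed representation
Merge = Callable[[List[Cluster], int], List[Cluster]]


def naive_merge(clusters: List[Cluster], k: int) -> List[Cluster]:
    """Stand-in merge (assumption): fold all surplus clusters into one weighted mean."""
    kept, rest = list(clusters[:k - 1]), clusters[k - 1:]
    total = sum(w for w, _ in rest)
    dim = len(rest[0][1])
    center = tuple(sum(w * c[i] for w, c in rest) / total for i in range(dim))
    return kept + [(total, center)]


def stream_cluster(points: Iterable[Tuple[float, ...]], k: int, merge: Merge):
    """Read the stream once; return the per-level clusterings and any leftover points."""
    levels: Dict[int, List[Cluster]] = {}   # level -> the single clustering at that level
    buffer: List[Cluster] = []
    for p in points:
        buffer.append((1.0, tuple(p)))
        if len(buffer) == k:                # the next k points form a level-0 clustering
            carry, level, buffer = buffer, 0, []
            while level in levels:          # two clusterings at level i -> one at level i + 1
                carry = merge(levels.pop(level) + carry, k)
                level += 1
            levels[level] = carry
    return levels, buffer


def output_clustering(levels, buffer, k: int, merge: Merge) -> List[Cluster]:
    """On demand, merge every stored clustering into a single k-clustering."""
    everything = [c for lvl in sorted(levels) for c in levels[lvl]] + list(buffer)
    return merge(everything, k) if len(everything) > k else everything
```

With k = 5 and 25 points, five level-0 clusterings are created in total, so the list ends up holding one clustering at level 0 and one at level 2 before the final merge.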
![Page 17: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/17.jpg)
“Merging” clusterings
Have a set S of clusters, with |S| > k.
Need a set S' of k clusters.
S' = S
Repeat
    Compute the merge error for every pair of clusters in S'
    Replace the pair with the lowest error by its union
Until |S'| = k
![Page 18: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/18.jpg)
$\mathrm{Error}(C_1 \cup C_2) = C_1.\mathrm{weight} \cdot C_2.\mathrm{weight} \cdot \mathrm{dist}(C_1, C_2)^2$
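A minimal sketch of the greedy merge, using the error expression as it appears on the slide. Several details are assumptions rather than anything the slides state: clusters are `(weight, center)` pairs, `dist` is Euclidean distance between centers, and the union of two clusters takes the summed weight and the weighted mean of the centers. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Cluster = Tuple[float, Tuple[float, ...]]  # (weight, center); assumed representation


def merge_error(c1: Cluster, c2: Cluster) -> float:
    """Error(C1 U C2) = C1.weight * C2.weight * dist(C1, C2)^2, with Euclidean dist (assumption)."""
    (w1, x1), (w2, x2) = c1, c2
    squared_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return w1 * w2 * squared_dist


def union(c1: Cluster, c2: Cluster) -> Cluster:
    """Union of two clusters: summed weight, weighted-mean center (assumption)."""
    (w1, x1), (w2, x2) = c1, c2
    w = w1 + w2
    return (w, tuple((w1 * a + w2 * b) / w for a, b in zip(x1, x2)))


def merge_clusterings(clusters: List[Cluster], k: int) -> List[Cluster]:
    """Repeatedly replace the lowest-error pair by its union until k clusters remain."""
    s = list(clusters)
    while len(s) > k:
        i, j = min(
            combinations(range(len(s)), 2),
            key=lambda ij: merge_error(s[ij[0]], s[ij[1]]),
        )
        merged = union(s[i], s[j])
        s = [c for idx, c in enumerate(s) if idx not in (i, j)] + [merged]
    return s
```

For example, merging the unit-weight clusters centered at 0, 1, and 10 down to k = 2 joins the first two (error 1, against 81 and 100 for the other pairs) into a cluster of weight 2 centered at 0.5. Recomputing every pairwise error in each round keeps the sketch short; it is not tuned for speed.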
![Page 19: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/19.jpg)
Sample results (offset grid)
![Page 20: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/20.jpg)
Sample results (vs k-means)
![Page 21: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/21.jpg)
Sample result (vs. BIRCH)
![Page 22: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/22.jpg)
Realistic Data (Network Intrusion)
| Algorithm | Mem. allowed (× 24000 bytes) | ESS |
|---|---|---|
| StreamCluster | 1 | 4.1E14 |
| BIRCH | 1 | * |
| BIRCH | 2 | * |
| BIRCH | 4 | * |
| BIRCH | 32 | * |
| BIRCH | 64 | 4.8E17 |
| BIRCH | 128 | 4.8E17 |
![Page 23: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/23.jpg)
The Secure Protocol
Input: Alice owns data stream D1
Bob owns data stream D2
Output: k-clusters on $D_1^m \cup D_2^n$
1. Alice computes $O(k \log(m/k))$ cluster centers and Bob computes $O(k \log(n/k))$ cluster centers
2. Alice and Bob securely share their cluster centers
3. They securely merge clusters
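The bound on the number of centers each party holds can be motivated by a short counting argument. It assumes the list-of-levels scheme from the clustering algorithm outline, in which two clusterings at the same level are merged as soon as both are present, so the list keeps at most one clustering per level. A level-$i$ clustering then summarizes $2^i k$ points, which gives for Alice after $m$ points:

$$
2^{i} k \le m \;\Rightarrow\; i \le \left\lfloor \log_2 \frac{m}{k} \right\rfloor,
\qquad
\#\text{centers} \;\le\; k \left( \left\lfloor \log_2 \frac{m}{k} \right\rfloor + 1 \right) \;=\; O\!\left(k \log \frac{m}{k}\right).
$$

The same argument with $n$ in place of $m$ applies to Bob. As a check, with $k = 5$ and $m = 25$ the list holds clusterings at levels 0 and 2, that is 10 centers, within the bound $5 \cdot (2 + 1) = 15$.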
![Page 24: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/24.jpg)
Sample Run (Distributed non-private protocol)
![Page 25: A Secure Clustering Algorithm for Distributed Data Streams](https://reader035.fdocuments.net/reader035/viewer/2022062314/568144db550346895db1a935/html5/thumbnails/25.jpg)
Complexity
Communication complexity: $O\big((k \log(mn/k^2))^2\big)$
Non-private setting (one party sends the intermediate clusters to the other)
Communication complexity: $O(k \log(m/k))$