
Noise Robustness

Noise injection tests:
20,000 points
10 random multivariate Gaussian distributions
dimension = 759

Good performance in the presence of noise as high as 60%
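The noise test data can be approximated with a short generator. Everything below is an illustrative assumption rather than the poster's actual harness: the generator name make_noisy_gaussians, the cluster separation, and the uniform replacement noise model are stand-ins chosen to match the stated 20,000 points, 10 Gaussians, and 759 dimensions.

import numpy as np

def make_noisy_gaussians(n=20000, k=10, dim=759, noise_frac=0.60, seed=0):
    """Synthetic stream: k Gaussian clusters in `dim` dimensions with injected noise."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=10.0, size=(k, dim))       # well-separated cluster centers
    labels = rng.integers(0, k, size=n)
    X = centers[labels] + rng.normal(size=(n, dim))       # multivariate Gaussian clusters
    n_noise = int(noise_frac * n)                         # fraction of points replaced by noise
    noise_idx = rng.choice(n, size=n_noise, replace=False)
    X[noise_idx] = rng.uniform(X.min(), X.max(), size=(n_noise, dim))
    labels[noise_idx] = -1                                # mark injected noise points
    return X, labels

Sweeping noise_frac from 0 to 0.75 and re-scoring each run with ARI and Purity mirrors the noise plots reported on this poster.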

Scalable Big Data Clustering by Random Projection Hashing
Lee Carraher, Anindya Moitra, Sayantan Dey and Philip A Wilsey (PI)

Dept of Electrical Engineering & Computing Systems, University of Cincinnati

Security Via Random Projection

Increased Data Security and HIPAA Compliance Concerns make Commercial Cloud Computing an Expensive Resource

Random Projection, as used in RPHash, can Provide Significant Resistance to De-Anonymization

The Probability of Re-association under Random Projection is exponentially dependent upon the difference between the projection dimension and the original dimension.

Contact

Philip A. Wilsey (PI): [email protected], Lee Carraher: [email protected], Anindya Moitra: [email protected], Sayantan Dey: [email protected]

Probability of Vector De-Anonymization

$s(v, v') = \|v - v'\|_2 \quad \forall \{v, v'\} \in V,\ \exists \hat{v} \in V : s(v, v') > s(\hat{v}, v), \text{ where } \hat{v} \neq v$

$\Pr\big(\mathrm{NN}\big((v \cdot R) \cdot \hat{R}^{-1}, V\big) = v\big) \lesssim \dfrac{1}{\|V\|}$
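A quick numerical sanity check of this bound can be sketched in a few lines: project a dataset with a Gaussian random matrix R, let an attacker reconstruct with the pseudo-inverse (a stand-in for R̂⁻¹), and count how often nearest-neighbor search re-associates the original vector. The dataset, dimensions, and attack model below are illustrative assumptions, not the poster's experiment.

import numpy as np

rng = np.random.default_rng(0)
n, m, d = 1000, 759, 24                     # dataset size, original dim, projected dim (illustrative)
V = rng.normal(size=(n, m))                 # stand-in dataset of n vectors in R^m

R = rng.normal(size=(m, d)) / np.sqrt(d)    # Gaussian random projection matrix
R_pinv = np.linalg.pinv(R)                  # attacker's best linear inverse of R

hits = 0
for i in range(n):
    v_rec = (V[i] @ R) @ R_pinv             # reconstruct (v · R) · R̂⁻¹
    nearest = np.argmin(np.linalg.norm(V - v_rec, axis=1))
    hits += int(nearest == i)               # did nearest-neighbor search re-associate v?

print(f"re-association rate: {hits / n:.3f} (compare to 1/|V| = {1 / n:.3f})")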

RPHash Algorithm Diagram

Figure: Streaming RPHash Data Diagram.

Streaming RPHash Algorithm

Online Steps

x̃ := √(m/d) · pᵀx ;          // Random Projection
t := H(x̃) ;                  // LSH Hashing
C.add(t) ;                   // Update Item Count
if t ∈ M.keys then
    M[t].wadd(x) ;           // Update Centroid
else
    M[t] = new centroid(x) ; // Add New Centroid
end
M.remove(T.pop()) ;          // Remove Least Support

Offline Steps

OfflineWeightedClusterer(M.items)
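A minimal single-process Python sketch of the online steps above, under two loud substitutions: a random-hyperplane hash stands in for the lattice-based LSH decoder H(·), and an exact Counter stands in for the CM-Sketch structures C and T. The class name and parameters are hypothetical.

import numpy as np
from collections import Counter

class StreamingRPHashSketch:
    def __init__(self, m, d=24, max_centroids=200, seed=0):
        rng = np.random.default_rng(seed)
        self.p = rng.normal(size=(m, d)) / np.sqrt(d)   # m × d projection matrix P
        self.scale = np.sqrt(m / d)                     # JL scaling factor
        self.hyperplanes = rng.normal(size=(16, d))     # stand-in LSH: 16 random hyperplanes
        self.counts = Counter()                         # stand-in for CM-Sketch C
        self.centroids = {}                             # M: lsh_key -> (weight, running sum)
        self.max_centroids = max_centroids

    def _lsh(self, x_proj):
        # H(·): sign pattern of random-hyperplane projections packed into an integer key
        bits = (self.hyperplanes @ x_proj) > 0
        return int(np.packbits(bits).tobytes().hex(), 16)

    def add(self, x):
        x_tilde = self.scale * (self.p.T @ x)           # x~ := sqrt(m/d) p^T x   (Random Projection)
        t = self._lsh(x_tilde)                          # t := H(x~)              (LSH Hashing)
        self.counts[t] += 1                             # C.add(t)                (Update Item Count)
        if t in self.centroids:                         # M[t].wadd(x)            (Update Centroid)
            w, s = self.centroids[t]
            self.centroids[t] = (w + 1, s + x)
        else:                                           # M[t] = new centroid(x)  (Add New Centroid)
            self.centroids[t] = (1, np.array(x, dtype=float))
        if len(self.centroids) > self.max_centroids:    # M.remove(T.pop())       (Remove Least Support)
            weakest = min(self.centroids, key=lambda key: self.counts[key])
            del self.centroids[weakest]

    def weighted_centroids(self):
        # Input for the offline step: (weight, mean) pairs for a weighted clusterer
        return [(w, s / w) for w, s in self.centroids.values()]

Calling add(x) for each stream vector and handing weighted_centroids() to any weighted k-means routine corresponds to OfflineWeightedClusterer(M.items).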

Applications

Bio-Informatics

Medical imaging

Computer Vision

Recommender Systems

Social Network Analysis

Market research

Abstract

RPHash provides a solution to the approximate k-means clustering problem for very large distributed datasets. Distributed data models have gained popularity in recent years following the efforts of commercial, academic, and government organizations to make data more widely accessible. Due to the sheer volume of available data, in-memory single-core computation quickly becomes infeasible, requiring distributed multi-processing. Our solution achieves comparable clustering performance to other popular clustering algorithms, with improved overall complexity growth, while being amenable to distributed processing frameworks such as Map-Reduce. Our solution also maintains certain guarantees regarding data privacy and de-anonymization.

Key Features

Scalability through Naive Parallelism

Predictable Run Time

Low Inter-Process Communication

Data Security

Cloud Based Resource Allocation

Key Components

Multi-Probe Random Projection

Fast Johnson-Lindenstrauss Transform

Locality Sensitive Hashing

Universally Generative Lattices

Lossy Frequency Counting
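Of the components listed above, the Fast Johnson-Lindenstrauss Transform is the easiest to sketch. The toy version below composes a random sign flip, a Walsh-Hadamard transform, and a sparse Gaussian projection; the sparsity q, the normalization, and the dense Hadamard matrix (a real FJLT applies an O(m log m) fast transform) are illustrative assumptions, not RPHash's exact construction.

import numpy as np
from scipy.linalg import hadamard

def fjlt_sketch(x, d, q=0.1, seed=0):
    """Project x down to d dimensions with an FJLT-style P·H·D composition."""
    rng = np.random.default_rng(seed)
    m = 1 << int(np.ceil(np.log2(len(x))))            # pad the length up to a power of two
    x_pad = np.zeros(m)
    x_pad[:len(x)] = x

    D = rng.choice([-1.0, 1.0], size=m)               # D: random sign diagonal
    Hx = hadamard(m) @ (D * x_pad) / np.sqrt(m)       # H: normalized Walsh-Hadamard transform
    P = rng.normal(size=(d, m)) * (rng.random((d, m)) < q) / np.sqrt(d * q)  # P: sparse Gaussian
    return P @ Hx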

Clustering Performance Results

Clustering Algorithms

Reservoir Sampling (R-S)

DStream (DSt)

Damped Sliding Window (S-W)

Streaming K-Means (K-M)

Streaming RPHash (RPH)

External Performance Metrics

Adjusted Rand Index [-1,1]

Cluster Purity [0,1]

Variation of Information [0,∞]
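All three external metrics can be computed from ground-truth and predicted integer labels. The helper below leans on scikit-learn and scipy, which is a library choice assumed here, not something the poster states.

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import adjusted_rand_score, mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def external_metrics(labels_true, labels_pred):
    """ARI, purity, and variation of information for non-negative integer label arrays."""
    ari = adjusted_rand_score(labels_true, labels_pred)        # range [-1, 1], higher is better
    cm = contingency_matrix(labels_true, labels_pred)
    purity = cm.max(axis=0).sum() / cm.sum()                   # range [0, 1], higher is better
    # Variation of Information: H(T) + H(P) - 2·I(T; P), range [0, ∞), lower is better
    h_true = entropy(np.bincount(labels_true))
    h_pred = entropy(np.bincount(labels_pred))
    vi = h_true + h_pred - 2 * mutual_info_score(labels_true, labels_pred)
    return ari, purity, vi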

Figure: Clustering performance over the progression of input stream vectors (0 to 20,000): ARI (range [-1, 1], higher is better), Purity (range [0, 1], higher is better), and WCSSE (lower is better) for R-S, DSt, S-W, K-M, RPH, and Base.

Scalability Results

Clustering Algorithms

Reservoir Sampling (R-S)

DStream (DSt)

Damped Sliding Window (S-W)

Streaming K-Means (K-M)

Streaming RPHash (RPH)

Runtime Performance Metrics

Average Runtime Duration (seconds)

Estimated Memory Usage (MB)

Multi Core Speedup

RPHash exhibits near linear parallel speedup.

Few sequential bottlenecks.

Figure: Parallel Speedup as a Function of Number of Processors (speedup ratio versus number of processors; linear fit y = 0.855x, R² = 0.996).

RPHash Distributed Spark Implementation

Spark implementation of RPHash is based on 2-Pass RPHash.

Streaming RPHash Spark version in development.

Human Activity Recognition Dataset
Metric             2-Pass RPHash   Baseline
Silhouette Width   0.1191          0.06171
WCSSE              191922.4        200951.7
Dunn Index         0.085822        0.087755

UJIIndoor Localization Dataset
Metric             2-Pass RPHash   Baseline
Silhouette Width   -0.01337        0.06522
WCSSE              10534218        10234389
Dunn Index         0.000           0.000695
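For context, the 2-Pass approach maps naturally onto Spark RDD operations: one pass to count LSH bucket support, a second to average the vectors that fall in the heaviest buckets. The outline below is a hypothetical sketch, not the wilseypa/rphash-spark code; the lsh callable, the function name, and the bucket-selection strategy are assumptions.

import numpy as np
from pyspark import SparkContext

def two_pass_rphash(sc: SparkContext, vectors, lsh, k):
    """Illustrative 2-pass clustering: LSH bucket counting, then weighted centroids."""
    rdd = sc.parallelize(vectors).map(lambda x: (lsh(x), x)).cache()

    # Pass 1: support count per LSH bucket; keep only the k heaviest buckets
    top = (rdd.mapValues(lambda _: 1)
              .reduceByKey(lambda a, b: a + b)
              .takeOrdered(k, key=lambda kv: -kv[1]))
    top_keys = {key for key, _ in top}

    # Pass 2: sum and count the vectors in surviving buckets, emit (weight, centroid)
    sums = (rdd.filter(lambda kv: kv[0] in top_keys)
               .mapValues(lambda x: (np.asarray(x, dtype=float), 1))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .collect())
    return [(count, total / count) for _, (total, count) in sums]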

RPHash Data Structures

k - number of clusters
x ∈ R^m - data vector from the stream
H(·) - LSH function, radius = r, dim = d
P = {p_1, ..., p_n} - m × d projection matrix
M - lsh_key → centroid map
C - CM-Sketch data structure
T - CM-Sketch based εk bounded priority queue
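Both C and T above build on the Count-Min Sketch. A compact illustrative version is shown below; the width, depth, and tuple-based hash family are assumptions for this sketch rather than RPHash's parameters, and the εk bounded priority queue T would layer eviction logic on top of these estimates.

import numpy as np

class CountMinSketch:
    def __init__(self, width=2048, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = [int(s) for s in rng.integers(1, 2**61 - 1, size=depth)]  # per-row hash salts

    def _cols(self, key):
        # One column per row, from a simple salted tuple hash of the (integer) LSH key
        return [hash((salt, key)) % self.width for salt in self.salts]

    def add(self, key, count=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row, col] += count

    def count(self, key):
        # Count-Min estimate: minimum over rows; may overestimate, never underestimates
        return int(min(self.table[row, col] for row, col in enumerate(self._cols(key))))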

Open Source Software Repositories

Source available at:

github.com/wilseypa/rphash-java

Sub-Projects:

github.com/wilseypa/rphash-golang

github.com/wilseypa/rphash-reference-implementation

github.com/wilseypa/rphash-spark

Figure: Average runtime (sec) and estimated memory usage (MB) as a function of the number of dimensions (100 to 5767) for R-S, DSt, S-W, K-M, and RPH.

Figure: Noise robustness results: Scaled WCSSE (lower is better), Silhouette (range [-1, 1], higher is better), ARI (range [-1, 1], higher is better), and Purity (range [0, 1], higher is better) versus percent of noise (0 to 75%) for Base, R-S, DSt, S-W, K-M, and RPH.