Random Projection, Generative Lattices, and Redundancy to Combat Scalability Bounds in Distributed Computing

Lee A. Carraher
School of Electronic and Computing Systems, University of Cincinnati
March 3, 2014

Transcript of "Random Projection, Generative Lattices, and Redundancy to Combat Scalability Bounds in Distributed Computing"

Page 1: Random Projection, Generative Lattices, and Redundancy to Combat Scalability Bounds in Distributed Computing (homepages.uc.edu/~carrahle/documents/gradPresentation.pdf)

Random Projection, Generative Lattices, and Redundancy to Combat Scalability Bounds in Distributed Computing

Lee A. Carraher

School of Electronic and Computing Systems, University of Cincinnati

March 3, 2014

Page 2:

Introduction

- From Cincinnati
- B.S. Computer Engineering (UC 2008)
- M.S. Computer Science (UC 2012)
- Ph.D. Computer Engineering (ongoing)
- Advisor: Prof. Fred Annexstein

2 / 60

Page 3:

Interests

Some research interests:
- Machine Learning (bioinformatics, filtering)
- Inverse Problems (min/max problems)
- Parallel Computing (CUDA, MPI)
- Distributed Computing (MapReduce, Spark)
- Big Data

Page 4:

Motivations

"What are the important problems of your field?"
- Richard Hamming

Page 5:

Data Storage is Cheap

- Store everything, because it's cheap (NSA...)

Page 6:

Big Problems in Computing

Two problems:
1. We are storing more data than we can effectively process (n → ∞)
2. Stagnated clock speeds
   - materials problem
   - energy problem
   - fundamental cooling problem (Landauer's Principle)

Page 7:

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms

Page 8:

Parallel Computing

Simple: add more processors!

Basic issues:
- Communication bottlenecks
- Algorithmic bottlenecks

Page 9:

Scaled Speedup Example

[Plot: Total Speedup as a function of Parallel Speedup (parallelism 0-200), comparing our theoretical speedup (93.4%) against the 90%, 95%, and 97% serial-code curves and the theoretical speedup limit.]

- The 80/20 rule (Pareto) isn't even on here!
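The curves above are instances of Amdahl's law. A minimal sketch of how such speedup curves are computed (the 93.4% figure is taken from the slide; everything else here is illustrative):

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's law: overall speedup is capped by the serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# With 93.4% parallel code, speedup saturates near 1/0.066, no matter
# how many processors are added.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.934, p), 2))
```

Even at 1000 processors the 6.6% serial remainder holds total speedup under 15x, which is why the plot's curves flatten so quickly.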

Page 10:

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms

Page 11:

Has Moore’s Law Stalled?

- Clock speed is dead.

Page 12:

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms

Page 13:

Better Algorithms?

- This attack plan is not well defined
- Some algorithms are already optimal

Page 14:

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms

- What about the overlaps?

Page 15:

RPHASH

Random Projection Hashing (RPHash) goals:
- Scalability
- Minimize communication complexity

Tradeoffs:
- Redundancy
- Accuracy

Page 16:

Related Work

Database clustering methods:
- DBSCAN
- CLIQUE
- CLARANS
- PROCLUS

Page 17:

Background

- Big Data
- COD (Curse of Dimensionality)
- Locality Sensitive Hash Functions
- Space Partitioning
- Lattices
  - A Decoding Example
  - Leech Lattice
  - Leech Decoder
- Functional Programming
- MR/Hadoop
- Random Projection

Page 18:

Working Definition of “Big Data”

Sales and commercial hype aside,

Definition (Big Data): A set of data processing problems in which the required data is too large to reside in main memory. Thrashing between main memory and disk (even solid state) makes an otherwise scalable algorithm unscalable.

Examples:
- Health metrics
- DNA sequences
- Website/click metrics

Page 19:

COD

Curse of Dimensionality

COD is sometimes cited as the cause of the distance function losing its usefulness in high dimensions. This arises from the ratio of the metric space partition (the hypercube) to the embedded hypersphere:

lim(d→∞) Vol(S_d)/Vol(C_d) = lim(d→∞) π^(d/2) / (d · 2^(d−1) · Γ(d/2)) → 0

Given a single distribution, the minimum and maximum distances become indiscernible; equivalently, the relative majority of the space lies outside of the sphere.
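The vanishing ratio above can be checked numerically (a small illustrative script, not from the slides):

```python
import math

def sphere_to_cube_ratio(d: int) -> float:
    """Volume of the unit d-ball divided by its bounding cube [-1, 1]^d:
    pi^(d/2) / (d * 2^(d-1) * Gamma(d/2))."""
    return math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))

for d in (2, 3, 8, 24):
    print(d, sphere_to_cube_ratio(d))  # the ratio collapses toward 0 as d grows
```

At d = 2 the ratio is the familiar π/4 ≈ 0.785; by d = 24 (the Leech lattice dimension used later) it is already below 10⁻⁹.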

Page 20:

LSH Hash Families

Definition (Locality Sensitive Hash Function): a family H = {h : S → U} is (r1, r2, p1, p2)-sensitive if for any u, v ∈ S:

1. if d(u, v) ≤ r1 then Pr_H[h(u) = h(v)] ≥ p1
2. if d(u, v) > r2 then Pr_H[h(u) = h(v)] ≤ p2

For this family, ρ = log p1 / log p2.

Page 21:

An Example Hash Family

[Diagram: a set of points projected onto a line, with the line partitioned into buckets 0, 1, 2, 3, ... of width w.]

Figure: Random Projection of R2 → R1
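A minimal sketch of this hash family (the standard projection-and-quantize LSH, h(v) = ⌊(a·v + b)/w⌋; the function and variable names here are illustrative):

```python
import numpy as np

def make_projection_hash(dim: int, w: float, seed: int = 0):
    """Project onto a random line, then cut the line into width-w buckets."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=dim)   # random projection direction
    b = rng.uniform(0.0, w)    # random offset of the bucket grid
    def h(v):
        return int(np.floor((a @ v + b) / w))
    return h

h = make_projection_hash(dim=2, w=1.0)
u = np.array([0.5, 0.5])
print(h(u), h(u + 0.01))  # nearby points usually land in the same bucket
```

Points within distance r of each other collide with high probability, while far-apart points rarely do, which is exactly the (r1, r2, p1, p2)-sensitivity of the previous slide.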

Page 22:

Voronoi Tiling

Voronoi partitioning is optimal in 2D.

Figure: Voronoi Partitioning of R2

Page 23:

Voronoi

- Voronoi diagrams make for very efficient hash functions in 2D because, by definition, a point within a Voronoi region is nearest to the region's representative point.
- Voronoi regions provide an optimal solution to NN partitioning in 2D space!
- However, for arbitrary dimension d, Voronoi diagrams require Θ(n^(d/2)) space, and no known optimal point-location algorithm exists.

Page 24:

Lattices

Instead we will consider lattices, which provide regular space partitioning, scale to arbitrarily large dimensional spaces, and have sub-linear nearest-center search algorithms associated with them.

Definition (Lattice in Rn): let v1, ..., vn be n linearly independent vectors, where vi = (vi,1, vi,2, ..., vi,n). The lattice Λ with basis {v1, ..., vn} is the set of all integer combinations of v1, ..., vn; these integer combinations of the basis vectors are the points of the lattice.

Λ = {z1 v1 + z2 v2 + ... + zn vn | zi ∈ Z, 1 ≤ i ≤ n}
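A tiny illustration of the definition, enumerating points of the hexagonal lattice from a basis (the basis choice and range are ours, for illustration):

```python
import numpy as np

# a standard basis for the hexagonal lattice in R^2
v1 = np.array([1.0, 0.0])
v2 = np.array([0.5, np.sqrt(3) / 2])

# all integer combinations z1*v1 + z2*v2 for small z1, z2
points = [z1 * v1 + z2 * v2 for z1 in range(-2, 3) for z2 in range(-2, 3)]
print(len(points))  # → 25
```

Swapping v2 for (0, 1) yields the square lattice instead; the definition is the same, only the basis changes.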

Page 25:

Examples in 2D

Figure: Square (left) and Hexagonal (right) Lattices in R2

Page 26:

Constant Time Decoding

- Certain lattices allow us to find the nearest representative point in constant time.
- For example, the above square lattice: the nearest point can be found by simply rounding each coordinate of our real-valued point to its nearest integer.
- With the exception of a few exceptional lattices, denser lattices have more complex searches (exponential as d increases).
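For the square (integer) lattice the entire search is literally one rounding step:

```python
import numpy as np

# nearest Z^3 lattice point: round each coordinate independently
x = np.array([0.2, 1.7, -0.6])
nearest = np.rint(x)  # nearest == [0., 2., -1.]
print(nearest)
```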

Page 27:

Exceptional Higher Dimensional Lattices

The previous lattices work well in R2, but our data spaces are in general of dimension ≫ 2.

- Fortunately there are some higher dimensional lattices with efficient nearest-center search algorithms.
- E8, or Gosset's lattice, is one such lattice in R8.
- It is also the densest lattice packing in R8.

Page 28:

Example of decoding E8

E8 can be formed by gluing two D8 integer lattices together, one shifted by a vector of 1/2s. This gluing of less dense lattices and shifting by a "glue vector" is a common theme in finding dense lattices.

- Decoding D8 is simple
- E8 = D8 ∪ (D8 + 1/2)
- Both cosets of D8 can be computed in parallel
- D8's decoding algorithm consists of rounding all values to their nearest integer such that they sum to an even number

Page 29:

Example of decoding E8

Define f(x) and g(x) to round the components of x, except in g(x) we round the value furthest from an integer in the wrong direction. Let

x = <0.1, 0.1, 0.8, 1.3, 2.2, −0.6, −0.7, 0.9>

then
f(x) = <0, 0, 1, 1, 2, −1, −1, 1>, sum = 3

and
g(x) = <0, 0, 1, 1, 2, 0, −1, 1>, sum = 4

- Since the sum of g(x) is even, g(x) is the nearest lattice point in D8.

Page 30:

Example of decoding E8 conti..

- Include the coset D8 + 1/2.
- We can do this by subtracting 1/2 from all the values of x.

f(x − 1/2) = <0, 0, 0, 1, 2, −1, −1, 0>, sum = 1

g(x − 1/2) = <−1, 0, 0, 1, 2, −1, −1, 0>, sum = 0

Now we find the coset representative that is closest to x using a simple distance metric:

‖x − g(x)‖² = 0.65
‖x − (g(x − 1/2) + 1/2)‖² = 0.95

So in this case the nearest point is the first coset representative:

<0, 0, 1, 1, 2, 0, −1, 1>
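The worked example above can be sketched directly in code (an illustrative decoder following the slides' f/g construction; this is not the thesis implementation):

```python
import numpy as np

def decode_D8(x):
    """Nearest D8 point (integer vector with even sum): round every coordinate,
    and if the sum is odd, re-round the coordinate furthest from an integer
    in the "wrong" direction (the slides' g(x) construction)."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def decode_E8(x):
    """E8 = D8 ∪ (D8 + 1/2): decode in both cosets, keep the closer point."""
    x = np.asarray(x, dtype=float)
    c0 = decode_D8(x)
    c1 = decode_D8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

x = np.array([0.1, 0.1, 0.8, 1.3, 2.2, -0.6, -0.7, 0.9])
print(decode_E8(x))  # the slide's answer: <0, 0, 1, 1, 2, 0, -1, 1>
```

Both coset decodes are independent, which is the parallelism noted on the previous slide.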

Page 31:

Leech Lattice

By gluing sets of E8 together in a way originally conceived by Curtis' MOG, we can get an even higher dimensional dense lattice called the Leech lattice.

Page 32:

Leech Lattice

Here we will state some attributes of the Leech lattice, as well as give a comparison to other lattices by way of Eb/N0 and the computational cost of decoding.

Some important attributes:
- Densest regular lattice packing in R24
- Lattice construction can be based on 2 cosets of G24 (the binary Golay code)
- Sphere packing density: π^12 / 12! ≈ 0.00192957
- Kissing number Kmin = 196560

Page 33:

Leech Lattice

[Plot: probability of bit error, log10(Pb), vs. Eb/N0 (0-25 dB) over 3072000 samples, comparing Leech hex (−0.2 dB) with QAM16, E8 with 2PSK, unencoded QAM64, and unencoded QAM16.]

Figure: Performance of Some Coded and Unencoded Data Transmission Schemes

Page 34:

Leech Lattice Decoding

Some information about the decoding algorithm:
- The decoding of the Leech lattice is based closely on the decoding of the Golay code.
- In general, advances in either Leech decoding or binary Golay decoding imply an advance in the other.
- The decoding method used in this implementation is based on Amrani and Be'ery's '96 publication for decoding the Leech lattice; it consists of around 519 floating point operations and suffers a gain loss of only 0.2 dB.
- In general, decoding complexity scales exponentially with dimension.

Next is an outline of the decoding process.

Page 35:

Leech Decoder

[Block diagram: 24 received reals feed parallel QAM decoders for the A and B points; each branch computes block confidences over even and odd representatives, minimizes over H6, and applies H-parity and K-parity checks, after which candidate codewords are compared.]

Figure: Leech Lattice Decoder

Page 36:

Functional Programming

Figure: Scargill, by Tim (timble.me.uk/blog/author/tim)

Parallel and Functional Programming
- Instead of moving the mountain to the people, move the people to the mountain
- Where mountains are data and people are functions, respectively
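A toy sketch of that idea in Python (functions shipped to data partitions; the partitions and numbers are made up):

```python
from functools import reduce

data_partitions = [[1, 2, 3], [4, 5], [6]]   # "mountains": data stays put

# "people": a function is shipped to each partition and run locally
mapped = [sum(x * x for x in part) for part in data_partitions]

# only the small partial results travel; one reduce combines them
total = reduce(lambda a, b: a + b, mapped)
print(total)  # → 91
```

The large inputs never move; only the tiny per-partition sums cross the "network", which is the communication pattern map-reduce systems exploit.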

Page 37:

Map Reduce Program Design

Figure: Map Reduce (courtesy: map-reduce.wikispaces.asu.edu)

Page 38:

Hadoop

Hadoop is an open source implementation of the map reduce framework, created and maintained by the Apache Software Foundation.

Benefits:
- Open source
- Popular, and maintained
- Free
- Implemented on and compatible with Amazon EC2
- Takes care of the networking and fault tolerance drudgery of parallel system programming

Page 39:

Hadoop Parallel System Design

Figure : Hadoop System Design (ibm.com/developerworks)

Page 40:

EC2 Cloud

Cloud and distributed services:
- Scalable to a data processing problem's needs
- Very low cost processing model
- Always up-to-date HW resources
- Zero HW maintenance and overhead costs

Page 41:

Mahout

Mahout is an open source library of machine learning algorithms made for Hadoop.

Clustering algorithms:
- Canopy Clustering
- K-Means
- Mean Shift
- LDA
- MinHash

Page 42:

Random Projection

Figure: Random Projection: R3 → R2

Page 43:

Random Projection

Theorem (JL Lemma)

(1 − ε)‖u − u′‖² ≤ ‖f(u) − f(u′)‖² ≤ (1 + ε)‖u − u′‖²

where:
ε - a distortion quantity
u, u′ ∈ U - two independent vectors
f - a random projection mapping Rd → Rl

l ∝ Θ(log(n) / (ε² log(1/ε)))
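A quick numerical sanity check of the lemma (an illustrative script; the dimensions and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(42)
d, l = 1000, 200
u, v = rng.normal(size=d), rng.normal(size=d)

# Gaussian random projection, scaled so squared norms are preserved in expectation
R = rng.normal(size=(d, l)) / np.sqrt(l)
fu, fv = u @ R, v @ R

orig = np.sum((u - v) ** 2)
proj = np.sum((fu - fv) ** 2)
print(proj / orig)  # concentrates near 1: the distance survives the projection
```

With l = 200 the typical relative distortion is around √(2/l) ≈ 10%, in line with the ε of the lemma.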

Page 44:

FJLT

Fast Johnson-Lindenstrauss Transform
- New and cool!
- Uses the Heisenberg uncertainty principle from harmonic analysis: a signal and its spectrum cannot both be concentrated
- Precondition the projection with a DFT (some matrices have very fast DFTs)
- ≈ Θ(d log(d) + ε⁻³ log²(n)) vs Θ(dε⁻² log(n))

Page 45:

Motivations of RP Hash

- Parallel structure of a scalable parallel algorithm (Log Reduce)
- Low communication overhead (hashes)
- Non-parallel iterative structure (per-core redundancy)
- Approximation is usually good enough

Page 46:

General Idea of RPHash

Figure: Multi-Probe Random Projection Intersection Probabilities

- generative space quantization
- random projection
- sequential multi-probe stochastic process

Page 47:

Occultation Problem

The occultation problem is the probability of two or more independent distributions overlapping in projected space.

- Based on the distribution variance and the angle of the projective plane
- Applicable bounds from Urruty '07, where d is the number of probes:

lim(d→∞) 1 − (2(r1 + r2) / (π‖d − c‖))^d

- In RPHash, d is the dimensionality (24).
- RPHash projections are orthogonal

Page 48:

Sequential algorithm

1. Generate random projection matrix P
2. Maintain DBcount, counts of hash IDs
3. Maintain DBcent, an array of centroids corresponding to vectors
4. For all x ∈ X:
   4.1 ID = LatticeDec(xPᵀ)
   4.2 DBcount[ID]++
   4.3 DBcent[ID] += x
5. Sort [DBcount, DBcent] by count
6. Return DBcent[0 : k]
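A single-machine sketch of the steps above. The Leech decoder is replaced by simple integer rounding (Z^d) purely for illustration, and all names are ours, not the thesis code:

```python
import numpy as np
from collections import defaultdict

def rphash_sequential(X, k, d=24, seed=0):
    """Steps 1-3: build P, DBcount, DBcent. Step 4: hash each projected
    vector to a lattice point. Steps 5-6: return the k fullest buckets'
    centroid averages."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(m, d)) / np.sqrt(d)       # random projection matrix
    DBcount = defaultdict(int)
    DBcent = defaultdict(lambda: np.zeros(m))
    for x in X:
        ID = tuple(np.rint(x @ P))                 # stand-in for LatticeDec(x P^T)
        DBcount[ID] += 1
        DBcent[ID] += x
    top = sorted(DBcount, key=DBcount.get, reverse=True)[:k]
    return [DBcent[ID] / DBcount[ID] for ID in top]

# two tight, well-separated Gaussian blobs as a smoke test
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.01, (100, 50)), rng.normal(5, 0.01, (100, 50))])
centroids = rphash_sequential(X, k=2)
```

Because only bucket counts and running sums are kept, memory is proportional to the number of occupied buckets, not to n.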

Page 49:

Comparison with standard k-means

Page 50:

Sequential Algorithm Time Results

[Two plots: elapsed time (s) on Gaussian data. Left, time vs. number of vectors (5000-20000): RP Hash mean y = -0.101 + 0.0002x, K-Means mean y = -0.050 + 8.3470E-5x. Right, time vs. number of dimensions (4000-13000): RP Hash mean y = .3730 + .0006x, K-Means mean y = .0963 + .0001x.]

Page 51:

Sequential Algorithm Accuracy Results

[Two plots: precision-recall performance on 50:50 Gaussian data. Left, PR vs. number of clusters (0-9): RP Hash mean y = .8803 - .0155x, K-Means mean y = .9963 - .0189x. Right, PR vs. dimensions (0-16000): RP Hash mean y = .7942 + 3.885E-7x, K-Means mean y = .8681 - 1.040E-6x.]

Page 52:

Scalability Goal

- It would be nice if our algorithm scaled with the number of processing nodes
- Let's look at an algorithm that scales well and try to apply it to our problem
- The canonical Hadoop "Hello World" simulacrum, "Word Count", is a good place to start

Page 53:

Hadoop Word Count
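The slide's listing did not survive transcription; a minimal map-reduce style word count (plain Python standing in for the Hadoop Java API, with made-up input text):

```python
from collections import Counter
from itertools import chain

def mapper(line):
    """Map: emit (word, 1) for each word in a line."""
    return [(w.lower(), 1) for w in line.split()]

def reducer(pairs):
    """Reduce: sum the counts per key."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reducer(chain.from_iterable(mapper(l) for l in lines))
print(result["the"], result["fox"])  # → 3 2
```

Each mapper runs independently on its split of the input, and the reducer only ever sees small (word, count) pairs, which is why word count scales almost linearly with nodes.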

Page 54:

Hadoop Word Count Scalability

Page 55:

Basic Outline

- Each processing node computes hashes and counts
- Share the top k buckets with all computers
- Aggregate all centroid averages

A trick to minimize communication and storage requirements: two phases.
- Phase 1: Only store counts and communicate IDs
- Phase 2: Only accept hash collisions with Phase 1's top IDs for all clusters

Page 56:

Parallel Algorithm Phase 1

begin
  X = {x1, ..., xn}, xk ∈ Rm - data vectors
  D - set of available compute nodes
  H - a d-dimensional LSH function
  X̃ ⊆ X - vectors per compute node
  Pm→d - Gaussian projection matrix
  Cs = {∅} - set of bucket collision counts
  foreach xk ∈ X̃ do
    x̃k ← √(m/d) Pᵀ xk
    t = H(x̃k)
    Cs[t] = Cs[t] + 1
  end
  sort({Cs, Cs.index})
  return {Cs, Cs.index}[0 : k log(n)]
end

Page 57:

Parallel Algorithm Phase 2

begin
  X = {x1, ..., xn}, xk ∈ Rm - data vectors
  D - set of available compute nodes
  {Cs, Cs.index} - set of k log(n) cluster IDs and counts from Phase 1
  H - a d-dimensional LSH function
  X̃ ⊆ X - vectors per compute node
  pm→d ∈ P - Gaussian projection matrices
  C = {∅} - set of centroids
  foreach xk ∈ X̃ do
    foreach pm→d ∈ P do
      x̃k ← √(m/d) pᵀ xk
      t = H(x̃k)
      if t ∈ Cs.index then
        C[t] = C[t] + xk
      end
    end
  end
  return C
end
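A single-node sketch of the two phases (one projection matrix instead of the pseudocode's set P, and an integer-rounding hash standing in for the Leech decoder; variable names follow the pseudocode but the code is illustrative):

```python
import math
from collections import defaultdict
import numpy as np

def lsh(x_tilde):
    # stand-in d-dimensional LSH: nearest Z^d point (the thesis uses a Leech decoder)
    return tuple(np.rint(x_tilde))

def phase1(X_local, P, k):
    """Count bucket collisions locally; return the k*log(n) most popular IDs."""
    m, d = P.shape
    Cs = defaultdict(int)
    for x in X_local:
        Cs[lsh(math.sqrt(m / d) * (P.T @ x))] += 1
    keep = int(k * math.log(len(X_local)))
    return sorted(Cs.items(), key=lambda kv: -kv[1])[:keep]

def phase2(X_local, P, top):
    """Accumulate only vectors whose hash collides with a Phase 1 top ID."""
    m, d = P.shape
    ids = {t for t, _ in top}
    C, counts = defaultdict(lambda: np.zeros(m)), defaultdict(int)
    for x in X_local:
        t = lsh(math.sqrt(m / d) * (P.T @ x))
        if t in ids:
            C[t] += x
            counts[t] += 1
    return {t: C[t] / counts[t] for t in C}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.01, (100, 20)), rng.normal(5, 0.01, (100, 20))])
P = rng.normal(size=(20, 8)) / np.sqrt(8)
centroids = phase2(X, P, phase1(X, P, k=2))
```

Only short bucket IDs and counts cross the network after Phase 1; full m-dimensional sums are exchanged once, at the end of Phase 2, which is the communication saving the outline slide describes.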

Page 58:

What to Use this For?

[Heatmap: rows are Affymetrix probe sets with gene symbols (e.g. 202917_s_at : S100A8, 204475_at : MMP1, 203915_at : CXCL9, 201291_s_at : TOP2A, 202095_s_at : BIRC5, 201693_s_at : EGR1, 219197_s_at : SCUBE2, ...), columns are GSM sample IDs (GSM65317, GSM65330, GSM65316, ...), clustering absolute log2 levels; color key from −3 to 3.]

Figure: Gene Expression Levels in Primary Breast Cancer Tumor Samples

Page 59:

Data Security in BioInformatics

- New attacks on anonymized data present a risk to patients' privacy.
- Very few cloud services guarantee secure processing.
- Highly distributed systems add even more attack vectors.
- Attacks prompted a Presidential commission on WGS privacy.

Page 60:

Data Security: A Freebie!

- Random projection offers some protections.
- The only full vectors transmitted are cluster centroids, which by definition are an aggregate of many vectors.
- Showing dissimilarity should be somewhat straightforward:

v = √(n/k) Rᵀ u,  v′ = √(k/n) vᵀ R̂⁻¹

similarity = ‖v, v′‖₂

Page 61:

Questions?