Parallel Deterministic Annealing Clustering and its Application to LC-MS Data Analysis

October 7 2013
IEEE International Conference on Big Data 2013 (IEEE BigData 2013), Santa Clara CA
Geoffrey Fox, D. R. Mani, Saumyadipta Pyne
[email protected] http://www.infomall.org
School of Informatics and Computing, Indiana University Bloomington
https://portal.futuregrid.org



Challenge
• Deterministic Annealing was introduced ~1990 for clustering, but there are no broadly available implementations although most tests rate it well
• 2000: Extended to non-metric spaces (Hofmann and Buhmann)
• 2010-2013: Applied to dimension reduction, PLSI, Gaussian mixtures etc. (Fox et al.)
• 2011: Applied to the model-dependent, single-cluster-at-a-time “peak matching” problem of the precise identification of the common LC-MS peaks across a cohort of multiple biological samples in proteomic biomarker discovery, for data from a human tuberculosis cohort (Frühwirth, Mani and Pyne)
• Here we apply “multi-cluster”, “annealed models”, “Continuous Clustering” and “parallelism” to the proteomics case, giving a high performance, automatic, robust method to quickly analyze proteomics samples such as those taken in rapidly spreading epidemics


Deterministic Annealing Algorithms


Some Motivation
• Big Data requires high performance – achieve with parallel computing
• Big Data sometimes requires robust algorithms as there is more opportunity to make mistakes
• Deterministic annealing (DA) is one of the better approaches to robust optimization
– Started as “Elastic Net” by Durbin for Travelling Salesman Problem (TSP)
– Tends to remove local optima
– Addresses overfitting
– Much faster than simulated annealing
• Physics systems find true lowest energy state if you anneal, i.e. you equilibrate at each temperature as you cool
• Uses mean field approximation, which is also used in “Variational Bayes” and “Variational inference”


(Deterministic) Annealing
• Find minimum at high temperature when trivial
• Small change avoiding local minima as lower temperature
• Typically gets better answers than standard libraries (R and Mahout)
• And can be parallelized and put on GPUs etc.


Basic Deterministic Annealing
• $H(\xi)$ is objective function to be minimized as a function of parameters $\xi$
• Gibbs Distribution at Temperature T: $P(\xi) = \exp(-H(\xi)/T) \,/ \int d\xi \, \exp(-H(\xi)/T)$
• Or $P(\xi) = \exp(-H(\xi)/T + F/T)$
• Minimize Free Energy combining Objective Function and Entropy: $F = \langle H - T S(P) \rangle = \int d\xi \, \{ P(\xi) H(\xi) + T P(\xi) \ln P(\xi) \}$
• Simulated annealing performs these integrals by Monte Carlo
• Deterministic annealing corresponds to doing integrals analytically (by mean field approximation) and is much much faster
• In each case temperature is lowered slowly – say by a factor 0.95 to 0.9999 at each iteration
• Start with one cluster (all there is at T = ∞), others emerge automatically as T decreases
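As a hedged illustration of the Gibbs distribution and free energy on this slide, the Python sketch below evaluates them on a small discrete set of states, so the integral over the parameters becomes a sum. The energies in `H`, the function names and the cooling factor 0.95 are illustrative assumptions, not the authors' code.

```python
import math

def gibbs(H, T):
    """Return (P, F) for discrete energies H at temperature T.

    P[s] = exp(-H[s]/T) / Z and F = -T ln Z, the discrete analogue of
    the slide's formulas (the integral is replaced by a sum).
    """
    h_min = min(H)  # shift energies for numerical stability
    weights = [math.exp(-(h - h_min) / T) for h in H]
    Z = sum(weights)
    P = [w / Z for w in weights]
    F = h_min - T * math.log(Z)
    return P, F

def free_energy_from_P(H, P, T):
    """F = <H> - T S(P) = sum_s { P(s) H(s) + T P(s) ln P(s) }."""
    return sum(p * h + T * p * math.log(p) for p, h in zip(P, H) if p > 0)

H = [0.0, 1.0, 4.0]   # hypothetical objective values for three states
T = 100.0             # start hot: P is nearly uniform
while T > 0.01:
    P, F = gibbs(H, T)
    T *= 0.95         # cooling factor within the range quoted on the slide
```

At high T the distribution is almost uniform; as T falls it concentrates on the lowest-energy state, and the two expressions for F agree at every temperature.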


Implementation of DA Central Clustering
• Here points to be clustered are in a metric space
• Clustering variables are $M_i(k)$, the probability that point i belongs to cluster k, with $\sum_{k=1}^{K} M_i(k) = 1$
• In Central or PW Clustering, take $H_0 = \sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k) \, \varepsilon_i(k)$
– Linear form allows DA integrals to be done analytically
• Central clustering has $\varepsilon_i(k) = (X(i) - Y(k))^2$ and $M_i(k)$ determined by Expectation (E) step
– $H_{Central} = \sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k) \, (X(i) - Y(k))^2$
• $\langle M_i(k) \rangle = \exp(-\varepsilon_i(k)/T) \,/ \sum_{k'=1}^{K} \exp(-\varepsilon_i(k')/T)$
• Centers Y(k) are determined in M step of EM method
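The E and M steps above can be sketched in a few lines of plain Python. This is a minimal sketch at a single fixed temperature, assuming points and centers are lists of coordinates; the function names, the toy data and the choice T = 1 are hypothetical, and the M step uses the membership-weighted mean given later in the deck's Basic Equations.

```python
import math

def e_step(X, Y, T):
    """<M_i(k)> = exp(-eps_i(k)/T) / sum_k' exp(-eps_i(k')/T)."""
    M = []
    for x in X:
        eps = [sum((a - b) ** 2 for a, b in zip(x, y)) for y in Y]
        e_min = min(eps)  # subtract the minimum to avoid underflow
        w = [math.exp(-(e - e_min) / T) for e in eps]
        total = sum(w)
        M.append([wk / total for wk in w])
    return M

def m_step(X, M, K):
    """Y(k) = sum_i <M_i(k)> X(i) / sum_i <M_i(k)>."""
    dim = len(X[0])
    Y = []
    for k in range(K):
        c_k = sum(m[k] for m in M)
        Y.append([sum(m[k] * x[d] for m, x in zip(M, X)) / c_k
                  for d in range(dim)])
    return Y

X = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]  # toy points
Y = [[1.0, 1.0], [4.0, 4.0]]                           # initial centers
for _ in range(20):
    M = e_step(X, Y, T=1.0)
    Y = m_step(X, M, K=2)
```

Because the two toy groups are far apart compared with √T, the memberships are effectively hard and each center settles on the mean of its group.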


Trimmed Clustering
• Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis (Rudolf Frühwirth, D R Mani and Saumyadipta Pyne) BMC Bioinformatics 2011, 12:358
• $H_{TCC} = \sum_{k=0}^{K} \sum_{i=1}^{N} M_i(k) \, f(i,k)$
– $f(i,k) = (X(i) - Y(k))^2 / 2\sigma(k)^2$ for k > 0
– $f(i,0) = c^2 / 2$ for k = 0
• The 0’th cluster captures (at zero temperature) all points outside clusters (background)
• Clusters are trimmed: $(X(i) - Y(k))^2 / 2\sigma(k)^2 < c^2 / 2$
• Relevant when well defined errors

[Figure: curves for T ~ 0, T = 1 and T = 5 plotted against distance from cluster center]
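To make the role of the 0’th (sponge) cluster concrete, here is a small Python sketch for 1D points. It assumes the memberships take the same Gibbs form as in central clustering, with f(i,k) in place of the squared distance; that form, the function name and the example values are assumptions for illustration rather than the authors' implementation.

```python
import math

def trimmed_memberships(x, centers, sigmas, c, T):
    """Return [M(0), M(1), ..., M(K)] for a 1D point x.

    Index 0 is the sponge with f(i,0) = c^2 / 2; cluster k > 0 has
    f(i,k) = (x - Y(k))^2 / (2 sigma(k)^2).
    """
    f = [c * c / 2.0]
    f += [(x - y) ** 2 / (2.0 * s * s) for y, s in zip(centers, sigmas)]
    f_min = min(f)  # subtract the minimum to avoid underflow
    w = [math.exp(-(fk - f_min) / T) for fk in f]
    total = sum(w)
    return [wk / total for wk in w]

# One cluster at 0 with sigma = 1 and cut c = 2 (hypothetical values)
inside = trimmed_memberships(0.5, [0.0], [1.0], c=2.0, T=0.01)
outside = trimmed_memberships(3.0, [0.0], [1.0], c=2.0, T=0.01)
soft = trimmed_memberships(3.0, [0.0], [1.0], c=2.0, T=5.0)
```

Near zero temperature the point at distance 0.5 belongs to the cluster and the point at distance 3 (beyond c·σ = 2) falls into the sponge; at T = 5 the same point is still shared between the two.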


Key Features of Proteomics
• 2-dimensional data arising from a list of peaks specified by points (m/Z, RT), where m/Z is the mass to charge ratio, and RT the retention time for the peak representing a peptide.
• Measurement errors $\sigma(m/Z) = 5.98 \times 10^{-6}$ m/Z and $\sigma(RT) = 2.35$
– Ratio of errors drastically different from ratio of dimensions of (m/Z, RT) space which could distort high temperature limit – solve by annealing $\sigma(m/Z)$
• $\Delta_{2D}(x) = ((m/Z|_{\text{cluster center}} - m/Z|_{x}) / \sigma(m/Z))^2 + ((RT|_{\text{cluster center}} - RT|_{x}) / \sigma(RT))^2$ is the model
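The scaled distance in the last bullet can be written as a short Python function. This is a sketch only: the constants come from the slide, while the function name, the tuple layout and the choice to evaluate the m/Z error at the cluster center's m/Z are assumptions.

```python
SIGMA_MZ_REL = 5.98e-6  # sigma(m/Z) = 5.98e-6 * m/Z, from the slide
SIGMA_RT = 2.35         # sigma(RT), from the slide

def delta_2d(peak, center):
    """Scaled squared distance between a peak and a cluster center.

    Both arguments are (m/Z, RT) tuples. Assumption: sigma(m/Z) is
    evaluated at the cluster center's m/Z.
    """
    mz_c, rt_c = center
    mz_x, rt_x = peak
    sigma_mz = SIGMA_MZ_REL * mz_c
    return ((mz_c - mz_x) / sigma_mz) ** 2 + ((rt_c - rt_x) / SIGMA_RT) ** 2

# A peak one measurement error away from the center in each dimension
d = delta_2d(peak=(1000.00598, 52.35), center=(1000.0, 50.0))
```

A peak that is one standard error off in both m/Z and RT gives a value close to 2, which is how a cut such as c = 2 (value 4) or c = 1 (value 1) can be read.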


General Features of DA
• In many problems, decreasing temperature is classic multiscale – finer resolution ($\sqrt{T}$ is “just” distance scale)
– We have factors like $(X(i) - Y(k))^2 / T$
• In clustering, one then looks at second derivative matrix (can derive analytically) of Free Energy wrt each cluster position and as temperature is lowered this develops negative eigenvalue
– Or have multiple clusters at each center and perturb
– Trade-off depends on problem – high dimension takes time to find eigenvector; we use eigenvectors here as 2D
• This is a phase transition and one splits cluster into two and continues EM iteration till desired resolution reached
• One can start with just one cluster
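For the simplest case, a single cluster holding all points with the squared-distance objective, the negative eigenvalue appears when T drops below twice the largest eigenvalue of the data covariance matrix. This is the standard result for the first split in Rose, Gurewitz and Fox (1990). The Python sketch below computes that critical temperature for 2D points; the function name and the test data are hypothetical, and the general multi-cluster case (which needs membership-weighted covariances) is not covered.

```python
import math

def critical_temperature_2d(points):
    """T_c = 2 * lambda_max of the covariance of 2D points.

    For H = sum (X - Y)^2 and one cluster holding all points, the
    center becomes unstable (splits) once T drops below T_c.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n           # var(x)
    c = sum((p[1] - my) ** 2 for p in points) / n           # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n  # cov(x, y)
    # Largest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]
    lam_max = (a + c) / 2.0 + math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return 2.0 * lam_max

t_c = critical_temperature_2d([(-1.0, 0.0), (1.0, 0.0), (0.0, -0.5), (0.0, 0.5)])
```

In two dimensions the eigenvalue has a closed form, which fits the slide's remark that eigenvectors are cheap here while they take time in high dimension.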


• System becomes unstable as Temperature lowered and there is a phase transition and one splits cluster into two and continues EM iteration

• One can start with just one cluster and need NOT specify desired # clusters; rather specify cluster resolution

Rose, K., Gurewitz, E., and Fox, G. C. ``Statistical mechanics and phase transitions in clustering,'' Physical Review Letters, 65(8):945-948, August 1990.

My #6 most cited article (456 cites including 16 in 2013)


Proteomics 2D DA Clustering T= 25000 with 60 Clusters (will be 30,000 at T=0.025)


The brownish triangles are sponge peaks outside any cluster.

The colored hexagons are peaks inside clusters, with the white hexagons being the determined cluster centers.

Fragment of 30,000 Clusters, 241605 Points


Continuous Clustering
• This is a very useful subtlety introduced by Ken Rose but not widely known although it greatly improves algorithm
• Take a cluster k to be split into 2 with centers Y(k)A and Y(k)B with initial values Y(k)A = Y(k)B at original center Y(k)
• Then typically if you make this change and perturb the Y(k)A and Y(k)B, they will return to starting position as F at stable minimum (positive eigenvalue)
• But instability (the negative eigenvalue) can develop and one finds

[Figure: three sketches of Free Energy F, plotted against Y(k)A and Y(k)B, against Y(k)A + Y(k)B, and against Y(k)A - Y(k)B]

• Implement by adding arbitrary number p(k) of centers for each cluster: $Z_i = \sum_{k=1}^{K} p(k) \exp(-\varepsilon_i(k)/T)$ and M step gives p(k) = C(k)/N
• Halve p(k) at splits; can’t split easily in standard case p(k) = 1
• Shows weighting in sums like $Z_i$ now equipoint not equicluster as p(k) proportional to number of points C(k) in cluster
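The "halve p(k) at splits" rule can be sketched as follows. This is an illustrative Python helper, not the authors' code: the function name and the `delta` offset are hypothetical, and the children are displaced along the first coordinate purely for simplicity, whereas the deck mentions using eigenvectors in the 2D case.

```python
def split_cluster(Y, p, k, delta=1e-3):
    """Split cluster k into two children, halving its weight p(k).

    Y is a list of centers (lists of floats), p the list of weights.
    Children start at the parent center, offset by +/- delta along the
    first coordinate (an arbitrary direction, chosen for illustration).
    """
    parent = Y[k]
    child_a = list(parent)
    child_b = list(parent)
    child_a[0] += delta
    child_b[0] -= delta
    new_Y = Y[:k] + [child_a, child_b] + Y[k + 1:]
    new_p = p[:k] + [p[k] / 2.0, p[k] / 2.0] + p[k + 1:]
    return new_Y, new_p

Y2, p2 = split_cluster([[0.0, 0.0], [5.0, 5.0]], [0.4, 0.6], k=1)
```

With `delta = 0` the two children contribute p(k)/2 + p(k)/2 = p(k) times the same exponential to each Z_i, so the split leaves every sum unchanged. That is why the change is continuous and the children only separate once the instability develops.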


Deterministic Annealing for Proteomics


Proteomics Clustering Methods
• DAVS(c) is Parallel Trimmed DA clustering with clusters satisfying $\Delta_{2D}(x) \le c^2$; c annealed from large value to given value at T~2
• DA2D scales m/Z and RT so clusters “circular” but does NOT trim them; there are no “sponge points”; there are clusters with 1 point
• All start with 1 cluster center and use
– “Continuous Clustering”
– Anneal $\sigma(m/Z)$ from ~1 at T = ∞ to $5.98 \times 10^{-6}$ m/Z at T~10
– Anneal c from T=40 to 2 (only introduce trimming at low temperatures)
• Mclust uses standard model-based clustering without deterministic annealing
• Landmarks are a collection of reference peaks (obtained by identifying a subset of peaks using MS/MS peptide sequencing).


[Figure: Cluster Count v. Temperature for 2 Runs, DAVS(2) and DA2D. x-axis: Temperature (log scale, 1.00E-03 to 1.00E+06); y-axis: Cluster Count (0 to 60000). Annotations: “Start Sponge DAVS(2)”, “Add Close Cluster Check”, “Sponge Reaches final value”]

• All start with one cluster at far left
• T=1 special as measurement errors divided out
• DA2D counts clusters with 1 member as clusters. DAVS(2) does not


[Figure: #Clusters with Occupation Count (log scale, 1 to 100000) against Occupation Count (Peaks in Cluster, 0 to 140) for DAVS(1), DAVS(2), DA2D, Mclust and GoldenLandmark]

Histograms of number of peaks in clusters for 4 clustering methods and the landmark set. Note lowest bin is clusters with one member peak, i.e. unclustered singletons. For DAVS these are Sponge peaks.


Basic Statistics

• Error is mean squared deviation of points from center in each dimension – sensitive to cut in $\Delta_{2D}(x)$
• DAVS(3) produces most large clusters; Mclust fewest
• Mclust has many more clusters with just one member
• DA2D similar to DAVS(2) except has some (probably false) associations of points far from center to cluster

                           Number of Clusters with occupation count   Count > 1         Count > 30
                  Number                                              Scaled Error      Scaled Error
Charge  Method    Clusters  1 (Sponge)   2       >2      >30          m/z     RT        m/z     RT
2       DAVS(1)   73238     42815        10677   19746   1111         0.081   0.084     0.036   0.056
2       DAVS(2)   54055     19449        10824   23781   1257         0.191   0.205     0.079   0.100
2       DAVS(3)   41929     13708        7163    21058   1597         0.404   0.355     0.247   0.290
2       DA2D      48605     13030        9563    22545   1254         6.02    3.68      0.100   0.120
2       Mclust    84219     50689        14293   19237   917          0.048   0.060     0.021   0.041


• DAVS(2) and DA2D discover 1020 of 1023 Landmark peaks with modest error

Charge 2 Landmark Peaks, >3 Peaks in Cluster (at least 3 match)

           Number Landmark   Scaled Error      Δ2D(x) ≤ 0.7
Method     Clusters Found    m/Z     RT        # Landmark Peaks   # Non Landmark Peaks
DAVS(1)    1017              0.033   0.050     24650              6330
DAVS(2)    1020              0.034   0.052     24717              6716
DAVS(3)    1020              0.075   0.103     24239              8242
DA2D       1020              0.035   0.054     23691              6709
Mclust     1005              0.033   0.047     23394              4856
Landmark   1023              0.000   0.000     24855              0


Histograms of $\Delta_{2D}(x)$ for 4 different clustering methods and the landmark set, plus the expectation for a Gaussian distribution with standard deviations given as $\sigma(m/z)/3$ and $\sigma(RT)/3$ in the two directions. The “Landmark” distribution corresponds to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut offs at $\Delta_{2D}(x)$ = 1 and 4 respectively. Only clusters with more than 50 members are plotted.


Basic Equations
• N Number of Points and K Clusters
• NK Unknowns $\langle M_i(k) \rangle$ determined by
• $\varepsilon_i(k) = (X_i - Y(k))^2$
• $\langle M_i(k) \rangle = p(k) \exp(-\varepsilon_i(k)/T) \,/ \sum_{k'=1}^{K} p(k') \exp(-\varepsilon_i(k')/T)$
• $C(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle$ Number of points in Cluster k
• $Y(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle X_i \,/\, C(k)$
• $p(k) = C(k) / N$
• Iterate T = “∞” to 0.025
• $\langle M_i(k) \rangle$ is probability that point i in cluster k
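The Basic Equations can be combined into one annealing loop, sketched here in Python for 1D points. Several simplifications are assumptions made for brevity: only one EM pass is run per temperature (a real run would iterate to convergence), the run starts from two already-split centers instead of one, and the toy data, starting temperature and cooling factor 0.95 are hypothetical.

```python
import math

def da_iteration(X, Y, p, T):
    """One EM pass of the Basic Equations for 1D points X."""
    K, N = len(Y), len(X)
    C = [0.0] * K  # C(k) = sum_i <M_i(k)>
    S = [0.0] * K  # sum_i <M_i(k)> X_i
    for x in X:
        eps = [(x - y) ** 2 for y in Y]
        e_min = min(eps)  # subtract the minimum to avoid underflow
        w = [pk * math.exp(-(e - e_min) / T) for pk, e in zip(p, eps)]
        z = sum(w)
        for k in range(K):
            m = w[k] / z  # <M_i(k)>
            C[k] += m
            S[k] += m * x
    Y_new = [S[k] / C[k] for k in range(K)]
    p_new = [C[k] / N for k in range(K)]
    return Y_new, p_new

X = [0.0, 0.1, 0.2, 9.8, 10.0, 10.2]
Y, p = [4.0, 6.0], [0.5, 0.5]  # two centers already split (assumed)
T = 100.0
while T > 0.025:               # final temperature from the slide
    Y, p = da_iteration(X, Y, p, T)
    T *= 0.95                  # illustrative cooling factor
```

While T is above the critical temperature the two centers drift together toward the global mean; once T falls below it they separate and settle on the two groups of points.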


Simple Parallelism as in k-means
• Decompose points i over processors
• Equations are either pleasingly parallel “maps” over i
• Or “All-Reductions” summing over i for each cluster
• Parallel Algorithm:
– Each process holds all clusters and calculates contributions to clusters from points in node
– e.g. $Y(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle X_i \,/\, C(k)$
• Runs well in MPI or MapReduce
– See all the MapReduce k-means papers
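The map and all-reduce pattern can be imitated serially. In this Python sketch each "process" is just a list chunk and `all_reduce` is a stand-in for a real MPI all-reduce; no MPI library is used, and the function names and toy data are hypothetical.

```python
import math

def partial_sums(chunk, Y, T):
    """'Map' step: one process's contribution to C(k) and sum_i M_i(k) X_i."""
    K = len(Y)
    C, S = [0.0] * K, [0.0] * K
    for x in chunk:
        eps = [(x - y) ** 2 for y in Y]
        e_min = min(eps)
        w = [math.exp(-(e - e_min) / T) for e in eps]
        z = sum(w)
        for k in range(K):
            C[k] += w[k] / z
            S[k] += w[k] / z * x
    return C, S

def all_reduce(partials):
    """Stand-in for an MPI all-reduce: element-wise sum over processes."""
    K = len(partials[0][0])
    C = [sum(part[0][k] for part in partials) for k in range(K)]
    S = [sum(part[1][k] for part in partials) for k in range(K)]
    return C, S

X = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
Y = [1.0, 9.0]
chunks = [X[0::2], X[1::2]]  # points decomposed over 2 "processes"
C, S = all_reduce([partial_sums(ch, Y, T=1.0) for ch in chunks])
Y_new = [S[k] / C[k] for k in range(len(Y))]
```

Every process holds all centers Y, so the only communication is the reduction of 2K numbers per iteration, independent of the number of points.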


Better Parallelism
• The previous model is correct at start but each point does not really contribute to each cluster as damped exponentially by $\exp(-(X_i - Y(k))^2 / T)$
• For Proteomics problem, on average only 6.45 clusters needed per point if require $(X_i - Y(k))^2 / T \le$ ~40 (as exp(-40) small)
• So only need to keep nearby clusters for each point
• As average number of Clusters ~ 20,000, this gives a factor of ~3000 improvement
• Further communication is no longer all global; it has nearest neighbor components and is calculated by parallelism over clusters
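The cut described above can be sketched for 1D points as below. The threshold 40 is from the slide; the function names, the dictionary return type and the fallback to the single nearest cluster when nothing passes the cut are assumptions for illustration.

```python
import math

def nearby_clusters(x, Y, T, cut=40.0):
    """Indices k with (x - Y(k))^2 / T <= cut; other clusters are skipped."""
    return [k for k, y in enumerate(Y) if (x - y) ** 2 / T <= cut]

def sparse_memberships(x, Y, T, cut=40.0):
    """Return {k: <M(k)>} computed over nearby clusters only."""
    ks = nearby_clusters(x, Y, T, cut)
    if not ks:
        # Assumption: fall back to the single nearest cluster
        nearest = min(range(len(Y)), key=lambda k: (x - Y[k]) ** 2)
        return {nearest: 1.0}
    w = {k: math.exp(-(x - Y[k]) ** 2 / T) for k in ks}
    z = sum(w.values())
    return {k: wk / z for k, wk in w.items()}

m = sparse_memberships(0.4, [0.0, 1.0, 100.0], T=1.0)
```

With about 20,000 clusters and 6.45 retained per point, the work per point drops by roughly 20000 / 6.45 ≈ 3100, consistent with the factor of ~3000 quoted on the slide.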


Speedups for several runs on Tempest from 8-way through 384 way MPI parallelism with one thread per process. We look at different choices for MPI processes which are either inside nodes or on separate nodes


Online/Realtime Clustering

• Given an existing clustering, one can add new data in two ways

• Simplest is of course to interpolate new points to nearest existing cluster

• Better is to add new points and rerun full algorithm starting at T~1 where “convergence” is in range T=0.1 to 0.01
– Takes 20% to 30% of original execution time


Summary
• Deterministic Annealing provides quality results keeping us healthy and running in model DAVS(c) or unconstrained fashion DA2D
– User can choose trade offs given by cut off c
• Parallel version gives a fast automatic initial analysis of LC-MS peaks with no user input needed including no input on final number of clusters
• Little known “Continuous Clustering” useful
• Current open source code available but best wait till we finish conversion from C# to Java
• Parallel approach subtle as like particle in cell codes, have parallelism over clusters (cells) and/or points (particles)
– ? Useful different benchmark for compilers etc.
• Similar ideas relevant for other clustering and deterministic annealing fields such as non metric spaces, MDS


Extras


• Start at T = “∞” with 1 Cluster

• Decrease T, Clusters emerge at instabilities


Clusters v. Regions

• In Lymphocytes clusters are distinct
• In Pathology, clusters divide space into regions and sophisticated methods like deterministic annealing are probably unnecessary

[Figures: Pathology 54D; Lymphocytes 4D]


Protein Universe Browser for COG Sequences with a few illustrative biologically identified clusters



Proteomics 2D DA Clustering T=0.1, small sample of ~30,000 Clusters with Count >= 2

[Figure legend: Orange: sponge points (outliers not in cluster); Yellow triangles: Centers]


Remarks on Clustering and MDS
• The standard data libraries (R, Matlab, Mahout) do not have best algorithms/software in either functionality or scalable parallelism
• A lot of algorithms are built around “classic full matrix” kernels
• Clustering, Gaussian Mixture Models, PLSI (probabilistic latent semantic indexing), LDA (Latent Dirichlet Allocation) similar
• Multi-Dimensional Scaling (MDS) classic information visualization algorithm for high dimension spaces (map preserving distances)
• Vector O(N) and Non Vector semimetric $O(N^2)$ space cases for N points; “all” apps are points in spaces – not all “Proper linear spaces”
• Trying to release ~most powerful (in features/performance) available Clustering and MDS library although unfortunately in C#
• Supported Features: Vector, Non-Vector, Deterministic annealing, Hierarchical, sharp (trimmed) or general cluster sizes, Fixed points and general weights for MDS, (generalized Elkan's algorithm)


General Deterministic Annealing
• For some cases such as vector clustering and Mixture Models one can do integrals by hand but usually that will be impossible
• So introduce Hamiltonian $H_0(\varepsilon, \xi)$ which by choice of $\varepsilon$ can be made similar to real Hamiltonian $H_R(\xi)$ and which has tractable integrals
• $P_0(\xi) = \exp(-H_0(\xi)/T + F_0/T)$ approximate Gibbs for $H_R$
• $F_R(P_0) = \langle H_R - T S_0(P_0) \rangle|_0 = \langle H_R - H_0 \rangle|_0 + F_0(P_0)$
• Where $\langle \ldots \rangle|_0$ denotes $\int d\xi \, P_0(\xi)$
• Easy to show that real Free Energy satisfies (the Gibbs inequality) $F_R(P_R) \le F_R(P_0)$ (Kullback-Leibler divergence)
• Expectation step E is find $\varepsilon$ minimizing $F_R(P_0)$ and
• Follow with M step (of EM) setting $\xi = \langle \xi \rangle|_0 = \int d\xi \, \xi \, P_0(\xi)$ (mean field) and one follows with a traditional minimization of remaining parameters

Note 3 types of variables: $\varepsilon$ used to approximate real Hamiltonian; $\xi$ subject to annealing; the rest optimized by traditional methods


Implementation of DA-PWC
• Clustering variables are again $M_i(k)$ (these are $\xi$ in general approach) where this is probability point i belongs to cluster k
• Pairwise Clustering Hamiltonian given by nonlinear form
• $H_{PWC} = 0.5 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(i,j) \sum_{k=1}^{K} M_i(k) M_j(k) / C(k)$
• $\delta(i,j)$ is pairwise distance between points i and j
• with $C(k) = \sum_{i=1}^{N} M_i(k)$ as number of points in Cluster k
• Take same form $H_0 = \sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k) \, \varepsilon_i(k)$ as for central clustering
• $\varepsilon_i(k)$ determined to minimize $F_{PWC}(P_0) = \langle H_{PWC} - T S_0(P_0) \rangle|_0$ where integrals can be easily done
• And now linear (in $M_i(k)$) $H_0$ and quadratic $H_{PWC}$ are different
• Again $\langle M_i(k) \rangle = \exp(-\varepsilon_i(k)/T) \,/ \sum_{k'=1}^{K} \exp(-\varepsilon_i(k')/T)$


Some Ideas
• Deterministic annealing is better than many well-used optimization methods
• Started as “Elastic Net” by Durbin for Travelling Salesman Problem (TSP)
• Basic idea behind deterministic annealing is mean field approximation, which is also used in “Variational Bayes” and “Variational inference”
• Markov chain Monte Carlo (MCMC) methods are roughly single temperature simulated annealing
• Less sensitive to initial conditions
• Avoid local optima
• Not equivalent to trying random initial starts


Some Uses of Deterministic Annealing
• Clustering
– Vectors: Rose (Gurewitz and Fox)
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No Vectors: Hofmann and Buhmann (just use pairwise distances)
• Dimension Reduction for visualization and analysis
– Vectors: GTM (Generative Topographic Mapping)
– No Vectors: SMACOF Multidimensional Scaling (MDS) (just use pairwise distances)
• Can apply to HMM & general mixture models (less study)
– Gaussian Mixture Models
– Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA) as alternative to Latent Dirichlet Allocation for finding “hidden factors”


Histograms of $\Delta_{2D}(x)$ for 4 different clustering methods and the landmark set, plus the expectation for a Gaussian distribution with standard deviations given as 1/3 in the two directions. The “Landmark” distribution corresponds to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut offs at $\Delta_{2D}(x)$ = 1 and 4 respectively. Only clusters with more than 5 peaks are plotted.


Some Problems
• Analysis of Mass Spectrometry data to find peptides by clustering peaks (Broad Institute)
– ~0.5 million points in 2 dimensions (one experiment) -- ~50,000 clusters summed over charges
• Metagenomics – 0.5 million (increasing rapidly) points NOT in a vector space – hundreds of clusters per sample
• Pathology Images >50 Dimensions
• Social image analysis is in a highish dimension vector space
– 10-50 million images; 1000 features per image; million clusters
• Finding communities from network graphs coming from Social media contacts etc.
– No vector space; can be huge in all ways


Speedups for several runs on Madrid from sequential through 128 way parallelism, defined as the product of number of threads per process and number of MPI processes. We look at different choices for MPI processes which are either inside nodes or on separate nodes. For example 16-way parallelism shows 3 choices with thread count 1: 16 processes on one node (the fastest), 2 processes on 8 nodes and 8 processes on 2 nodes.


Parallelism within a Single Node of Madrid Cluster. A set of runs on 241605 peak data with a single node with 16 cores with either threads or MPI giving parallelism. Parallelism is either number threads or number of MPI processes.

Parallelism (#threads or #processes)


Proteomics 2D DA Clustering T=0.1, small sample of ~30,000 Clusters with Count >= 2

[Figure labels: Sponge Peaks; Centers]