 Slide 1
Parallel Deterministic Annealing Clustering and its Application to LC-MS Data Analysis
October 7 2013, IEEE International Conference on Big Data 2013 (IEEE BigData 2013), Santa Clara CA
Geoffrey Fox, D. R. Mani, Saumyadipta Pyne
gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Indiana University Bloomington
https://portal.futuregrid.org

Slide 2
Trimmed Clustering
Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis (Rudolf Frühwirth, D R Mani and Saumyadipta Pyne), BMC Bioinformatics 2011, 12:358
$H_{TCC} = \sum_{k=0}^{K} \sum_{i=1}^{N} M_i(k)\, f(i,k)$
$f(i,k) = (X(i) - Y(k))^2 / (2\sigma(k)^2)$ for $k > 0$
$f(i,0) = c^2/2$ for $k = 0$
The 0th cluster captures (at zero temperature) all points outside clusters (background)
Clusters are trimmed: $(X(i) - Y(k))^2 / (2\sigma(k)^2) < c^2/2$
Relevant when errors are well defined
[Figure: curves for T ~ 0, T = 1 and T = 5; horizontal axis: distance from cluster center]

Slide 9
Key Features of Proteomics
2-dimensional data arising from a list of peaks specified by points (m/Z, RT), where m/Z is the mass-to-charge ratio and RT is the retention time for the peak represented by a peptide.
Measurement errors $\delta(m/Z) = 5.98 \times 10^{-6}\, m/Z$ and $\delta(RT) = 2.35$
Ratio of errors is drastically different from ratio of dimensions of (m/Z, RT) space, which could distort the high temperature limit; solve by annealing $\delta(m/Z)$
$\Delta_{2D}(x) = \left( (m/Z|_{\text{cluster center}} - m/Z|_x) / \delta(m/Z) \right)^2 + \left( (RT|_{\text{cluster center}} - RT|_x) / \delta(RT) \right)^2$ is the model

Slide 10
General Features of DA
In many problems, decreasing temperature is classic multiscale finer resolution (T is just distance scale)
We have factors like $(X(i) - Y(k))^2 / T$
In clustering, one then looks at the second derivative matrix (can derive analytically) of the Free Energy with respect to each cluster position; as temperature is lowered this develops a negative eigenvalue
Or have multiple clusters at each center and perturb
Tradeoff depends on problem: in high dimension it takes time to find the eigenvector; we use eigenvectors here as the problem is 2D
This is a phase transition: one splits the cluster into two and continues EM iteration until the desired resolution is reached
One can start with just one cluster

Slide 11
System becomes unstable as temperature is lowered; there is a phase transition and one splits the cluster into two and continues EM iteration
One can start with just one cluster and need NOT specify the desired number of clusters; rather specify cluster resolution
Rose, K., Gurewitz, E., and Fox, G. C. "Statistical mechanics and phase transitions in clustering," Physical Review Letters, 65(8):945-948, August 1990.
My #6 most cited article (456 cites including 16 in 2013)

Slide 12
Proteomics 2D DA Clustering, T = 25000 with 60 clusters (will be 30,000 at T = 0.025)

Slide 13
The brownish triangles are sponge peaks outside any cluster.
The colored hexagons are peaks inside clusters, with the white hexagons being the determined cluster centers
[Figure: fragment of 30,000 clusters, 241605 points]

Slide 14
Continuous Clustering
This is a very useful subtlety introduced by Ken Rose; it is not widely known although it greatly improves the algorithm
Take a cluster k to be split into 2 with centers $Y(k)_A$ and $Y(k)_B$, with initial values $Y(k)_A = Y(k)_B$ at the original center $Y(k)$
Then typically if you make this change and perturb $Y(k)_A$ and $Y(k)_B$, they will return to the starting position as F is at a stable minimum (positive eigenvalue)
But instability (the negative eigenvalue) can develop and one finds [Figure: Free Energy F plotted against $Y(k)_A + Y(k)_B$ and against $Y(k)_A - Y(k)_B$]
Implement by adding an arbitrary number p(k) of centers for each cluster
$Z_i = \sum_{k=1}^{K} p(k) \exp(-\epsilon_i(k)/T)$ and the M step gives $p(k) = C(k)/N$
Halve p(k) at splits; can't split easily in the standard case p(k) = 1
Weighting in sums like $Z_i$ is now equipoint, not equicluster, as p(k) is proportional to the number of points C(k) in the cluster

Slide 15
Deterministic Annealing for Proteomics

Slide 16
Proteomics Clustering Methods
DAVS(c) is Parallel Trimmed DA clustering with clusters satisfying $\Delta_{2D}(x) \le c^2$; c annealed from a large value to the given value at T ~ 2
DA2D scales m/Z and RT so clusters are circular but does NOT trim them; there are no sponge points; there are clusters with 1 point
All start with 1 cluster center and use Continuous Clustering
Anneal $\delta(m/Z)$ from ~1 at $T = \infty$ to $5.98 \times 10^{-6}\, m/Z$ at T ~ 10
Anneal c from T = 40 to 2 (only introduce trimming at low temperatures)
Mclust uses standard model-based (non deterministic annealing) clustering
Landmarks are a collection of reference peaks (obtained by identifying a subset of peaks using MS/MS peptide sequencing).

Slide 17
Cluster Count v.
Temperature for 2 Runs
All start with one cluster at far left
T = 1 is special as measurement errors are divided out
DA2D counts clusters with 1 member as clusters; DAVS(2) does not

Slide 18
Histograms of number of peaks in clusters for 4 clustering methods and the landmark set. Note lowest bin is clusters with one member peak, i.e. unclustered singletons. For DAVS these are Sponge peaks.

Slide 19
Basic Statistics
Error is mean squared deviation of points from center in each dimension; it is sensitive to the cut in $\Delta_{2D}(x)$
DAVS(3) produces most large clusters; Mclust fewest
Mclust has many more clusters with just one member
DA2D is similar to DAVS(2) except it has some (probably false) associations of points far from center to cluster

Charge  Method   Number of  Clusters with occupation count     Scaled error, count > 1  Scaled error, count > 30
                 clusters   1 (Sponge)  2      >2     >30      m/z     RT               m/z     RT
2       DAVS(1)  73238      42815       10677  19746  1111     0.081   0.084            0.036   0.056
2       DAVS(2)  54055      19449       10824  23781  1257     0.191   0.205            0.079   0.100
2       DAVS(3)  41929      13708       7163   21058  1597     0.404   0.355            0.247   0.290
2       DA2D     48605      13030       9563   22545  1254     6.02    3.68             0.100   0.120
2       Mclust   84219      50689       14293  19237  917      0.048   0.060            0.021   0.041

Slide 20
DAVS(2) and DA2D discover 1020 of 1023 Landmark peaks with modest error

Slide 21
Histograms of $\Delta_{2D}(x)$ for 4 different clustering methods and the landmark set, plus expectation for a Gaussian distribution with standard deviations given as $\delta(m/z)/3$ and $\delta(RT)/3$ in the two directions. The Landmark distribution corresponds to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cutoffs at $\Delta_{2D}(x) = 1$ and 4 respectively.
Only clusters with more than 50 members are plotted

Slide 22
Basic Equations
N points and K clusters
NK unknowns $\langle M_i(k) \rangle$ determined by
$\epsilon_i(k) = (X_i - Y(k))^2$
$\langle M_i(k) \rangle = p(k) \exp(-\epsilon_i(k)/T) \,/\, \sum_{k'=1}^{K} p(k') \exp(-\epsilon_i(k')/T)$
$C(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle$, the number of points in cluster k
$Y(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle X_i / C(k)$
$p(k) = C(k)/N$
Iterate from $T = \infty$ to 0.025
$\langle M_i(k) \rangle$ is the probability that point i is in cluster k

Slide 23
Simple Parallelism as in k-means
Decompose points i over processors
Equations are either pleasingly parallel "maps" over i
Or "All-Reductions" summing over i for each cluster
Parallel Algorithm: each process holds all clusters and calculates contributions to clusters from points in node, e.g. $Y(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle X_i / C(k)$
Runs well in MPI or MapReduce
See all the MapReduce k-means papers

Slide 24
Better Parallelism
The previous model is correct at start, but each point does not really contribute to each cluster as contributions are damped exponentially by $\exp(-(X_i - Y(k))^2/T)$
For the Proteomics problem, on average only 6.45 clusters are needed per point if we require $(X_i - Y(k))^2/T \le$ ~40 (as $\exp(-40)$ is small)
So only need to keep nearby clusters for each point
As the average number of clusters is ~20,000, this gives a factor of ~3000 improvement
Further, communication is no longer all global; it has nearest neighbor components and is calculated by parallelism over clusters

Slide 25
Speedups for several runs on Tempest from 8-way through 384-way MPI parallelism with one thread per process.
We look at different choices for MPI processes, which are either inside nodes or on separate nodes

Slide 26
Online/Realtime Clustering
Given an existing clustering, one can add new data in two ways
Simplest is of course to interpolate new points to the nearest existing cluster
Better is to add new points and rerun the full algorithm starting at T ~ 1, where convergence is in the range T = 0.1 to 0.01
Takes 20% to 30% of the original execution time

Slide 27
Summary
Deterministic Annealing provides quality results keeping us healthy and running in model DAVS(c) or unconstrained fashion DA2D
User can choose trade-offs given by cutoff c
Parallel version gives a fast automatic initial analysis of LC-MS peaks with no user input needed, including no input on final number of clusters
Little known Continuous Clustering is useful
Current open source code is available, but best wait till we finish conversion from C# to Java
Parallel approach is subtle as, like particle-in-cell codes, we have parallelism over clusters (cells) and/or points (particles)?
Useful different benchmark for compilers etc.
Similar ideas are relevant for other clustering and deterministic annealing fields such as non-metric spaces, MDS

Slide 28
Extras

Slide 29
Start at $T = \infty$ with 1 cluster
Decrease T; clusters emerge at instabilities

Slide 30

Slide 31

Slide 32
Clusters v.
Regions
In Lymphocytes, clusters are distinct
In Pathology, clusters divide space into regions and sophisticated methods like deterministic annealing are probably unnecessary
[Figure panels: Pathology 54D; Lymphocytes 4D]

Slide 33
Protein Universe Browser for COG Sequences with a few illustrative biologically identified clusters

Slide 34
Proteomics 2D DA Clustering, T = 0.1, small sample of ~30,000 clusters with count >= 2
[Figure legend: orange points are sponge points (outliers not in a cluster); yellow triangles are centers]

Slide 35
Remarks on Clustering and MDS
The standard data libraries (R, Matlab, Mahout) do not have the best algorithms/software in either functionality or scalable parallelism
A lot of algorithms are built around classic full matrix kernels
Clustering, Gaussian Mixture Models, PLSI (probabilistic latent semantic indexing), LDA (Latent Dirichlet Allocation) are similar
Multi-Dimensional Scaling (MDS) is a classic information visualization algorithm for high dimension spaces (map preserving distances)
Vector O(N) and Non-Vector semimetric $O(N^2)$ space cases for N points; all apps are points in spaces, not all proper linear spaces
Trying to release ~most powerful (in features/performance) available Clustering and MDS library, although unfortunately in C#
Supported Features: Vector, Non-Vector, Deterministic annealing, Hierarchical, sharp (trimmed) or general cluster sizes, Fixed points and general weights for MDS, (generalized Elkan's algorithm)

Slide 36
General Deterministic Annealing
For some cases such as vector clustering and Mixture Models one can do integrals by hand, but usually that will be impossible
So introduce Hamiltonian $H_0(\epsilon, \phi)$ which by choice of $\epsilon$ can be made similar to the real Hamiltonian $H_R(\phi)$ and which has tractable integrals
$P_0(\phi) = \exp(-H_0(\phi)/T + F_0/T)$ is the approximate Gibbs distribution for $H_R$
$F_R(P_0) = \langle H_R - T S_0(P_0) \rangle|_0 = \langle H_R - H_0 \rangle|_0 + F_0(P_0)$
where $\langle \ldots \rangle|_0$ denotes $\int d\phi\, P_0(\phi)$
Easy to show that real Free
Energy (the Gibbs inequality) $F_R(P_R) \le F_R(P_0)$ (Kullback-Leibler divergence)
Expectation step E is to find $\epsilon$ minimizing $F_R(P_0)$
Follow with M step (of EM) setting $\phi = \langle \phi \rangle|_0 = \int d\phi\, \phi\, P_0(\phi)$ (mean field), and one follows with a traditional minimization of remaining parameters
Note 3 types of variables: $\epsilon$ used to approximate the real Hamiltonian; $\phi$ subject to annealing; the rest optimized by traditional methods

Slide 37
Implementation of DA-PWC
Clustering variables are again $M_i(k)$ (these are $\phi$ in the general approach), where this is the probability that point i belongs to cluster k
Pairwise Clustering Hamiltonian given by nonlinear form
$H_{PWC} = 0.5 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(i,j) \sum_{k=1}^{K} M_i(k) M_j(k) / C(k)$
$\delta(i,j)$ is the pairwise distance between points i and j, with $C(k) = \sum_{i=1}^{N} M_i(k)$ as the number of points in cluster k
Take same form $H_0 = \sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k)\, \epsilon_i(k)$ as for central clustering
$\epsilon_i(k)$ determined to minimize $F_{PWC}(P_0) = \langle H_{PWC} - T S_0(P_0) \rangle|_0$, where integrals can be easily done
And now linear (in $M_i(k)$) $H_0$ and quadratic $H_{PWC}$ are different
Again $\langle M_i(k) \rangle = \exp(-\epsilon_i(k)/T) \,/\, \sum_{k'=1}^{K} \exp(-\epsilon_i(k')/T)$

Slide 38
Some Ideas
Deterministic annealing is better than many well-used optimization methods
Started as Elastic Net by Durbin for the Travelling Salesman Problem (TSP)
Basic idea behind deterministic annealing is the mean field approximation, which is also used in Variational Bayes and Variational inference
Markov chain Monte Carlo (MCMC) methods are roughly single temperature simulated annealing
Less sensitive to initial conditions
Avoid local optima
Not equivalent to trying random initial starts

Slide 39
Some Uses of Deterministic Annealing
Clustering
Vectors: Rose (Gurewitz and Fox)
Clusters with fixed sizes and no tails (Proteomics team at Broad)
No Vectors: Hofmann and Buhmann (just use pairwise distances)
Dimension Reduction for visualization and analysis
Vectors: GTM Generative Topographic Mapping
No Vectors: SMACOF Multidimensional Scaling
MDS (just use pairwise distances)
Can apply to HMM and general mixture models (less study)
Gaussian Mixture Models
Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA) as alternative to Latent Dirichlet Allocation for finding hidden factors

Slide 40
Histograms of $\Delta_{2D}(x)$ for 4 different clustering methods and the landmark set, plus expectation for a Gaussian distribution with standard deviations given as 1/3 in the two directions. The Landmark distribution corresponds to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cutoffs at $\Delta_{2D}(x) = 1$ and 4 respectively. Only clusters with more than 5 peaks are plotted

Slide 41
Some Problems
Analysis of Mass Spectrometry data to find peptides by clustering peaks (Broad Institute): ~0.5 million points in 2 dimensions (one experiment); ~50,000 clusters summed over charges
Metagenomics: 0.5 million (increasing rapidly) points NOT in a vector space; hundreds of clusters per sample
Pathology Images: >50 dimensions
Social image analysis is in a highish dimension vector space: 10-50 million images; 1000 features per image; million clusters
Finding communities from network graphs coming from social media contacts etc.: no vector space; can be huge in all ways

Slide 42
Speedups for several runs on Madrid from sequential through 128-way parallelism, defined as the product of number of threads per process and number of MPI processes. We look at different choices for MPI processes, which are either inside nodes or on separate nodes. For example, 16-way parallelism shows 3 choices with thread count 1: 16 processes on one node (the fastest), 2 processes on 8 nodes, and 8 processes on 2 nodes

Slide 43
Parallelism within a Single Node of Madrid Cluster. A set of runs on 241605 peak data with a single node with 16 cores, with either threads or MPI giving parallelism.
Parallelism is either number of threads or number of MPI processes.
[Figure axis: Parallelism (#threads or #processes)]

Slide 44
Proteomics 2D DA Clustering, T = 0.1, small sample of ~30,000 clusters with count >= 2
[Figure legend: sponge peaks; centers]
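
As a rough illustration of the Basic Equations (Slide 22), the Python sketch below iterates the membership, center and weight updates while the temperature is lowered. It is not the authors' C#/Java code and makes several simplifying assumptions: the number of clusters K is fixed rather than grown by splitting at phase transitions, the trimmed (sponge) cluster is omitted, and the names `da_em_step` and `anneal`, the cooling factor, the jitter size and the toy data are all invented for this example.

```python
import numpy as np

def da_em_step(X, Y, p, T):
    """One EM update at temperature T (hypothetical helper, not from the talk)."""
    # eps[i, k] = squared distance from point X[i] to center Y[k]
    eps = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # Shift each row by its minimum before exponentiating; the shift cancels
    # in the normalized ratio and avoids underflow at low temperature.
    w = p[None, :] * np.exp(-(eps - eps.min(axis=1, keepdims=True)) / T)
    M = w / w.sum(axis=1, keepdims=True)      # <M_i(k)>, each row sums to 1
    C = np.maximum(M.sum(axis=0), 1e-12)      # C(k), floored to avoid 0/0
    Y_new = (M.T @ X) / C[:, None]            # Y(k) = sum_i <M_i(k)> X_i / C(k)
    p_new = C / X.shape[0]                    # p(k) = C(k) / N
    return M, Y_new, p_new

def anneal(X, K=2, T_start=200.0, T_end=0.5, cooling=0.9, iters=30, seed=0):
    """Fixed-K annealing loop; schedule and jitter size are assumptions."""
    rng = np.random.default_rng(seed)
    Y = np.tile(X.mean(axis=0), (K, 1))       # all centers start at the data mean
    p = np.full(K, 1.0 / K)
    T = T_start
    while True:
        # Small perturbation so coincident centers can separate once T
        # drops below a phase transition.
        Y = Y + 1e-3 * rng.standard_normal(Y.shape)
        for _ in range(iters):
            M, Y, p = da_em_step(X, Y, p, T)
        if T <= T_end:
            return M, Y, p
        T = max(T * cooling, T_end)

# Toy data: two well-separated 2D blobs (illustrative only).
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal([-5.0, 0.0], 0.3, size=(100, 2)),
               data_rng.normal([5.0, 0.0], 0.3, size=(100, 2))])
M, Y, p = anneal(X)
```

At high temperature both centers sit at the data mean; once T falls below roughly twice the largest variance of the data, the perturbed centers move apart and settle near the two blob means. The production approach described in the slides instead starts from one cluster and adds centers at each instability.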