
Deterministic Annealing
Indiana University CS Theory group, January 23, 2012
Geoffrey Fox [email protected]
Director, Digital Science Center, Pervasive Technology Institute
Associate Dean for Research and Graduate Studies, School of Informatics and Computing
Indiana University Bloomington





Abstract

We discuss the general theory behind deterministic annealing and illustrate it with applications to mixture models (including GTM and PLSA), clustering, and dimension reduction. We cover cases where the analyzed space has a metric and cases where it does not. We discuss the many open issues and possible further work for methods that appear to outperform the standard approaches but are in practice not used.

References

- Ken Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, 1998, 86: pp. 2210-2239. References earlier papers, including his Caltech Elec. Eng. PhD thesis (1990).
- T. Hofmann and J. M. Buhmann, "Pairwise data clustering by deterministic annealing," IEEE Transactions on Pattern Analysis and Machine Intelligence 19, pp. 1-13, 1997.
- Hansjörg Klock and Joachim M. Buhmann, "Data visualization by multidimensional scaling: a deterministic annealing approach," Pattern Recognition, Volume 33, Issue 4, April 2000, pp. 651-669.
- R. Frühwirth and W. Waltenberger, "Redescending M-estimators and Deterministic Annealing, with Applications to Robust Regression and Tail Index Estimation," Austrian Journal of Statistics 2008, 37(3&4): 301-317.
- Review algorithm work by Seung-Hee Bae and Jong Youl Choi (Indiana CS PhDs).

Some Goals

- We are building a library of parallel data mining tools with the best (known to me) robustness and performance characteristics.
- Big data needs super algorithms? Many statistics tools (e.g. in R) do not use the best algorithm and are not always well parallelized.
- Deterministic annealing (DA) is one of the better approaches to optimization: it tends to remove local optima, addresses overfitting, and is faster than simulated annealing.

Some Ideas I

- A return to my heritage (physics) with an approach I called Physical Computation (cf. also genetic algorithms): methods based on analogies to nature.
- Physical systems find the true lowest energy state if you anneal, i.e. you equilibrate at each temperature as you cool.
- Deterministic annealing performs better than many well-used optimization methods.
- It started as the Elastic Net, introduced by Durbin for the Travelling Salesman Problem (TSP).
- The basic idea behind deterministic annealing is the mean field approximation, which is also used in Variational Bayes and in many neural network approaches.
- Markov chain Monte Carlo (MCMC) methods are roughly single-temperature simulated annealing.

- It is less sensitive to initial conditions and avoids local optima; it is not equivalent to trying random initial starts.

Some non-DA Ideas II

- Dimension reduction gives low-dimensional mappings of data, both to visualize it and to apply geometric hashing.
- No-vector problems (where one cannot define a metric space) are O(N²).
- For the no-vector case, one can develop O(N) or O(N log N) methods, as in Fast Multipole and Oct-Tree methods.
- Map high-dimensional data to 3D and use classic methods originally developed to speed up O(N²) 3D particle dynamics problems.

Uses of Deterministic Annealing

- Clustering
  - Vectors: Rose (with Gurewitz and Fox)
  - Clusters with fixed sizes and no tails (Proteomics team at Broad)
  - No vectors: Hofmann and Buhmann (just use pairwise distances)
- Dimension reduction for visualization and analysis
  - Vectors: GTM
  - No vectors: MDS (just use pairwise distances)
- Can apply to general mixture models (but less studied)
  - Gaussian mixture models
  - Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA) as an alternative to Latent Dirichlet Allocation (the typical information retrieval/global inference topic model)

Deterministic Annealing I

- Gibbs distribution at temperature T:
  P(ξ) = exp(−H(ξ)/T) / ∫ dξ exp(−H(ξ)/T)
  or P(ξ) = exp(−H(ξ)/T + F/T)
- Minimize the free energy F, which combines the objective function and the entropy:
  F = ⟨H − T S(P)⟩ = ∫ dξ {P(ξ) H(ξ) + T P(ξ) ln P(ξ)}
- Here ξ are (a subset of) the parameters to be minimized.
- Simulated annealing corresponds to doing these integrals by Monte Carlo; deterministic annealing corresponds to doing them analytically (by the mean field approximation) and is naturally much faster than Monte Carlo.
- In each case the temperature is lowered slowly, say by a factor of 0.95 to 0.99 at each iteration.

Deterministic Annealing

- The minimum evolves as the temperature decreases.
- Movement at fixed temperature goes to a local minimum if not initialized correctly.
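The Gibbs distribution and slow cooling above can be illustrated with a minimal sketch (not from the talk; the two-minimum objective, grid, and cooling factor are my illustrative choices): at high T the entropy term spreads the probability over configurations, and cooling concentrates it near the global minimum.

```python
# Gibbs weights P(x) ~ exp(-H(x)/T) on a discretized 1D parameter grid.
# High T smooths a two-minimum objective; slow cooling (factor 0.95 per
# step, as on the slide) concentrates probability at the global minimum.
import math

def gibbs(H, T):
    """Normalized Gibbs distribution exp(-H/T)/Z on a discrete grid."""
    m = min(H)                                  # subtract the minimum for stability
    w = [math.exp(-(h - m) / T) for h in H]
    Z = sum(w)
    return [v / Z for v in w]

# Toy objective: shallow local minimum near x = -1, global minimum near x = +1.
xs = [i / 100.0 for i in range(-300, 301)]
H = [0.25 * (x * x - 1.0) ** 2 - 0.1 * x for x in xs]

P_hot = gibbs(H, 10.0)                          # high temperature: spread out
T = 10.0
while T > 0.01:
    T *= 0.95                                   # slow cooling schedule
P_cold = gibbs(H, T)                            # low temperature: concentrated
x_star = xs[max(range(len(P_cold)), key=P_cold.__getitem__)]
print(max(P_hot) < max(P_cold), abs(x_star - 1.05) < 0.05)  # → True True
```

The peak probability is much larger after cooling, and it sits at the deeper of the two minima rather than at whichever basin an iterate happened to start in.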

Solve linear equations for each temperature; nonlinear effects are mitigated by initializing with the solution at the previous, higher temperature. (Figure: free energy F({y}, T) versus configuration {y}.)

Deterministic Annealing II

- For some cases, such as vector clustering and mixture models, one can do the integrals by hand, but usually that is impossible.
- So introduce a Hamiltonian H0(ξ, θ) which, by choice of θ, can be made similar to the real Hamiltonian HR(ξ) and which has tractable integrals.
- P0(ξ) = exp(−H0(ξ)/T + F0/T) approximates the Gibbs distribution for HR.
- FR(P0) = ⟨HR − T S0(P0)⟩|0 = ⟨HR − H0⟩|0 + F0(P0), where ⟨…⟩|0 denotes ∫ dξ P0(ξ) ….
- It is easy to show that for the real free energy (the Gibbs inequality) FR(PR) ≤ FR(P0); the difference is a Kullback-Leibler divergence.
- The expectation step E finds the θ minimizing FR(P0); the M step (of EM) sets ⟨ξ⟩ = ⟨ξ⟩|0 = ∫ dξ ξ P0(ξ) (mean field), and one follows with a traditional minimization of the remaining parameters.

- Note three types of variables: θ, used to approximate the real Hamiltonian; ξ, subject to annealing; the rest, optimized by traditional methods.

Implementation of DA Central Clustering

- The clustering variables are Mi(k) (these are the θ annealed in the general approach), where Mi(k) is the probability that point i belongs to cluster k, with Σ_{k=1..K} Mi(k) = 1.
- In central or pairwise clustering, take H0 = Σ_{i=1..N} Σ_{k=1..K} Mi(k) εi(k); this linear form allows the DA integrals to be done analytically.
- Central clustering has εi(k) = (X(i) − Y(k))², with Mi(k) determined by the expectation step:
  HCentral = Σ_{i=1..N} Σ_{k=1..K} Mi(k) (X(i) − Y(k))²
- HCentral and H0 are identical.
- ⟨Mi(k)⟩ = exp(−εi(k)/T) / Σ_{k'=1..K} exp(−εi(k')/T)
- The centers Y(k) are determined in the M step.
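The E step (soft memberships from εi(k) = (X(i) − Y(k))²) and M step (membership-weighted means for the centers Y(k)) described above can be sketched in a few lines. This is a minimal 1D illustration, not the talk's production code; the data, initial centers, and cooling schedule are my choices.

```python
# Minimal sketch of DA central (vector) clustering in 1D.
# E step: M_i(k) = exp(-eps_i(k)/T) / sum_k' exp(-eps_i(k')/T)
# with eps_i(k) = (X(i) - Y(k))^2.
# M step: Y(k) = membership-weighted mean of the points.
import math

def e_step(X, Y, T):
    M = []
    for x in X:
        eps = [(x - y) ** 2 for y in Y]
        m0 = min(eps)                            # subtract min for stability
        w = [math.exp(-(e - m0) / T) for e in eps]
        Z = sum(w)
        M.append([v / Z for v in w])
    return M

def m_step(X, M, K):
    Y = []
    for k in range(K):
        num = sum(M[i][k] * X[i] for i in range(len(X)))
        den = sum(M[i][k] for i in range(len(X)))
        Y.append(num / den)
    return Y

X = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]     # two obvious 1D clusters
Y = [2.0, 3.0]                          # K = 2 slightly asymmetric centers
T = 10.0
while T > 0.001:
    for _ in range(5):                  # equilibrate at this temperature
        M = e_step(X, Y, T)
        Y = m_step(X, M, 2)
    T *= 0.9                            # slow cooling
print(sorted(round(y, 1) for y in Y))   # → [1.0, 5.0]
```

At high T both centers are pulled toward the overall mean; as T falls through the instability the small asymmetry grows and the centers separate onto the two true clusters.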

Implementation of DA-PWC

- The clustering variables are again Mi(k) (the θ of the general approach), where Mi(k) is the probability that point i belongs to cluster k.
- The pairwise clustering Hamiltonian is given by the nonlinear form
  HPWC = 0.5 Σ_{i=1..N} Σ_{j=1..N} δ(i, j) Σ_{k=1..K} Mi(k) Mj(k) / C(k)
  where δ(i, j) is the pairwise distance between points i and j, and C(k) = Σ_{i=1..N} Mi(k) is the number of points in cluster k.
- Take the same form H0 = Σ_{i=1..N} Σ_{k=1..K} Mi(k) εi(k) as for central clustering.
- εi(k) is determined to minimize FPWC(P0) = ⟨HPWC − T S0(P0)⟩|0, where the integrals can easily be done.
- Now the linear (in Mi(k)) H0 and the quadratic HPWC are different.
- Again ⟨Mi(k)⟩ = exp(−εi(k)/T) / Σ_{k'=1..K} exp(−εi(k')/T).
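A minimal sketch of the resulting update cycle, using only a pairwise distance matrix (the explicit A(k) and Bi(k) forms appear on the EM-steps slide later in the talk). The toy distance matrix, starting memberships, and cooling schedule are my illustrative choices, not the talk's code.

```python
# One DA pairwise-clustering (DA-PWC) mean-field sweep:
#   C(k)    = sum_i M_i(k)
#   A(k)    = -0.5 * sum_ij delta(i,j) M_i(k) M_j(k) / C(k)^2
#   B_i(k)  = sum_j delta(i,j) M_j(k) / C(k)
#   eps_i(k)= B_i(k) + A(k), then softmax at temperature T.
import math

def pwc_sweep(delta, M, T):
    N, K = len(M), len(M[0])
    C = [sum(M[i][k] for i in range(N)) for k in range(K)]
    A = [-0.5 * sum(delta[i][j] * M[i][k] * M[j][k]
                    for i in range(N) for j in range(N)) / C[k] ** 2
         for k in range(K)]
    B = [[sum(delta[i][j] * M[j][k] for j in range(N)) / C[k]
          for k in range(K)] for i in range(N)]
    eps = [[B[i][k] + A[k] for k in range(K)] for i in range(N)]
    newM = []
    for i in range(N):
        m0 = min(eps[i])                        # subtract min for stability
        w = [math.exp(-(e - m0) / T) for e in eps[i]]
        Z = sum(w)
        newM.append([v / Z for v in w])
    return newM

# Two groups: small distances within a group, large distances across.
delta = [[0, 1, 1, 9, 9, 9],
         [1, 0, 1, 9, 9, 9],
         [1, 1, 0, 9, 9, 9],
         [9, 9, 9, 0, 1, 1],
         [9, 9, 9, 1, 0, 1],
         [9, 9, 9, 1, 1, 0]]
# Slightly asymmetric start so the two clusters can separate.
M = [[0.6, 0.4] for _ in range(3)] + [[0.4, 0.6] for _ in range(3)]
T = 1.0
for _ in range(50):
    M = pwc_sweep(delta, M, T)
    T = max(T * 0.9, 0.01)                      # cool, but keep T > 0
print([max(range(2), key=m.__getitem__) for m in M])  # → [0, 0, 0, 1, 1, 1]
```

No coordinates are ever used, only δ(i, j), which is the point of the pairwise formulation.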

General Features of DA

- Deterministic annealing (DA) is related to Variational Inference or Variational Bayes methods.
- In many problems, decreasing temperature is classic multiscale: finer resolution (T is just a distance scale). We have factors like (X(i) − Y(k))² / T.
- In clustering, one then looks at the second derivative matrix of FR(P0) with respect to the cluster centers; as the temperature is lowered, this develops a negative eigenvalue corresponding to an instability. (Or have multiple clusters at each center and perturb them.)
- This is a phase transition: one splits the cluster into two and continues the EM iteration.
- One can start with just one cluster.


Rose, K., Gurewitz, E., and Fox, G.C., "Statistical mechanics and phase transitions in clustering," Physical Review Letters, 65(8):945-948, August 1990.

My #6 most cited article (402 cites, including 15 in 2011).

Start at T = ∞ with 1 cluster; decrease T, and clusters emerge at instabilities.

DA-PWC EM Steps (E is red, M is black)

k runs over clusters; i, j run over points.

1. A(k) = −0.5 Σ_{i=1..N} Σ_{j=1..N} δ(i, j) ⟨Mi(k)⟩ ⟨Mj(k)⟩ / C(k)²
2. Bi(k) = Σ_{j=1..N} δ(i, j) ⟨Mj(k)⟩ / C(k)
3. εi(k) = Bi(k) + A(k)
4. ⟨Mi(k)⟩ = exp(−εi(k)/T) / Σ_{k'=1..K} exp(−εi(k')/T)
5. C(k) = Σ_{i=1..N} ⟨Mi(k)⟩

Loop to converge the variables; decrease T from ∞. Parallelize by distributing points across processes: step 1 needs a global sum (reduction); steps 1, 2, 5 are local sums if the ⟨Mi(k)⟩ are broadcast.

Continuous Clustering I

- This is a subtlety introduced by Ken Rose but not clearly known in the community.
- Let us consider the dynamic appearance of clusters a little more carefully. Suppose we take a cluster k and split it into two, with centers Y(k)A and Y(k)B and initial values Y(k)A = Y(k)B at the original center Y(k).
- Then typically, if you make this change and perturb Y(k)A and Y(k)B, they will return to the starting position, as F is at a stable minimum.

- But an instability can develop, and one then finds two genuine minima. (Figure: free energy F plotted against Y(k)A and Y(k)B, against Y(k)A + Y(k)B, and against Y(k)A − Y(k)B.)

Continuous Clustering II

- At the phase transition, when the eigenvalue corresponding to Y(k)A − Y(k)B goes negative, F is a minimum if the two split clusters move together but a maximum if they separate; i.e., two genuine clusters are formed at instability points.
- When you split, A(k), Bi(k), and εi(k) are unchanged, and you would hope that the cluster counts C(k) and probabilities ⟨Mi(k)⟩ would be halved.
- Unfortunately that does not work, except for 1 cluster splitting into 2, due to the factor Zi = Σ_{k=1..K} exp(−εi(k)/T) with ⟨Mi(k)⟩ = exp(−εi(k)/T) / Zi.
- The naïve solution is to examine explicitly the solution with A(k0), Bi(k0), εi(k0) unchanged and C(k0), ⟨Mi(k0)⟩ halved.

Proteomics Mass Spectrometry

- Take ε(i, 0) = c²/2 for k = 0: this 0th cluster captures (at zero temperature) all points outside clusters (the background).
- Clusters are trimmed: (X(i) − Y(k))² / 2σ(k)² < c²/2.
- Another case when H0 is the same as the target Hamiltonian.
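The splitting behaviour can be seen in a toy 1D experiment (my illustration, not the talk's code): duplicate a center, perturb the two copies, and iterate the central-clustering E and M steps at a fixed temperature. Above the first critical temperature (T_c = 2 × variance for a single 1D cluster, from the Rose-Gurewitz-Fox analysis) the copies re-merge; below it they separate into genuine clusters.

```python
# A duplicated, perturbed center re-merges above the critical temperature
# and separates below it: the phase transition of DA clustering.
import math

def em_at_fixed_T(X, Y, T, sweeps=200):
    """Central-clustering E and M steps iterated at a fixed temperature."""
    for _ in range(sweeps):
        M = []
        for x in X:
            w = [math.exp(-((x - y) ** 2) / T) for y in Y]
            Z = sum(w)
            M.append([v / Z for v in w])
        Y = [sum(M[i][k] * X[i] for i in range(len(X))) /
             sum(M[i][k] for i in range(len(X))) for k in range(len(Y))]
    return Y

X = [-1.0, 1.0]                              # variance 1, so T_c = 2
results = []
for T in (3.0, 1.0):                         # above and below T_c
    Y = em_at_fixed_T(X, [-0.01, 0.01], T)   # one center, split and perturbed
    results.append(abs(Y[0] - Y[1]) > 0.5)   # did the copies separate?
print(results)                               # → [False, True]
```

This matches the free-energy picture: above T_c the Y(k)A − Y(k)B direction is stable and the perturbation decays; below T_c it is unstable and the split is real.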

(Figure: cluster membership cutoff versus distance from the cluster center, at T ≈ 0, T = 1, and T = 5.)

High Performance Dimension Reduction and Visualization

- The need is pervasive: large and high-dimensional data are everywhere (biology, physics, the Internet, ...), and visualization can help data analysis.
- Visualization of large datasets with high performance: map high-dimensional data into low dimensions (2D or 3D); parallel programming is needed to process large data sets.
- We are developing high performance dimension reduction algorithms:
  - MDS (Multidimensional Scaling)
  - GTM (Generative Topographic Mapping)
  - DA-MDS (Deterministic Annealing MDS)
  - DA-GTM (Deterministic Annealing GTM)
- Interactive visualization tool: PlotViz.
- With David Wild.

Multidimensional Scaling (MDS)

- Map points in high dimension to lower dimensions.
- There are many such dimension reduction algorithms (PCA, Principal Component Analysis, is the easiest); the simplest, but perhaps the best at times, is MDS.
- Minimize the Stress σ(X) = Σ_{i<j} weight(i, j) (δ(i, j) − d(Xi, Xj))², where δ(i, j) is the original dissimilarity between points i and j and d(Xi, Xj) is the distance between their low-dimensional images.
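As a concrete check of the definition, a minimal stress computation (weights taken as 1 here; the 3-4-5 example data are my illustration):

```python
# MDS stress of a candidate low-dimensional embedding, unit weights:
# stress = sum over i<j of (delta(i,j) - d(X_i, X_j))^2.
import math

def stress(X, delta):
    """delta: original pairwise distances; X: embedded coordinates."""
    s = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = math.dist(X[i], X[j])           # Euclidean distance in the embedding
            s += (delta[i][j] - d) ** 2
    return s

# Three points whose original distances form a 3-4-5 triangle:
delta = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
X_exact = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # reproduces them exactly
X_poor = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # a bad embedding
print(stress(X_exact, delta), stress(X_poor, delta) > 0)  # → 0.0 True
```

A perfect embedding has zero stress; DA-MDS anneals this objective in T rather than descending it directly, which is what removes the local optima.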