Https://portal.futuregrid.org Deterministic Annealing Indiana University CS Theory group January 23...

Author
noramclaughlin 
Category
Documents

view
212 
download
0
Embed Size (px)
Transcript of Https://portal.futuregrid.org Deterministic Annealing Indiana University CS Theory group January 23...
Slide 1
Deterministic AnnealingIndiana UniversityCS Theory groupJanuary 23 2012Geoffrey [email protected] http://www.infomall.org http://www.futuregrid.org
Director, Digital Science Center, Pervasive Technology InstituteAssociate Dean for Research and Graduate Studies, School of Informatics and ComputingIndiana University Bloomingtonhttps://portal.futuregrid.org 1AbstractWe discuss general theory behind deterministic annealing and illustrate with applications to mixture models (including GTM and PLSA), clustering and dimension reduction. We cover cases where the analyzed space has a metric and cases where it does not. We discuss the many open issues and possible further work for methods that appear to outperform the standard approaches but are in practice not used.2https://portal.futuregrid.org ReferencesKen Rose, Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems. Proceedings of the IEEE, 1998. 86: p. 22102239.References earlier papers including his Caltech Elec. Eng. PhD 1990T Hofmann, JM Buhmann, Pairwise data clustering by deterministic annealing, IEEE Transactions on Pattern Analysis and Machine Intelligence 19, pp113 1997.Hansjrg Klock and Joachim M. Buhmann, Data visualization by multidimensional scaling: a deterministic annealing approach, Pattern Recognition, Volume 33, Issue 4, April 2000, Pages 651669.Frhwirth R, Waltenberger W: Redescending Mestimators and Deterministic Annealing, with Applications to Robust Regression and Tail Index Estimation. http://www.stat.tugraz.at/AJS/ausg083+4/08306Fruehwirth.pdf Austrian Journal of Statistics 2008, 37(3&4):301317.Review http://grids.ucs.indiana.edu/ptliupages/publications/pdac24gfox.pdfRecent algorithm work by SeungHee Bae, Jong Youl Choi (Indiana CS PhDs)http://grids.ucs.indiana.edu/ptliupages/publications/CetraroWriteupJune1109.pdf http://grids.ucs.indiana.edu/ptliupages/publications/hpdc2010_submission_57.pdf
https://portal.futuregrid.org Some GoalsWe are building a library of parallel data mining tools that have best known (to me) robustness and performance characteristicsBig data needs super algorithms?A lot of statistics tools (e.g. in R) are not the best algorithm and not always well parallelizedDeterministic annealing (DA) is one of better approaches to optimizationTends to remove local optimaAddresses overfittingFaster than simulated annealing
Return to my heritage (physics) with an approach I called Physical Computation (cf. also genetic algs)  methods based on analogies to naturePhysics systems find true lowest energy state if you anneal i.e. you equilibrate at each temperature as you cool
https://portal.futuregrid.org Some Ideas IDeterministic annealing is better than many wellused optimization problemsStarted as Elastic Net by Durbin for Travelling Salesman Problem TSPBasic idea behind deterministic annealing is mean field approximation, which is also used in Variational Bayes and many neural network approachesMarkov chain Monte Carlo (MCMC) methods are roughly single temperature simulated annealing
Less sensitive to initial conditions Avoid local optima Not equivalent to trying random initial startshttps://portal.futuregrid.org 5Some nonDA Ideas IIDimension reduction gives Low dimension mappings of data to both visualize and apply geometric hashingNovector (cant define metric space) problems are O(N2) For novector case, one can develop O(N) or O(NlogN) methods as in Fast Multipole and OctTree methodsMap high dimensional data to 3D and use classic methods developed originally to speed up O(N2) 3D particle dynamics problems
https://portal.futuregrid.org 6Uses of Deterministic AnnealingClusteringVectors: Rose (Gurewitz and Fox) Clusters with fixed sizes and no tails (Proteomics team at Broad)No Vectors: Hofmann and Buhmann (Just use pairwise distances)Dimension Reduction for visualization and analysis Vectors: GTMNo vectors: MDS (Just use pairwise distances)Can apply to general mixture models (but less study)Gaussian Mixture ModelsProbabilistic Latent Semantic Analysis with Deterministic Annealing DAPLSA as alternative to Latent Dirichlet Allocation (typical informational retrieval/global inference topic model)https://portal.futuregrid.org 7Deterministic Annealing IGibbs Distribution at Temperature TP() = exp(  H()/T) / d exp(  H()/T)Or P() = exp(  H()/T + F/T ) Minimize Free Energy combining Objective Function and EntropyF = < H  T S(P) > = d {P()H + T P() lnP()}Where are (a subset of) parameters to be minimizedSimulated annealing corresponds to doing these integrals by Monte CarloDeterministic annealing corresponds to doing integrals analytically (by mean field approximation) and is naturally much faster than Monte CarloIn each case temperature is lowered slowly say by a factor 0.95 to 0.99 at each iterationhttps://portal.futuregrid.org 8DeterministicAnnealing Minimum evolving as temperature decreases Movement at fixed temperature going to local minima if not initialized correctly
Solve Linear Equations for each temperature
Nonlinear effects mitigated by initializing with solution at previous higher temperatureF({y}, T)Configuration {y}https://portal.futuregrid.org Deterministic Annealing IIFor some cases such as vector clustering and Mixture Models one can do integrals by hand but usually will be impossibleSo introduce Hamiltonian H0(, ) which by choice of can be made similar to real Hamiltonian HR() and which has tractable integralsP0() = exp(  H0()/T + F0/T ) approximate Gibbs for HRFR (P0) = < HR  T S0(P0) >0 = < HR H0> 0 + F0(P0)Where 0 denotes d Po()Easy to show that real Free Energy (the Gibbs inequality)FR (PR) FR (P0) (KullbackLeibler divergence)Expectation step E is find minimizing FR (P0) and Follow with M step (of EM) setting = 0 = d Po() (mean field) and one follows with a traditional minimization of remaining parameters
10Note 3 types of variables used to approximate real Hamiltonian subject to annealingThe rest optimized by traditional methodshttps://portal.futuregrid.org 10Implementation of DA Central ClusteringClustering variables are Mi(k) (these are annealed in general approach) where this is probability point i belongs to cluster k and k=1K Mi(k) = 1In Central or PW Clustering, take H0 = i=1N k=1K Mi(k) i(k)Linear form allows DA integrals to be done analyticallyCentral clustering has i(k) = (X(i) Y(k))2 and Mi(k) determined by Expectation stepHCentral = i=1N k=1K Mi(k) (X(i) Y(k))2 Hcentral and H0 are identical = exp( i(k)/T ) / k=1K exp( i(k)/T )Centers Y(k) are determined in M step
11https://portal.futuregrid.org 11Implementation of DAPWCClustering variables are again Mi(k) (these are in general approach) where this is probability point i belongs to cluster kPairwise Clustering Hamiltonian given by nonlinear formHPWC = 0.5 i=1N j=1N (i, j) k=1K Mi(k) Mj(k) / C(k) (i, j) is pairwise distance between points i and jwith C(k) = i=1N Mi(k) as number of points in Cluster kTake same form H0 = i=1N k=1K Mi(k) i(k) as for central clusteringi(k) determined to minimize FPWC (P0) = < HPWC  T S0(P0) >0 where integrals can be easily doneAnd now linear (in Mi(k)) H0 and quadratic HPC are differentAgain = exp( i(k)/T ) / k=1K exp( i(k)/T )
12https://portal.futuregrid.org 12General Features of DADeterministic Annealing DA is related to Variational Inference or Variational Bayes methodsIn many problems, decreasing temperature is classic multiscale finer resolution (T is just distance scale)We have factors like (X(i) Y(k))2 / TIn clustering, one then looks at second derivative matrix of FR (P0) wrt and as temperature is lowered this develops negative eigenvalue corresponding to instabilityOr have multiple clusters at each center and perturbThis is a phase transition and one splits cluster into two and continues EM iterationOne can start with just one cluster
13https://portal.futuregrid.org 1314
Rose, K., Gurewitz, E., and Fox, G.C. ``Statistical mechanics and phase transitions in clustering,'' Physical Review Letters, 65(8):945948, August 1990.
My #6 most cited article (402 cites including 15 in 2011)https://portal.futuregrid.org 1415
Start at T= with 1 ClusterDecrease T, Clusters emerge at instabilitieshttps://portal.futuregrid.org 16
https://portal.futuregrid.org 17
https://portal.futuregrid.org A(k) =  0.5 i=1N j=1N (i, j) / 2Bi(k) = j=1N (i, j) / i(k) = (Bi(k) + A(k)) = exp( i(k)/T )/k=1K exp(i(k)/T)C(k) = i=1N Loop to converge variables; decrease T from DAPWC EM Steps (E is red, M Black)k runs over clusters; i,j points18Parallelize by distributing points across processesSteps 1 global sum (reduction)Step 1, 2, 5 local sum if broadcasti pointsk clustershttps://portal.futuregrid.org 18Continuous Clustering IThis is a subtlety introduced by Ken Rose but not clearly known in communityLets consider dynamic appearance of clusters a little more carefully. We suppose that we take a cluster k and split into 2 with centers Y(k)A and Y(k)B with initial values Y(k)A = Y(k)B at original center Y(k)Then typically if you make this change and perturb the Y(k)A Y(k)B, they will return to starting position as F at stable minimum
But instability can develop and one finds19Free Energy FY(k)A and Y(k)BY(k)A + Y(k)BFree Energy FFree Energy FY(k)A  Y(k)Bhttps://portal.futuregrid.org Continuous Clustering IIAt phase transition when eigenvalue corresponding to Y(k)A  Y(k)B goes negative, F is a minimum if two split clusters move together but a maximum if they separatei.e. two genuine clusters are formed at instability pointsWhen you split A(k) , Bi(k), i(k) are unchanged and you would hope that cluster counts C(k) and probabilities would be halvedUnfortunately that doesnt work except for 1 cluster splitting into 2 due to factor Zi = k=1K exp(i(k)/T) with = exp( i(k)/T )/ Zi Nave solution is to examine explicitly solution with A(k0) , Bi(k0), i(k0) are unchanged; C(k0), halved for 0 0f(i,0) = c2 / 2 k = 0The 0th cluster captures (at zero temperature) all points outside clusters (background)Clusters are trimmed (X(i)  Y(k))2/2(k)2 < c2 / 2 Another case when H0 is same as target HamiltonianProteomics Mass Spectrometry
T ~ 0T = 1T = 5Distance from cluster centerhttps://portal.futuregrid.org High Performance Dimension Reduction and VisualizationNeed is pervasiveLarge and high dimensional data are everywhere: biology, physics, Internet, Visualization can help data analysis Visualization of large datasets with high performanceMap highdimensional data into low dimensions (2D or 3D).Need Parallel programming for processing large data setsDeveloping high performance dimension reduction algorithms: MDS(Multidimensional Scaling)GTM(Generative Topographic Mapping)DAMDS(Deterministic Annealing MDS) DAGTM(Deterministic Annealing GTM) Interactive visualization tool PlotVizhttps://portal.futuregrid.org With David Wild
25Multidimensional Scaling MDSMap points in high dimension to lower dimensionsMany such dimension reduction algorithms (PCA Principal component analysis easiest); simplest but perhaps best at times is MDSMinimize Stress (X) = i