
Structure Preserving Embedding

Blake Shaw blake@cs.columbia.edu    Tony Jebara jebara@cs.columbia.edu

Department of Computer Science, Columbia University, 1214 Amsterdam Ave, New York, NY 10027

Abstract

Structure Preserving Embedding (SPE) is an algorithm for embedding graphs in Euclidean space such that the embedding is low-dimensional and preserves the global topological properties of the input graph. Topology is preserved if a connectivity algorithm, such as k-nearest neighbors, can easily recover the edges of the input graph from only the coordinates of the nodes after embedding. SPE is formulated as a semidefinite program that learns a low-rank kernel matrix constrained by a set of linear inequalities which captures the connectivity structure of the input graph. Traditional graph embedding algorithms do not preserve structure according to our definition, and thus the resulting visualizations can be misleading or less informative. SPE provides significant improvements in terms of visualization and lossless compression of graphs, outperforming popular methods such as spectral embedding and Laplacian eigenmaps. We find that many classical graphs and networks can be properly embedded using only a few dimensions. Furthermore, introducing structure preserving constraints into dimensionality reduction algorithms produces more accurate representations of high-dimensional data.

1. Introduction

Graphs are essential for encoding information, and graph and network data is increasingly abundant in fields ranging from computational biology to computer vision. Given a graph where vertices and edges represent pairwise interactions between entities (such as links between websites, friendships in social networks, or bonds between atoms in molecules), we hope to recover a low-dimensional set of coordinates for each vertex that implicitly encodes the graph's binary connectivity.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Graph embedding algorithms place nodes at points on some surface (e.g. Euclidean space) and connect points with an arc if the nodes have an edge between them. One traditional objective of graph embedding is to place points on a surface such that arcs never cross. In this setting, it is well known that any graph can be embedded in 3-dimensional Euclidean space and planar graphs can be embedded in 2-dimensional Euclidean space. However, there are many uses for graph embedding that do not relate to arc crossing and thus there exists a suite of embedding algorithms with different goals (Chung, 1997; Battista et al., 1999). One motivation for embedding graphs is to solve computationally hard problems geometrically. For example, using hyperplanes to separate points after graph embedding is useful for efficiently approximating the NP-hard sparsest cut problem (Arora et al., 2004). This article will focus on graph embedding for visualization and compression. Given only the connectivity of a graph, can we efficiently recover low-dimensional point coordinates for each node such that these points can easily be used to reconstruct the original structure of the network? We are interested in reversible graph embeddings (from a graph to points back to a graph). If the graph can be easily reconstructed from the points and these require low dimensionality (and storage), the method can be useful for both visualization and compression.

Figure 1. Embedding the classical Möbius ladder graph. Given the adjacency matrix (left), the visualizations produced by spectral embedding and spring embedding (middle) do not accurately capture the graph topology. The SPE embedding is compact and topologically correct.

Many embedding algorithms find compact coordinates for nodes in a graph. Given an unweighted graph consisting of N nodes and |E| edges represented as a symmetric adjacency matrix A ∈ {0, 1}^{N×N} specifying which pairs of nodes are connected, spectral embedding finds a set of coordinates for each node y_i ∈ R^d for i = 1, ..., N by applying singular value decomposition (SVD) or principal component analysis (PCA) to the adjacency matrix A and using the d eigenvectors of A with the largest eigenvalues as the coordinates. Similarly, Laplacian eigenmaps (Belkin & Niyogi, 2002) employ a spectral decomposition of the graph Laplacian L = D − A, or the normalized graph Laplacian L = I − D^{−1/2} A D^{−1/2} where D = diag(A1), and use the d eigenvectors of L with the smallest nonzero eigenvalues. Unfortunately, in practice there are often many eligible eigenvectors of A or L, and furthermore the resulting coordinates do not preserve the topology of the input graph exactly. We propose learning a positive semidefinite kernel matrix K ∈ R^{N×N} whose spectral decomposition yields a small set of eigenvectors which preserve the topology of the input graph. Specifically, given a connectivity algorithm G (such as k-nearest neighbors, b-matching, or maximum weight spanning tree) which accepts as input a kernel K specifying an embedding and returns an adjacency matrix, we call an embedding structure preserving if the application of G to K exactly reproduces the input graph: G(K) = A. This article proposes SPE, an efficient convex optimization based on semidefinite programming for finding an embedding K such that K is both low-rank and structure preserving.
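For concreteness, a minimal sketch (our own, assuming only NumPy; not code from the paper) of how the spectral embedding and Laplacian eigenmaps coordinates described above can be computed from an adjacency matrix:

```python
import numpy as np

def spectral_embedding(A, d=2):
    """Use the d eigenvectors of A with the largest eigenvalues as coordinates."""
    vals, vecs = np.linalg.eigh(A)            # A is symmetric
    idx = np.argsort(vals)[::-1][:d]          # indices of the top-d eigenvalues
    return vecs[:, idx]

def laplacian_eigenmaps(A, d=2, normalized=True):
    """Use the d eigenvectors of L with the smallest nonzero eigenvalues."""
    deg = A.sum(axis=1)
    if normalized:
        Dinv = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
        L = np.eye(len(A)) - Dinv @ A @ Dinv  # L = I - D^{-1/2} A D^{-1/2}
    else:
        L = np.diag(deg) - A                  # L = D - A
    vals, vecs = np.linalg.eigh(L)
    idx = np.argsort(vals)[1:d + 1]           # skip the zero eigenvalue (connected graph assumed)
    return vecs[:, idx]
```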

Traditional graph embedding algorithms such as spectral embedding and spring embedding do not explicitly preserve structure according to our definition and thus in practice produce poor visualizations of many simple classical graphs. In Figure 1 we see the classical Möbius ladder graph and the resulting visualizations from the two methods. The spectral embedding looks degenerate and does not resemble the Möbius band in any regard. The eigenspectrum indicates that the embedding is 6-dimensional when we expect to be able to embed this graph using fewer dimensions. Two spring embeddings are shown. The left spring embedding is a good diagram of what the graph should look like; however, we see that the twist of the Möbius strip is not accurately captured. Given the coordinates in Euclidean space produced by this method, any simple neighbor-finding algorithm G would connect nodes along the red dotted lines, not the blue ones specified by the connectivity matrix. Thus, the inherent connectivity of the embedding disagrees with the actual connectivity of the graph. The spring embedding on the right shows a typical result when a poor random initialization is used. Due to local minima in spring embedding algorithms, the graph is no longer visually recognizable or accurate. We are motivated to find a simple tool for properly visualizing graphs such as the Möbius ladder, as well as large complex network datasets. The tool should be accurate and should circumvent local minima issues.

Structure preserving constraints can also benefit dimensionality reduction algorithms. These methods similarly find compact coordinates that preserve certain properties of the input data. Multidimensional scaling preserves distances between data points (Cox & Cox, 1994). Nonlinear manifold learning algorithms preserve local distances described by a graph on the data (Tenenbaum et al., 2000; Roweis & Saul, 2000; Weinberger et al., 2005). For these algorithms the input consists of high-dimensional points as well as binary connectivity. Many of these manifold learning techniques preserve local distances but not graph topology. We show that adding explicit topological constraints to these existing algorithms is crucial for preventing the folding and collapsing problems that occur in dimensionality reduction.

The rest of the article is organized as follows. In Section 2, we introduce the concept of structure preserving constraints and formulate these constraints as a set of linear inequalities for a variety of different connectivity algorithms. We then derive an objective function in Section 3 which favors low-dimensional embeddings close to the spectral embedding solution. In Section 4, we combine the structure preserving constraints and the objective function into a convex optimization and propose a semidefinite program for solving it efficiently. We present a variety of experiments on real and synthetic graphs in Section 5 and show improvements in terms of visualization quality and level of compression. We then briefly explore using SPE to improve dimensionality reduction algorithms before concluding in Section 7.


2. Preserving Graph Structure

We assume we are given an input graph defined by both a connectivity matrix A as well as an algorithm G which accepts as input a kernel matrix K and outputs a connectivity matrix à = G(K), such that à is close to the original graph A. In this article, we consider several choices for G including k-nearest neighbors, epsilon-balls, maximum weight spanning trees, maximum weight generalized matching, and other maximum weight subgraphs. We can evaluate how well the embedding produced by K preserves graph structure by determining how much the input connectivity differs from the connectivity computed directly from the learned embedding. A simple metric to capture this difference is the normalized number of pairwise errors Δ(Ã, A) = (1/N²) Σ_ij |Ãij − Aij|. For a variety of different classes of graphs and algorithms G, it is possible to enumerate linear constraints on K to ensure that this error or difference is zero. The next subsections describe several choices for the algorithm G and the linear constraints they entail on the embedding K.

2.1. Nearest Neighbor Graphs

The k-nearest neighbor algorithm (knn) greedily connects each node to the k neighbors to which the node has shortest distance, where k is an input parameter. Preserving the structure of knn graphs requires enumerating a small set of linear constraints on K.

Definition 1. Define the distance between a pair of points (i, j) with respect to a given positive semidefinite kernel matrix K as Dij = Kii + Kjj − 2Kij.

Note that the matrix D constructed from elements Dij is a linear function of K. For each node, the distances to all other nodes to which it is not connected must be larger than the distance to the furthest connected neighbor of that node. This results in the linear constraints on K: Dij > (1 − Aij) max_m(Aim Dim). Another common connectivity scheme is to connect each node to all neighbors which lie within an epsilon-ball of radius ε. Preserving this ε-neighborhood graph can be achieved with the linear constraints on K: Dij(Aij − 1/2) ≤ ε(Aij − 1/2).

It is trivial to show that if for each node the connected distances are less than the unconnected distances (or some ε), a greedy algorithm will find the exact same neighbors for that node, and thus the connectivity computed from K is exactly A, and Δ(G(K), A) = 0.
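As a concrete check of this condition, here is a small sketch (our own helper functions, assuming NumPy; not code from the paper) that recovers the k-nearest neighbor graph implied by a kernel K and measures the error Δ against the input graph A:

```python
import numpy as np

def distances_from_kernel(K):
    """D_ij = K_ii + K_jj - 2 K_ij (Definition 1), a linear function of K."""
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2 * K

def knn_graph(K, k):
    """Greedy G(K): connect each node to its k nearest neighbors under D."""
    D = distances_from_kernel(K)
    np.fill_diagonal(D, np.inf)                 # a node is never its own neighbor
    N = len(K)
    A_rec = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(D[i])[:k]:
            A_rec[i, j] = A_rec[j, i] = 1       # symmetrize the recovered graph
    return A_rec

def structure_error(A_rec, A):
    """Normalized number of pairwise errors between two adjacency matrices."""
    return np.abs(A_rec - A).sum() / (len(A) ** 2)

# An embedding K is structure preserving when structure_error(knn_graph(K, k), A) == 0.
```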

2.2. Maximum Weight Subgraphs

Maximum weight subgraph algorithms select edges from a weighted graph to produce a subgraph which has maximal weight (Fremuth-Paeger & Jungnickel, 1999).

Definition 2. Given a kernel matrix K, define the weight between two points (i, j) as the negated pairwise distance between them: Wij = −Dij = −Kii − Kjj + 2Kij.

Once again, the matrix W composed of elements Wij is simply a linear function of K. Given W, a maximum weight subgraph algorithm produces a graph with binary adjacency matrix à which maximizes Σ_ij Ãij Wij. For example, when G is the generalized b-matching algorithm, G finds the connectivity matrix which has maximum weight while enforcing a set of degree constraints bi for i = 1, ..., N as follows: G(K) = arg max_à Σ_ij Wij Ãij s.t. Σ_j Ãij = bi, Ãij = Ãji, Ãii = 0, Ãij ∈ {0, 1}. Similarly, a maximum weight spanning tree algorithm G returns the maximum weight subgraph G(K) such that G(K) ∈ T, where T is the set of all tree graphs: G(K) = arg max_à Σ_ij Wij Ãij s.t. à ∈ T. Unfortunately, for both of these algorithms the linear constraints on K (or W) that preserve the structure in the original adjacency matrix A, such that à = A, cannot be enumerated with a small finite set of linear inequalities; in fact, there can be an exponential number of constraints of the form Σ_ij Wij Aij ≥ Σ_ij Wij Ãij. However, in Section 4 we present a cutting plane approach such that the exponential enumeration is avoided and the most violated inequalities are introduced sequentially. It has been shown that cutting-plane optimizations such as this converge and perform well in practice (Finley & Joachims, 2008).
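To make the connectivity algorithm G concrete, here is a small sketch (our own, assuming NumPy and SciPy and that embedded points are distinct so distances are positive; the paper's implementation is in MATLAB) of a maximum weight spanning tree used as G(K), together with the weight-based check that the input graph is preserved:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def weights_from_kernel(K):
    """W_ij = -D_ij = -K_ii - K_jj + 2 K_ij (Definition 2), linear in K."""
    d = np.diag(K)
    return -(d[:, None] + d[None, :]) + 2 * K

def mst_connectivity(K):
    """G(K): spanning tree of maximum total weight under W.
    SciPy finds a *minimum* spanning tree, so we negate W (i.e., use distances D)."""
    W = weights_from_kernel(K)
    T = minimum_spanning_tree(-W).toarray()
    A_tree = (T != 0).astype(float)
    return np.maximum(A_tree, A_tree.T)          # symmetric adjacency matrix

def preserves_tree_structure(K, A, tol=1e-9):
    """A is preserved when no other tree has larger weight: tr(WA) >= tr(W A~)."""
    W = weights_from_kernel(K)
    A_tilde = mst_connectivity(K)
    return np.sum(W * A) >= np.sum(W * A_tilde) - tol
```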

3. A Low-Rank Objective

Figure 2. Classical graphs embedded with spectral embedding (above), and SPE w/ kNN (below). Eigenspectra are shown to the right. SPE finds a small number of dimensions that highlight many of the symmetries of these graphs. (Graphs shown: Tesseract, Celmins-Swart snark, Möbius ladder, Balaban 10-cage.)

The previous section showed that it is possible to force an embedding to preserve the graph structure in a given adjacency matrix A by introducing a set of linear constraints on the embedding Gram matrix K. To choose a unique K from the admissible set in the convex hull generated by these linear constraints, we propose an objective function which favors low-dimensional embeddings close to the spectral embedding solution. Spectral embedding is the canonical way of mapping nodes of a graph into points with coordinates. Given a symmetric adjacency matrix A, it uses the eigenvectors of A = VΛVᵀ with the largest k eigenvalues as coordinates of the points. Spectral embedding produces embeddings with few dimensions, since it uses the most dominant eigenvectors first. Similarly, we are interested in recovering an embedding, or equivalently, a positive semidefinite kernel matrix K ⪰ 0, which is low-rank. Consider the objective function max_{K⪰0} tr(KA), and limit the trace norm of K to keep the objective function from growing unboundedly. We claim that this objective function attempts to recover a low-rank version of spectral embedding.

Lemma 1. The objective function max_{K⪰0} tr(KA) subject to tr(K) ≤ 1 recovers a low-rank version of spectral embedding.

Proof. Rewrite the matrices in terms of the eigendecomposition of the positive semidefinite matrix K = UΛUᵀ and the symmetric matrix A = VΛ̃Vᵀ, and insert into the objective function:

max_{K⪰0} tr(KA) = max_{Λ∈L, U∈O} tr(UΛUᵀVΛ̃Vᵀ)

where L is the set of positive semidefinite diagonal matrices and O is the set of orthonormal matrices, also known as the Stiefel manifold. By Von Neumann's lemma, we have:

max_{U∈O} tr(UΛUᵀVΛ̃Vᵀ) = λᵀλ̃.

Here λ is a vector containing the diagonal entries of Λ in decreasing order λ1 ≥ λ2 ≥ ... ≥ λn, and λ̃ is a vector containing the diagonal entries of Λ̃ in decreasing order, i.e. λ̃1 ≥ λ̃2 ≥ ... ≥ λ̃n. Therefore, the full optimization problem can be rewritten in terms of an optimization over non-negative eigenvalues

max_{K⪰0, tr(K)≤1} tr(KA) = max_{λ≥0, λᵀ1≤1} λᵀλ̃

where we also have to satisfy λᵀ1 ≤ 1 since tr(K) ≤ 1. To maximize the objective function λᵀλ̃ we simply set λ1 = 1 and the remaining λi = 0 for i = 2, ..., n, which produces the maximum λ̃1, the top eigenvalue of the spectral embedding:

max_{K⪰0, tr(K)≤1} tr(KA) = λ̃1.

Thus, the maximization problem reproduces spectral embedding while pushing all the spectrum weight into the top eigenvalue. The (rank 1) solution must be K = vvᵀ where v is the leading eigenvector of A. If there are ties in A for its top eigenvalues, K is a conic combination of the top eigenvectors of A, which is a low-rank solution and has rank at most equal to the multiplicity of the top eigenvalue of A.

In summary, this objective function will attempt to mimic the traditional spectral embedding of a graph. By combining this objective with the linear constraints from the previous section, it will be possible to also correct the embedding such that the graph structure in A is preserved and can be reconstructed from the embedding by using a connectivity algorithm G.
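A quick numerical sanity check of Lemma 1 (our own illustration, assuming NumPy) is to compare tr(KA) for the rank-1 solution K = vvᵀ against random feasible kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric 0/1 adjacency matrix on N nodes.
N = 8
U = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(U, 1)
A = A + A.T

vals, vecs = np.linalg.eigh(A)
v = vecs[:, -1]                         # leading eigenvector of A
K_star = np.outer(v, v)                 # rank-1 candidate with tr(K_star) = 1

print(np.trace(K_star @ A))             # equals vals[-1], the top eigenvalue of A

# Any other feasible K (PSD with trace at most 1) scores no higher.
for _ in range(1000):
    B = rng.standard_normal((N, N))
    K = B @ B.T
    K /= np.trace(K)                    # normalize so tr(K) = 1
    assert np.trace(K @ A) <= vals[-1] + 1e-9
```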

4. Algorithm

In this section, the convex objective function is combined with the linear constraints implied by the connectivity algorithm G, be it knn, maximum weight b-matching, or a maximum weight spanning tree. All share the same objective function and some common constraints K = {K ⪰ 0, tr(K) ≤ 1, Σ_ij Kij = 0, ξ ≥ 0} to limit the trace norm of K and to center K such that the embeddings are centered on the origin. The objective function becomes max_{K∈K} tr(KA) − Cξ. Note that the function involves an additional term Cξ which uses a manually specified parameter C. This allows some violations of the constraints on K given by the choice of algorithm G, as shown in Figure 3. All linear inequalities arising from structure preserving constraints are slackened with a large C weight to allow a few violations if necessary. The use of a finite C ensures that a solution to the SDP is always possible, helps avoid numerical problems, and encourages faster convergence of cutting plane methods, as discussed in (Finley & Joachims, 2008).

For graphs created from greedy algorithms such as k-nearest neighbors, Table 1 outlines the corresponding SPE procedure.


Figure 3. By adjusting the input parameter C, SPE is able to handle noisy graphs. From left to right, we see a perfect ring graph embedded by SPE, a noisy line added to the graph at random, and then the results of using SPE on the noisy graph with C roughly set to 1000, 5, 2, and 1. Note that when C is small, SPE reproduces the rank-1 spectral embedding.

Table 1. Structure Preserving Embedding algorithm for k-nearest neighbor constraints.

Input: A ∈ B^{N×N}, connectivity algorithm G, and parameter C.
Step 1: Solve the SDP K̃ = arg max_{K∈K} tr(KA) − Cξ s.t. Dij > (1 − Aij) max_m(Aim Dim) − ξ.
Step 2: Apply SVD to K̃ and use the top eigenvectors as embedding coordinates.
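The following is a minimal sketch of Table 1 (our own, assuming cvxpy with an SDP-capable solver such as SCS; the authors' implementation uses MATLAB with CSDP and SDP-LR). The max_m constraint is expanded into one linear inequality per (non-edge, edge) pair, so this is only practical for small graphs:

```python
import numpy as np
import cvxpy as cp

def spe_knn(A, C=100.0):
    """Structure Preserving Embedding with k-nearest-neighbor constraints (Table 1)."""
    N = A.shape[0]
    K = cp.Variable((N, N), PSD=True)        # embedding Gram matrix
    xi = cp.Variable(nonneg=True)            # shared slack variable

    def dist(i, j):                          # D_ij = K_ii + K_jj - 2 K_ij
        return K[i, i] + K[j, j] - 2 * K[i, j]

    cons = [cp.trace(K) <= 1, cp.sum(K) == 0]    # trace bound and centering
    # Structure preserving constraints: each non-edge (i, j) must be farther from i
    # than every edge (i, m), up to the slack xi.
    for i in range(N):
        for j in range(N):
            if i != j and A[i, j] == 0:
                for m in range(N):
                    if A[i, m] == 1:
                        cons.append(dist(i, j) >= dist(i, m) - xi)

    prob = cp.Problem(cp.Maximize(cp.trace(K @ A) - C * xi), cons)
    prob.solve()

    # Step 2: spectral decomposition of K gives the embedding coordinates.
    vals, vecs = np.linalg.eigh(K.value)
    order = np.argsort(vals)[::-1]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```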

For maximum weight subgraph constraints, there often exists an exponential number of constraints of the form:

tr(WA) − tr(WÃ) ≥ Δ(Ã, A) − ξ   ∀ à ∈ G

where Δ(Ã, A) = (1/N²) Σ_ij |Ãij − Aij|, and à ∈ G states that à is in the family of graphs formed by a connectivity algorithm G, such as b-matchings or trees. To avoid enumerating the exponential number of constraints, we start by running the optimization without any structure preserving constraints and then add the most violated constraint at each iteration. Given a learned kernel K from the previous iteration, we find the most violated constraint by computing the connectivity à that maximizes tr(WÃ) s.t. à ∈ G using a maximum weight subgraph method. We then add the constraint tr(WA) − tr(WÃ) ≥ Δ(Ã, A) − ξ to our optimization. The first iteration yields a rank-1 solution, which typically violates many constraints, but after several iterations the algorithm converges when |tr(WA) − tr(WÃ)| ≤ ε, where ε is an input parameter. Table 2 summarizes the cutting-plane version of SPE.

SPE is implemented in MATLAB as a semidefinite program using CSDP and SDP-LR (Burer & Monteiro, 2003) and has complexity similar to other dimensionality reduction SDPs such as Semidefinite Embedding. The complexity is O(N³ + C³) (Weinberger et al., 2005), where C denotes the number of constraints (for the k-nearest neighbor constraints we typically have C ∝ |E|). However, in practice many constraints are inactive and working set methods (now common practice for SDPs) perform well. Furthermore, SDP-LR directly exploits the low-rank properties of our objective function. We have run SPE on graphs with thousands of nodes and tens of thousands of edges.

Table 2. Structure Preserving Embedding algorithm with cutting-plane constraints.

Input: A ∈ B^{N×N}, connectivity algorithm G, and parameters C, ε.
Step 1: Solve the SDP K̃ = arg max_{K∈K} tr(KA) − Cξ.
Step 2: Use G and K̃ to find the biggest violator à = arg max_à tr(WÃ).
Step 3: If |tr(WA) − tr(WÃ)| > ε, add the constraint tr(WA) − tr(WÃ) ≥ Δ(Ã, A) − ξ and go to Step 1.
Step 4: Apply SVD to K̃ and use the top eigenvectors as embedding coordinates.

For the cutting plane formulation, we add constraints iteratively, and it can be shown that the algorithm will converge in polynomial time for quadratic programs with linear constraints (Finley & Joachims, 2008). Since we have a semidefinite program, these guarantees do not carry over immediately, although in practice the cutting plane algorithm works well and has also been successfully deployed in settings beyond structured prediction and quadratic programming (Sontag & Jaakkola, 2008).
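A compact sketch of the cutting-plane loop in Table 2 (our own scaffolding, assuming NumPy; solve_sdp and connectivity are hypothetical callbacks standing in for the SDP solver of Section 4 and the chosen algorithm G, e.g. the spanning tree routine sketched in Section 2.2):

```python
import numpy as np

def spe_cutting_plane(A, solve_sdp, connectivity, eps=1e-4, max_iter=50):
    """solve_sdp(constraints) -> K maximizing tr(KA) - C*xi under the constraint list;
    connectivity(W) -> the maximum weight subgraph A~ in the family G for weights W."""
    constraints = []                                 # list of (A_tilde, Delta) pairs
    K = solve_sdp(constraints)                       # first solve: rank-1 solution
    for _ in range(max_iter):
        d = np.diag(K)
        W = -(d[:, None] + d[None, :]) + 2 * K       # W_ij = -D_ij (Definition 2)
        A_tilde = connectivity(W)                    # Step 2: biggest violator
        if abs(np.sum(W * A) - np.sum(W * A_tilde)) <= eps:
            break                                    # Step 3: converged
        delta = np.abs(A_tilde - A).sum() / A.size   # Delta(A_tilde, A)
        constraints.append((A_tilde, delta))         # add the violated constraint
        K = solve_sdp(constraints)                   # Step 1 again with the new constraint
    return K                                         # Step 4: SVD of K gives coordinates
```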

5. Experiments

Figure 4. Two comparisons of molecule embeddings (top row and bottom). The SPE w/ kNN embedding (right) more closely resembles the true physical embedding of the molecule (left), despite being given only connectivity information. (Molecules shown: TR012 and TR015.)

We present visualization results on a variety of synthetic and real-world datasets, highlighting the improvements of SPE over purely spectral methods. Figure 2 shows a variety of classical graphs visualized by spectral embedding and SPE. Note that spectral embedding typically finds many eigenvectors with dominant eigenvalues, and thus needs many more coordinates for accurate visualization, as compared to SPE which finds compact and accurate embeddings. Figure 4 shows an embedding of two organic compounds. The true physical embedding in 3D space is shown on the left. Given only connectivity information, SPE is able to produce coordinates for each atom that better resemble the true physical coordinates. Figure 5 shows a visualization of 981 political blogs (Adamic & Glance, 2005). The eigenspectrum shown next to each embedding reveals that both spectral embedding and Laplacian eigenmaps find many possible dimensions for visualization, whereas SPE requires far fewer dimensions to accurately represent the structure of the data. Also, note that the embedding that uses the graph Laplacian is dominated by the degree distribution of the network. Nodes with high degree require neighbors to be very far away. The SPE embedding was created using b-matching constraints and represents the network quite compactly, finding 2 dominant dimensions which produce a beautiful visualization of the link structure as well as yield the lowest error on the adjacency matrix created from the low-dimensional embedding.

6. Dimensionality Reduction

SPE is not explicitly intended as a dimensionality reduction tool, but rather a tool for embedding graphs in few dimensions. However, these two goals are closely related; many nonlinear dimensionality reduction algorithms, such as Locally Linear Embedding (LLE), Maximum Variance Unfolding (MVU), and Minimum Volume Embedding (MVE), begin by finding a sparse connectivity matrix A that describes local pairwise distances (Roweis & Saul, 2000; Weinberger et al., 2005; Shaw & Jebara, 2007). These methods assume that the data lies on a low-dimensional nonlinear manifold embedded in a high-dimensional ambient space. To recover the manifold, these methods preserve local distances along edges specified by A in the hope that they mimic true geodesic distances measured along the manifold, as opposed to distances measured through arbitrary directions in high-dimensional space. These pairwise distances and the connectivity matrix describe a weighted graph. However, preserving distances does not explicitly preserve the structure of this graph. Algorithms such as LLE, MVU, and MVE produce embeddings whose resulting connectivity no longer matches the inherent connectivity of the data. The key problem arises from unconnected nodes. The distances between nodes that are connected are preserved; however, distances between unconnected nodes are free to vary, and thus can drastically change the graph's topology. Manifold unfolding algorithms may actually collapse parts of the manifold onto each other inadvertently and break the original connectivity structure, as illustrated in the toy problem in Figure 6. Therein, the distances are preserved yet MVU folds the graph and moves points in such a way that they are closer to other points that were originally not neighbors. By incorporating SPE's constraints, these algorithms can globally preserve the input graph exactly, not just its local properties.

Figure 6. An example of maximum variance unfolding (MVU) folding the input graph (left: original; right: MVU). Because MVU does not explicitly preserve structure, MVU collapses the diamond shape in the middle to a single point. Explicit structure preserving constraints would keep the original solution.

Figure 5. The link structure of 981 political blogs. Conservative blogs are labeled red, and liberal blogs blue. Also shown is the % error of the adjacency matrix created from the 2D embedding, and the resulting eigenspectrum from each method (reconstruction errors in the figure: 2.854%, 2.971%, and 9.281%). Not shown here is un-normalized Laplacian eigenmaps, whose embedding is similar to the normalized case and whose reconstruction error is 55.775%. Note that SPE is able to achieve the lowest error rate, representing the data using only 2 dimensions.

We present an adaptation of MVU called MVU+SP, which simply adds the kNN structure preserving constraints to the MVU semidefinite program. Similarly, we present an adaptation of the MVE algorithm called MVE+SP, which adds the kNN structure preserving constraints to the MVE semidefinite program (Shaw & Jebara, 2007). Exactly preserving a k-nearest neighbor graph on the data during embedding means a k-nearest neighbor classifier will perform equally well on the recovered low-dimensional embedding K as it did on the original high-dimensional dataset. Because structure is not explicitly preserved with many existing dimensionality reduction techniques, the neighborhood of each point can drastically change, thus potentially reducing the accuracy of the knn classifier. These errors indicate that the manifold has been folded onto itself during dimensionality reduction. Table 3 shows that MVE+SP offers a significant advantage over other methods in terms of the accuracy of a 1-nearest neighbor classifier on the resulting 2D embeddings.
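To make the MVU+SP idea concrete, here is a sketch (our own, assuming cvxpy; not the authors' code, and the shared slack term on the structure preserving constraints is our simplification) of an MVU-style semidefinite program with the kNN structure preserving constraints of Section 2.1 added:

```python
import numpy as np
import cvxpy as cp

def mvu_sp(X, A, C=100.0):
    """X: high-dimensional points (N x p); A: kNN adjacency matrix built from X."""
    N = X.shape[0]
    K = cp.Variable((N, N), PSD=True)
    xi = cp.Variable(nonneg=True)

    def dist(i, j):                       # D_ij = K_ii + K_jj - 2 K_ij
        return K[i, i] + K[j, j] - 2 * K[i, j]

    cons = [cp.sum(K) == 0]               # center the embedding
    for i in range(N):
        for j in range(i + 1, N):
            if A[i, j] == 1:              # MVU part: preserve local distances on edges
                cons.append(dist(i, j) == float(np.sum((X[i] - X[j]) ** 2)))
    for i in range(N):                    # +SP part: preserve the kNN graph itself
        for j in range(N):
            if i != j and A[i, j] == 0:
                for m in range(N):
                    if A[i, m] == 1:
                        cons.append(dist(i, j) >= dist(i, m) - xi)

    prob = cp.Problem(cp.Maximize(cp.trace(K) - C * xi), cons)   # maximize variance
    prob.solve()
    vals, vecs = np.linalg.eigh(K.value)
    order = np.argsort(vals)[::-1]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```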

The results in Table 3 were obtained by sampling a variety of small graphs from UCI datasets (Asuncion & Newman, 2007) by randomly selecting 120 nodes from the two largest classes of each dataset. The data was then embedded with the variety of algorithms above and the resulting performance of a 1NN classifier was measured. The data was split 60/20/20 into training/cross-validation/testing groups. The accuracies shown in Table 3 are the results of averaging 50 of these folds. Cross-validation was used to find the optimal value of k for each algorithm and dataset, where k specifies the number of nearest neighbors each node connects to. The All-Dimensions column on the right shows the corresponding accuracy of a 1-nearest-neighbor classifier using all of the features of the data. MVE+SP performs better than all other low-dimensional methods, and for the datasets Ionosphere, Ecoli, and OptDigits, MVE+SP is more accurate than using all the features of the data even though MVE+SP uses only 2 features per datapoint. It appears that the manifold recovered by MVE+SP denoises the data, placing the key modes of variation in the top few dimensions, while disregarding other noise that reduces accuracy in the all-dimensions case.
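A simplified sketch of this evaluation protocol (our own, assuming scikit-learn; embed_2d is a hypothetical stand-in for any of the embedding methods in Table 3, and the cross-validation over k is omitted here for brevity):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def one_nn_accuracy(coords, labels, n_folds=50, test_size=0.2, seed=0):
    """Average 1-nearest-neighbor test accuracy over random splits."""
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_folds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            coords, labels, test_size=test_size, random_state=rng.randint(10**6))
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
        accs.append(clf.score(X_te, y_te))
    return float(np.mean(accs))

# Usage: compare a 2D embedding against all original features.
# acc_2d  = one_nn_accuracy(embed_2d(X), y)   # e.g., MVE+SP coordinates
# acc_all = one_nn_accuracy(X, y)             # the "All-Dimensions" column
```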

Clearly MVE+SP is not intended to compete with supervised algorithms for classification since it operates agnostically, without any knowledge of the labels for the points. However, Table 3 highlights a common deficiency of many dimensionality reduction algorithms. Often, if dimensionality reduction is used in an unsupervised setting as a preprocessing step, it can reduce performance, and a classifier formed from the low-dimensional data can perform worse. Simply preserving distances during dimensionality reduction is not enough to preserve accuracy for a subsequent classification problem. This is not the case for MVE+SP, which maintains (or even helps) classification rates, further reinforcing the strength of the structure preserving constraints in creating accurate low-dimensional representations of data (without any knowledge of labels or a classification task). Furthermore, these results confirm a recent finding regarding the finite-sample risk of a k-nearest neighbor classifier (Snapp & Venkatesh, 1998), which shows that the finite-sample risk can be expressed asymptotically as

R_n = R_∞ + Σ_{j=2}^{k} c_j n^{−j/d} + O(n^{−(k+1)/d}),   n → ∞,

where n is the number of finite samples, d is the dimensionality of the data, k is the number of neighbors in the nearest neighbor classifier, and c_2, ..., c_k are scalar coefficients specified in (Snapp & Venkatesh, 1998). The main theorem of (Snapp & Venkatesh, 1998) suggests that this expansion validates the curse of dimensionality and, in order to maintain |R_n − R_∞| < ε for some ε > 0, the sample size must scale with ε according to n ∝ ε^{−d/2}. Thus, if the k-nearest neighbor structure of the data is preserved exactly while the dimensionality d is reduced (for instance using the approach of MVE+SP), the finite-sample risk should more quickly approach the infinite-sample risk. This makes it possible to more accurately use training performance on low-dimensional k-nearest neighbor graphs to estimate test performance.

Table 3. Average classification accuracy of a 1-nearest neighbor classifier on UCI datasets. MVE+SP has higher accuracy than other low-dimensional methods and also beats All-Dimensions on Ionosphere, Ecoli, and OptDigits. Other class pairs for OptDigits are not shown since all methods achieved near 100% accuracy.

                     KPCA     MVU      MVE      MVE+SP   All-Dimensions
Ionosphere           66.0%    85.0%    81.2%    87.1%    78.8%
Cars                 66.1%    70.1%    71.6%    78.1%    79.3%
Dermatology          58.8%    63.6%    64.8%    66.3%    76.3%
Ecoli                94.9%    95.6%    94.8%    96.0%    95.6%
Wine                 68.0%    68.5%    68.3%    69.7%    71.5%
OptDigits 4 vs. 9    94.4%    99.2%    99.6%    99.8%    98.6%

7. Conclusion

This article suggests that preserving local distances or using spectral methods is insufficient for faithfully embedding graphs in low-dimensional space, and demonstrates the improvements of SPE over current graph embedding algorithms in terms of both the quality of the resulting visualizations as well as the amount of information compression. SPE allows us to accurately visualize many interesting network structures, ranging from classical graphs, to organic compounds, to the link structure between websites, using relatively few dimensions. Furthermore, by incorporating structure preserving constraints into existing nonlinear dimensionality reduction algorithms, these methods can explicitly preserve graph topology in addition to local distances, and produce more accurate low-dimensional embeddings.

References

Adamic, L. A., & Glance, N. (2005). The political blogosphere and the 2004 US election. WWW-2005 Workshop on the Weblogging Ecosystem.

Arora, S., Rao, S., & Vazirani, U. (2004). Expander flows, geometric embeddings and graph partitioning. Symposium on Theory of Computing.

Asuncion, A., & Newman, D. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Battista, G. D., Eades, P., Tamassia, R., & Tollis, I. (1999). Graph drawing: algorithms for the visualization of graphs. Prentice Hall.

Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.

Burer, S., & Monteiro, R. D. C. (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (series B), 95(2), 329–357.

Chung, F. R. K. (1997). Spectral graph theory. American Mathematical Society.

Cox, T., & Cox, M. (1994). Multidimensional scaling. Chapman & Hall.

Finley, T., & Joachims, T. (2008). Training structural SVMs when exact inference is intractable. Proc. of the 25th International Conference on Machine Learning (pp. 304–311).

Fremuth-Paeger, C., & Jungnickel, D. (1999). Balanced network flows, a unifying framework for design and analysis of matching algorithms. Networks, 33, 1–28.

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.

Shaw, B., & Jebara, T. (2007). Minimum volume embedding. Proc. of the 11th International Conference on Artificial Intelligence and Statistics (pp. 460–467).

Snapp, R., & Venkatesh, S. (1998). Asymptotic expansions of the k nearest neighbor risk. The Annals of Statistics, 26, 850–878.

Sontag, D., & Jaakkola, T. (2008). New outer bounds on the marginal polytope. Advances in Neural Information Processing Systems 20 (pp. 1393–1400).

Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Weinberger, K. Q., Packer, B. D., & Saul, L. K. (2005). Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. Proc. of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 381–388).