
How deep learning works — The geometry of deep learning

Xiao Dong, Jiasong Wu, Ling Zhou
Faculty of Computer Science and Engineering, Southeast University, Nanjing, China

Why and how deep learning works well on different tasks remains a mystery from a theoretical perspective. In this paper we draw a geometric picture of the deep learning system by finding its analogies with two existing geometric structures, the geometry of quantum computations and the geometry of diffeomorphic template matching. In this framework, we give the geometric structures of different deep learning systems including convolutional neural networks, residual networks, recursive neural networks, recurrent neural networks and the equilibrium propagation framework. We can also analyze the relationship between the geometric structures of different networks and their performance at an algorithmic level, so that the geometric framework may guide the design of the structures and algorithms of deep learning systems.

Index Terms—Deep learning, geometry, quantum computation, computational anatomy

I. INTRODUCTION

In the last decade, deep learning systems have shown fascinating performance in solving different complex tasks. We have designed different system structures for different problems and revealed some general rules for the design of deep learning systems. There are also theoretical attempts to understand deep learning systems from both mathematical and physical perspectives[1]. But we still lack a theoretical framework that answers the question of why and how deep learning systems work. It is also highly desirable that a theoretical framework of deep learning systems can be used to guide the design of the structures of deep learning systems at an algorithmic level.

In this paper we try to fill this gap by drawing a geometric picture of deep learning systems. We build our geometric framework for understanding deep learning systems by comparing the deep learning system with two existing mature geometric structures, the geometry of quantum computations and the geometry of diffeomorphic template matching in the field of computational anatomy. We show that deep learning systems can be formulated in a geometric language, by which we can draw geometric pictures of different deep learning systems including convolutional neural networks, residual networks, recursive neural networks, fractal neural networks and recurrent neural networks. Moreover, these geometric pictures can be used to analyze the performance of different deep learning structures and provide guidance for the design of deep learning systems at an algorithmic level.

The rest of this paper is arranged as follows. We will first give a brief overview of the geometry of quantum computations and the geometry of diffeomorphic template matching. Then we will explain the geometric framework of deep learning systems and apply our framework to draw corresponding geometric pictures of different deep learning networks. Finally we will give a general optimization-based framework of deep learning systems, which can be used to address the equilibrium propagation algorithm.

II. WHAT CAN GEOMETRY TEACH US

It is well known that geometry is not only a core concept of mathematics but also plays a key role in modern physics. The great success of the geometrization of physics tells us that the soul of physical systems lies in their geometric structures. It is natural to ask whether geometry can help to reveal the secret of our human intelligence and of our state-of-the-art artificial intelligence systems, deep learning systems. The answer is YES.

We will first introduce two interesting geometric structures in the fields of quantum computation and computational anatomy. We will see that their geometric structures share some similarities with deep learning systems, and that the geometric framework of deep learning systems can be built on an understanding of these structures.

A. Geometry of quantum computation

Geometric concepts have been widely discussed in formulating quantum mechanics and quantum information processing systems, including the geometry of quantum states and their evolution[2][3][4], the geometry of entanglement[5][6] and also the relationship between the spacetime structure and the geometry of quantum states[7][8][9].

In [10][11] a geometric framework was proposed for the complexity of quantum computations. Its basic idea is to introduce a Riemannian metric on the space of n-qubit unitary operators so that the quantum computation complexity becomes a geometric concept, as given by the slogan quantum computation as free falling. The key components of this framework can be summarized as follows.

• An algorithm of an n-qubit quantum computation system is a unitary operation U ∈ U(2^n), which evolves the n-qubit initial state |ψ_ini⟩ = |00...0⟩ to the final state |ψ_fin⟩ = U|ψ_ini⟩. U(2^n) is the space of unitary operations on n qubits, which is both a manifold and a Lie group.

• Any physical realization of the algorithm U is a curve U(t) ∈ U(2^n), t ∈ [0, 1], with U(0) = I and U(1) = U, where I is the identity operator. A smooth curve U(t) can be generated by a Hamiltonian H(t) so that dU(t)/dt = −iH(t)U(t), where H(t) lies in the tangent space of U(2^n) at U(t) and is also an element of the Lie algebra u(2^n).



• A Riemannian metric ⟨·, ·⟩_U can be introduced on the manifold U(2^n) so that a Riemannian structure can be built on U(2^n). The metric is defined as

⟨H, J⟩_U = [tr(H P(J)) + q tr(H Q(J))]/2^n    (1)

where H and J are tangent vectors in the tangent space at U, P(J) is the projection of J onto the space P of simple local Hamiltonians, Q = 1 − P, and q ≫ 1. With this metric, any finite-length curve U(t) connecting the identity I and U will only contain local Hamiltonians when q → ∞. Usually the geodesic U_geo(t) connecting I and U has the minimal length, which is defined as the complexity of the algorithm U. We also know the geodesic is given by the EPDiff equation, which is obtained by solving the optimization problem min_{H(t)} ∫ ⟨H(t), H(t)⟩_U^{1/2} dt. (A small numerical sketch of this penalty metric is given after this list.)

• The quantum circuit model of quantum computations tells us that any U can be approximated to any accuracy with a small universal gate set, which contains only simple local operators, for example a finite number of 1- and 2-qubit operators. The quantum circuit model can then be understood as approximating U(t) with a piecewise smooth curve, where each segment corresponds to a universal gate. It is proven in [10][11] that the number of universal gates needed to approximate U within some constant error is polynomial in n and in the geodesic distance d(I, U) (roughly cubic in d(I, U)). So an algorithm U can be efficiently realized by the quantum circuit model only when its geodesic distance to the identity operator I is polynomial with respect to n.

• The set of operators that can be reached by a geodesic whose length is polynomial in the qubit number n is only a small subset of U(2^n). Equivalently, starting from a simple initial state |ψ_ini⟩ = |000...0⟩, the quantum states that can be obtained efficiently by local quantum operations form a small subset of the quantum state space. This defines the complexity of a quantum state.

• Finding the geodesic U_geo(t) can be achieved by either the lifted Jacobi field or the standard geodesic shooting algorithm. The performance of the optimization procedure that finds U_geo(t) highly depends on the curvature of the Riemannian manifold. For the metric given in (1), the sectional curvature is almost negative everywhere when q is large enough. This means the geodesics are unstable and it is generally difficult to find a geodesic.
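To make the penalty metric (1) concrete, here is a minimal numerical sketch (our illustration, not code from [10][11]). It expands a Hamiltonian in the Pauli basis, projects onto the strings acting on at most two qubits (the local subspace P; the identity string is dropped since it only contributes a global phase), and evaluates ⟨H, J⟩_U for a small n.

```python
import itertools
import numpy as np

# Single-qubit Pauli matrices.
PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(labels):
    """Tensor product of single-qubit Paulis, e.g. "XXI" -> X (x) X (x) I."""
    out = np.array([[1.0 + 0j]])
    for l in labels:
        out = np.kron(out, PAULIS[l])
    return out

def penalty_metric(H, J, n, q=1e3, k_local=2):
    """<H, J>_U = [tr(H P(J)) + q tr(H Q(J))] / 2^n, where P projects onto
    Pauli strings acting on at most k_local qubits."""
    dim = 2 ** n
    PJ = np.zeros((dim, dim), dtype=complex)
    for labels in itertools.product("IXYZ", repeat=n):
        weight = sum(l != "I" for l in labels)
        if 0 < weight <= k_local:
            P_s = pauli_string(labels)
            coeff = np.trace(P_s @ J) / dim      # Pauli expansion coefficient of J
            PJ += coeff * P_s
    QJ = J - PJ
    return ((np.trace(H @ PJ) + q * np.trace(H @ QJ)) / dim).real

# A 2-local Hamiltonian is "cheap"; a 3-body term pays the penalty q.
n = 3
H_local = pauli_string("XXI") + pauli_string("IZZ")
H_global = pauli_string("XYZ")
print(penalty_metric(H_local, H_local, n))    # order one
print(penalty_metric(H_global, H_global, n))  # order q
```

Running it shows that a two-local Hamiltonian has a metric value of order one, while a three-body term is penalised by the factor q, which is exactly the mechanism that makes non-local directions expensive.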

The key lesson of the geometry of quantum computations is the following: under a metric that prefers local Hamiltonians, the complexity of a general quantum algorithm U is exponential in the number of qubits n. If the complexity of U is polynomial, it can be achieved efficiently with local operators, either by a continuous local Hamiltonian H(t) or by a set of universal local operators. The performance of the optimization procedure that finds a good realization of U depends on the curvature of the Riemannian manifold.

We will see that this observation is closely related to the question of why and how cheap deep learning works.

Fig. 1. Geometry of quantum computation. The yellow area gives the space of quantum algorithms that can be efficiently achieved by the quantum circuit model. Given a Riemannian structure on U(2^n), there exist an arbitrary number of ways to achieve an algorithm U, and its complexity is given by the length of the geodesic connecting I and U. Any efficient realization curve U(t) can be approximated by a universal gate set with a polynomial cost, as shown by the red segmented line.

B. Geometry of diffeomorphic template matching

The diffeomorphic framework of computational anatomy aims to achieve a diffeomorphic matching between images on a Riemannian structure or a Lie group. The geometry of diffeomorphic template matching not only provides a mathematical framework for image registration but also validates the possibility of statistical analysis of anatomical structures on manifolds[12][13][14][15][16][17][18].

We can summarize the geometry of diffeomorphic template matching as follows:

• The Riemannian geometry of diffeomorphic template matching starts with the claim that a diffeomorphic matching between two images I_0 and I_1 defined on a space M = R^n (n = 2 for 2-dimensional images or n = 3 for 3-dimensional volume data) is achieved by a diffeomorphic transform g ∈ Diff(M), where Diff(M) is the diffeomorphism group of the space M.

• A matching between I_0 and I_1 is achieved by a smooth curve g(t) on Diff(M) connecting the identity transform I and g. The optimal g(t) is obtained from the optimization problem

min_{u(t)} E = min_{u(t)} ∫_0^1 ⟨u(t), u(t)⟩_v dt + (1/σ^2) ‖I_1 − I_0 ∘ g_1‖^2_{L^2}    (2)

where dg(t)/dt = u(t) ∘ g(t), g(0) = I, and ⟨u(t), u(t)⟩_v = ⟨Lu(t), Lu(t)⟩_{L^2} is a metric on the tangent space of the manifold Diff(M), with L a linear operator. This is the LDDMM framework[12], and the optimization results in the Euler-Lagrange equation given by

∇_{u_t} E = 2u_t − K( (2/σ^2) |Dg^u_{t,1}| ∇J^0_t (J^0_t − J^1_t) ) = 0    (3)

where J^0_t = I_0 ∘ g^u_{t,0}, J^1_t = I_1 ∘ g^u_{t,1}, and g^u_{s,t} is the transformation generated by u(t) during the time interval s to t.


Fig. 2. The geometry of diffeomorphic template matching. Left: A diffeomorphic matching is a curve g(t) in Diff(M), the space of diffeomorphic transformations of the image. Right: The image defined on M is smoothly deformed from I_0 to a near neighbour of I_1 by the curve g(t).

K is the operator defined by ⟨a, b⟩_{L^2} = ⟨Ka, b⟩_v, i.e., K(L†L) = 1. In computational anatomy K is usually taken as a Gaussian kernel, whose effect is to generate a smooth velocity field u(t) that leads to a smooth transformation g(t). The geometric picture of LDDMM is to find a geodesic g(t) that transforms I_0 = I_0 ∘ g(0) to a near neighbour of I_1, so that I_1 ≈ I_0 ∘ g(1). The optimal velocity field u(t) is obtained with a gradient descent algorithm.

• An alternative approach is the geodesic shooting algorithm[19][20][21], which is essentially a variational problem on the initial velocity u(0) of the geodesic g(t), with the EPDiff equation taken as a constraint.

• The sectional curvature of the volume-preserving diffeomorphisms of the flat torus with a weak L^2 right-invariant metric is negative in many directions, so the geodesics on this manifold can be unstable. We mention this point to show that here we might also face a negative-curvature situation similar to the one in the geometry of quantum computation.

• If instead we do not introduce the metric ⟨u(t), u(t)⟩_v on Diff(M), Diff(M) can be regarded as a Lie group. Then an alternative matching curve g(t) can be formulated as a Lie group exponential map exp(tg), t ∈ [0, 1], with g ∈ diff(M), where diff(M) is the Lie algebra of Diff(M). This is how the SVF framework[13][14] formulates the template matching problem. (A minimal numerical sketch of this exponential map is given after this list.)

• There exists a mathematically rigorous framework for the geometry of template matching which shares many similarities with mechanical systems. For more information on the geometry of template matching, please refer to [14][16][17][18][22].
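As an illustration of the SVF picture, the following minimal sketch computes the Lie group exponential of a stationary velocity field by scaling and squaring on a 2-D grid. It is our own toy implementation, assuming a displacement-field representation and linear interpolation; it is not the code of [13][14].

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose(phi, psi):
    """Displacement field of phi o psi, where phi and psi are displacement
    fields of shape (2, H, W): (phi o psi)(x) - x = d_psi(x) + d_phi(x + d_psi(x))."""
    H, W = phi.shape[1:]
    grid = np.mgrid[0:H, 0:W].astype(float)
    coords = grid + psi                      # the points psi(x)
    warped = np.stack([map_coordinates(phi[c], coords, order=1, mode="nearest")
                       for c in range(2)])   # d_phi evaluated at psi(x)
    return warped + psi

def svf_exp(v, n_steps=6):
    """Lie group exponential exp(v) of a stationary velocity field v
    by scaling and squaring: start from Id + v/2^N and square N times."""
    phi = v / (2 ** n_steps)
    for _ in range(n_steps):
        phi = compose(phi, phi)
    return phi

# Example: exponentiate a small smooth velocity field on a 64x64 grid.
H, W = 64, 64
yy, xx = np.mgrid[0:H, 0:W]
v = np.stack([3.0 * np.sin(2 * np.pi * xx / W),
              3.0 * np.cos(2 * np.pi * yy / H)])
g = svf_exp(v)          # displacement field of the diffeomorphism exp(v)
print(g.shape)          # (2, 64, 64)
```

The same compose routine can, in principle, also integrate the LDDMM flow dg(t)/dt = u(t) ∘ g(t) by composing small Euler steps (Id + δt·u(t_k)) in sequence.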

The key concept of the geometry of template matching is this: the template matching can be achieved by either a Riemannian geodesic or a Lie group exponential mapping in Diff(M) that transforms the source image to approximate the destination image. Finding the optimal curve can be formulated as an optimization problem. The Riemannian metric or the Lie group structure of Diff(M) plays a central role in the optimization procedure.

Obviously the two geometric structures share many common characteristics. Both systems consider the geometry of

Fig. 3. The geometry of diffeomorphic computational anatomy. Left: The geometry of template matching on Riemannian manifolds, where the optimal matching is achieved by a geodesic. Right: The geometry of template matching on a Lie group, where the optimal matching is given by the Lie group exponential mapping.

a mapping T × V → V, where V is a vector space (the state space of n qubits or the space of images on M = R^n, respectively) and T is the space of transformations on V (SU(2^n) or Diff(M)). By introducing a Riemannian metric on T and an inner product on V, both of them address the problem of finding a curve on a Riemannian manifold (SU(2^n) or Diff(M)) to connect the identity operator with a target operator (U or g). Equivalently, this curve achieves a transformation from an initial state (|ψ_ini⟩ or I_0) to a near neighbourhood (defined by the inner product on V) of a final state (|ψ_fin⟩ or I_1) in V. The problem can be formalized as an optimization task, and the optimal curve is a Riemannian geodesic (a Lie group exponential for SVF). The stability of geodesics and the convergence property of the optimization procedure highly depend on the Riemannian metric of the manifold. In both geometric structures, we have to face the negative sectional curvature problem.

Besides their similarities, there exists a key difference between these two geometric structures. The geometry of quantum computation addresses the problem of reaching a point efficiently by local operations, so it focuses on the computability of a physical system. For more discussion on computability and physical systems, please refer to [10]. On the contrary, in the template matching problem we do not emphasize the use of local operations. The reason is that the deformation curve g(t) is in fact generated purely by local operations, since it is an integration curve of a time-varying velocity field, which is essentially local. Also, in the template matching problem it seems that we do not care whether a transformation g ∈ Diff(M) can be efficiently computed.

The reason for the difference between them is related to the concept of the diameter of a Riemannian manifold, which is defined as the maximal geodesic distance between arbitrary points on the manifold. For the geometry of quantum computation, the manifold has an exponentially large diameter with respect to the qubit number n, so only a subset of the operations can be efficiently carried out. For the template matching problem, the diameter is usually finite. In fact the real situation is quite complex. For example, if we work on the volume-preserving diffeomorphism group SDiff(M^n) of an n-dimensional cube M^n in R^n, the diameter is finite for


n ≥ 3 but infinite for n = 2 if an L^2 right-invariant metric is used. For Diff(M^n), an H^1 metric results in a finite diameter, and an L^2 metric can even lead to a vanishing geodesic distance, which means any two points can be connected by a geodesic with an arbitrarily small length. So generally we assume all the transformations in the template matching problem can be efficiently computed.

We will see that the diameter of the Riemannian manifold of deep learning systems is closely related to the question of why deep learning works.

III. GEOMETRY OF DEEP LEARNING

Based on the geometric structures described above, we propose a geometric framework to understand deep learning systems. We will first focus on the most popular deep learning structure, the deep convolutional neural network (CNN). The same framework can then be used to draw geometric pictures of other systems including the residual network (ResNet), the recursive neural network, the fractal neural network and the recurrent neural network.

A. Geometry of convolutional neural networks

The convolutional neural network is the most well-studied deep learning structure; it can achieve reliable object recognition in computer vision systems by stacking multiple CONV-ReLU-Pooling operations followed by a fully connected network. From a geometric point of view, a CNN achieves object classification by performing a nonlinear transformation U_CNN from the input image space, where the object classification is difficult, to the output vector space, in which the classification is much easier. We will show that there exists an exact correspondence between the CNN and the geometry of the quantum circuit model of quantum computation, where U_CNN corresponds to the unitary operator U of the quantum computation and the CONV-ReLU-Pooling operation plays the role of the local universal gates.

The geometric structure of CNNs is constructed as follows.

• The manifold
We first construct the mapping T_CNN × V_CNN → V_CNN. In a CNN with L layers, each layer accomplishes a transformation of the input data. We can easily embed the input and output data of each step into a Euclidean space R^n with a high enough dimension n. We call it V_CNN, and its dimension is determined by the structure of the CNN, including the input data size and the kernel number of each level. Accordingly, T_CNN can be taken as the automorphism group of V_CNN. The goal of the CNN is then to find and realize a proper transformation U_CNN ∈ T_CNN by constructing a curve U_CNN(t) on T_CNN, which consists of L line segments, to reach U_CNN.

• The Riemannian metric
It is easy to see that the local CONV-ReLU-Pooling operation of each layer corresponds to the local unitary operations in the quantum computation system. The details of each CONV-ReLU-Pooling operation define the allowed local operations, i.e., the set P in the quantum computation system.
Now we have two different ways to understand the metric on T_CNN. We can define a metric similar to (1), which means that at each layer of the CNN, this metric only assigns a finite length to the allowed local operations defined by the CONV-ReLU-Pooling operation of this layer. In this picture the metric changes during the forward propagation along the CNN, since the configuration of the CONV-ReLU-Pooling operator differs at each layer, so U_CNN(t) is a curve on a manifold with a time-varying metric. To rescue our Riemannian structure from this complicated situation, we can define another, computational-complexity-oriented metric by scaling the metric at each layer by the complexity of the CONV-ReLU-Pooling operation. For example, if at a certain layer we have N_k kernels of size K_x × K_y × K_z and the input data size is S_x × S_y × S_z, then we scale the metric by a factor N_k K_x K_y K_z S_x S_y S_z (a small numerical sketch of this scaling appears after this list). With this metric, U_CNN(t) is a curve on T_CNN with the complexity-oriented metric, and the length of U_CNN(t) roughly corresponds to the complexity of the CNN, if the nonlinear operations are omitted. Of course, the nonlinear ReLU and pooling operations in fact also change the metric. Existing research shows that they have a strong influence on the metric, as will be explained below.

• The curvature
If the CNN is formulated as a curve on a Riemannian manifold, then obviously the convergence of the training procedure that finds the curve highly depends on the curvature of the manifold. It is difficult to compute the curvature of T_CNN explicitly. But we can make a reasonable guess, since we know that the Riemannian manifold of quantum computation, which has a very similar metric, has a curvature that is almost negative everywhere. Of course the metric of the CNN is more complex, since here we also have the nonlinear component. If T_CNN also has an almost negative curvature under the first definition of the metric given above, then our second, scaled metric will result in a highly curved manifold with a large negative curvature depending on the size of the kernels and the selection of the activation function. Roughly speaking, a larger kernel size and a stronger nonlinearity of the activation function lead to a more negative curvature.

• The target transformation U_CNN
Different from the case of quantum computation, where the algorithm U is known, here we do not have U_CNN in advance. U_CNN is found in a similar way as in the template matching case, where the target transformation g is obtained by an optimization that minimizes the error between the transformed source image and the destination image, ‖I_1 − I_0 ∘ g_1‖^2_{L^2}. In CNNs we find U_CNN in a similar way. The only difference is that now the optimization is carried out on an ensemble of source and destination points, i.e., the training data with their corresponding labels.

• The curve U_CNN(t)
U_CNN(t) is a piecewise smooth curve consisting of L line segments, where the length of each line segment corresponds to the computational complexity of each


Fig. 4. The geometry of CNNs. Similar to the geometry of quantum computation, T_CNN is the space of all possible transformations that can be realized by a CNN. The Riemannian metric on T_CNN is determined by the design of the CONV-ReLU-Pooling operations of the CNN, and accordingly the curvature of the Riemannian manifold will influence the optimization procedure during the training phase of CNNs. The yellow area is the subspace of all the transformations that can be efficiently achieved by CNNs. Any given CNN structure corresponds to a curve in T_CNN, where each line segment corresponds to the CONV-ReLU-Pooling operation of one layer and the length of the line segment is proportional to the computational complexity of this layer. Compared with the geometry of template matching, a CNN does not aim to find the optimal curve, i.e., the geodesic; instead it aims to find a curve of fixed length, proportional to the depth of the CNN.

layer. So geometrically a CNN aims to find a curve with a given geometric structure to connect I and U_CNN on T_CNN.
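As a concrete illustration of the complexity-scaled metric mentioned in the list above, the following sketch computes the segment length of each layer and the total curve length for a toy CNN description. The layer sizes are illustrative assumptions, and nonlinearities are ignored exactly as in the text.

```python
from dataclasses import dataclass

@dataclass
class ConvLayer:
    kernels: int          # N_k
    kernel_size: tuple    # (K_x, K_y, K_z)
    input_size: tuple     # (S_x, S_y, S_z)

def segment_length(layer: ConvLayer) -> int:
    """Length of one line segment of U_CNN(t) under the complexity-scaled
    metric: proportional to N_k * K_x*K_y*K_z * S_x*S_y*S_z."""
    Nk = layer.kernels
    Kx, Ky, Kz = layer.kernel_size
    Sx, Sy, Sz = layer.input_size
    return Nk * Kx * Ky * Kz * Sx * Sy * Sz

def curve_length(layers) -> int:
    """Total length of the piecewise curve, i.e. the rough computational
    complexity of the whole CNN (nonlinearities ignored)."""
    return sum(segment_length(l) for l in layers)

# Toy three-layer network; the sizes are illustrative only.
net = [
    ConvLayer(kernels=32,  kernel_size=(3, 3, 3),  input_size=(32, 32, 3)),
    ConvLayer(kernels=64,  kernel_size=(3, 3, 32), input_size=(16, 16, 32)),
    ConvLayer(kernels=128, kernel_size=(3, 3, 64), input_size=(8, 8, 64)),
]
print([segment_length(l) for l in net], curve_length(net))
```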

The Riemannian structure given above is built on the transformation space T_CNN. In fact, just as in both quantum computation and template matching, we can equivalently build another Riemannian structure on the data space V_CNN. We can regard the CNN as achieving a change of metric from the input data space to the final output data space, defined by ⟨w, v⟩_in = ⟨U_∗w, U_∗v⟩_out, where ⟨·, ·⟩_in and ⟨·, ·⟩_out are the metrics on the input and output manifolds respectively and U_∗ is the differential of the transformation U. In this construction, the metric of the data space changes along the data flow in the network. As we indicated before, the input data cannot be easily classified in the input data space by defining a simple distance between points, but this is much easier in the output data space, even with a Euclidean distance. This means that by applying the transformation U_CNN, the CNN transforms a highly curved input data space into a flatter output data space by a curvature-flow-like mechanism. This gives an intuitive picture of how a complex problem can be solved with a CNN by flattening a highly curved manifold, as in [23]. But it is difficult to estimate the curvature of the input data space except for some special cases, as in [10]. In the rest of this paper we will focus on the Riemannian structure on T_CNN, since it is relatively easier to analyze the curvature of T_CNN.
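The pullback metric ⟨w, v⟩_in = ⟨U_∗w, U_∗v⟩_out described above can be made explicit for any differentiable map. The sketch below (our illustration, with a random one-layer map standing in for a trained network) computes the induced input-space metric J^T J from a finite-difference Jacobian.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian J of f at x (the differential U_*)."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

def pullback_metric(f, x):
    """Pullback of the Euclidean output metric: <w, v>_in = w^T (J^T J) v."""
    J = numerical_jacobian(f, x)
    return J.T @ J

# Toy "network" layer U: R^3 -> R^3 (random weights, tanh nonlinearity).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
U = lambda x: np.tanh(W @ x)

x = rng.normal(size=3)
G = pullback_metric(U, x)
print(np.round(G, 3))        # the input-space metric induced at x by the layer
```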

We now show how the design of the CNN structure influences the geometry and how the geometric structure can affect the performance of CNNs.

First we need to clarify how the components of CNNs affect the geometric structure.

• Fully connected network
Usually there is a fully connected network at the final stage of a CNN. From existing experimental work we know that its main function is to collect multiple modes of a certain category of object. For example, if the task is to identify cats, the FC network is in charge of collecting the information about cats with different locations and gestures. We also know that with an increase of the kernel number and depth, the FC part can be omitted. In other words, the convolutional network before the FC network has in fact already achieved an over-classification of the objects, and the FC network only performs a clustering of this information. So the FC network does not play an important role in the geometry of CNNs.

• Depth of CNN
The depth of a CNN is related to the total length of the curve U_CNN(t), i.e., the complexity of the CNN. Obviously a deeper CNN can potentially achieve a more complex transformation in T_CNN.

• CONV operation
The size and the number of kernels of the CONV operation lead to different V_CNN and T_CNN, as described above. Also, as a component of the local CONV-ReLU-Pooling operation, CONV influences the complexity of local operators and the metric as well. Generally, a bigger kernel size and kernel number increase the complexity of the allowed local operators. This can be understood by comparison with the Riemannian metric defined in the geometry of quantum computations. The CONV of each layer also changes the structure of the curve U_CNN(t).

• ReLU operation
The role of the ReLU operation (or a general activation function) is twofold. Firstly, it provides the necessary nonlinear component to achieve the target nonlinear transformation U_CNN. Secondly, different choices of the activation function show different nonlinear properties, so the activation function also influences the complexity of local operations and the Riemannian metric. Existing experimental work on the influence of the activation function on the convergence of CNNs indicates that it may have a very strong effect on the Riemannian metric, which leads to a change of the curvature of the Riemannian manifold T_CNN. This observation helps to analyze the influence of the activation function on the convergence of the training procedure.

• Pooling operation
The function of the pooling operation is also twofold. On one hand, it can reduce the complexity by changing V_CNN and T_CNN. On the other hand, it also plays a role in constructing the metric, corresponding to the kernel K in the LDDMM framework. By using pooling operations, we can improve the smoothness and robustness


of the curve U_CNN(t), since U_CNN is estimated by a statistical optimization using samples from the input image space, i.e., the training data. The pooling operation can be understood as an extrapolation of the samples, so that the information of each sample is propagated to its near neighbourhood.

Given the above analysis, we can answer the following questions from a geometric point of view:

• Why can CNNs work well in computer vision systems?
There are some theoretical considerations that address this question. In [1] it is claimed that the special structure of the image classification task enables the effectiveness of CNNs, where the problem is formulated as the efficient approximation of a local Hamiltonian system with local operations of CNNs. By checking the analogy between the CNN and quantum computation, we see that the argument of [1] is equivalent to saying that for a task that can be solved by a deep CNN, its U_CNN falls in the subset that can be achieved by local CONV-ReLU-Pooling operators, just as the subset of U(2^n) that can be approximated by simple local universal gates in quantum computations. So the answer to this question is trivial: CNNs work because the problem falls in the set of solvable problems of CNNs. A direct question is, can we design a task that does not belong to this solvable problem set? A possible design of such a task is to classify images that are segmented into small rectangular blocks which are then randomly permuted (a toy generator for such images is sketched after this list). Obviously such images are generated by global operations which cannot be effectively approximated by local operations, and therefore the classification of such images cannot be achieved efficiently by structures like CNNs. So we can say that the diameter of the Riemannian manifold T_CNN is exponentially large with respect to the size of the input data, and only those problems which can be solved by a transformation U_CNN that has a polynomially large geodesic distance to the identity transformation I can be solved by CNNs.

• Can we determine whether a problem can be efficiently solved by CNNs?
Generally this is difficult, since it is the same question as asking whether a unitary operator can be decomposed into a tensor product of simple local unitary operators, or whether an n-qubit quantum state has a polynomial state complexity with respect to n. We already know that both of these are difficult questions.

• Why deep CNNs?
Geometrically, training a CNN with a given depth L is equivalent to finding a curve of length roughly proportional to L in T_CNN to reach the unknown transformation U_CNN. In deep learning systems this is achieved by a gradient-descent-based optimization procedure, i.e., the training of the CNN. It should be noted that the CNN does not aim to find a geodesic; instead it aims to find a curve with a fixed geometric structure, i.e., a curve consisting of multiple line segments with given lengths determined by the CNN structure. The reason to use a deep CNN is just that the length of U_CNN(t) should be longer

than the length of the geodesic connecting the identity transformation I and U_CNN on the Riemannian manifold T_CNN. Otherwise it will not work, since a short U_CNN(t) will never reach U_CNN. On the other hand, it is not difficult to guess that a too deep CNN may also fail to converge if the curve U_CNN(t) is too long, due to both the depth L and an improper design of the CONV-ReLU-Pooling operations. This can be easily understood, since reaching a point U_CNN with a longer curve U_CNN(t) simply means finding an optimum in a larger parameter space. Intuitively we prefer a design of the CNN that leads to a curve U_CNN(t) with a length just above the geodesic distance between I and U_CNN. To achieve this, we need a balance between the network depth and the complexity of the CONV-ReLU-Pooling operations. This observation will play a key role in the analysis of ResNets below.

• Why does ReLU work better than the sigmoid in deep CNNs?
A typical answer to this question is that ReLU alleviates the vanishing gradient problem so that a better performance can be achieved. Another idea is that ReLU can change the Fisher information metric to improve the convergence while the sigmoid function cannot. Here we explain this from a complexity point of view. As mentioned before, the choice of the activation function influences the complexity of the local operation. Obviously ReLU is a weaker nonlinear function than the sigmoid. This means that for a given CNN structure, using ReLU gives a lower representation capability than using the sigmoid, so that the CNN using ReLU has a smaller complexity. That is to say, a CNN using a sigmoid needs to search for an optimum in a larger state space. It is clear that as long as the complexity of the CNN is above the length of the geodesic, it is preferable to use the simpler CNN so that a better convergence can be achieved. It seems there exists a balance between the complexity of the CONV operation and that of the activation function, so that the complexity of the complete system can be distributed evenly along the network. With our first definition of the metric, the intuitive geometric picture is that the sigmoid function introduces a metric under which the curvature of the Riemannian manifold is more negative, if our guess about the curvature of T_CNN is true. A more negative curvature will of course make the curve U_CNN(t) more unstable, so that a small change of the parameters of the CNN results in a large deviation of the final point of U_CNN(t). This corresponds to the observed vanishing gradient problem during the back-propagation procedure. So for a general deep CNN, the sigmoid is too complex and ReLU is a proper activation function that can produce a Riemannian manifold with a proper curvature.

• How difficult is it to optimize the structure of CNNs?
By the optimization of the structure of CNNs we mean finding the optimal CNN structure (for example the depth and the number and sizes of kernels of a CNN) that can accomplish the classification task with the minimal computational complexity. Geometrically this is to (a) construct a metric on the Riemannian manifold T_CNN under


Fig. 5. The geometry of ResNets. (a) The original residual unit of ResNets and (b) a newly designed residual unit proposed in [25]. (c) The deep ResNet built on the residual units (a)(b). (d) ResNets hold a similar geometric picture to CNNs. The new feature of ResNets is that the curve U_ResNet(t) is a smooth curve consisting of a large number of simple near-identity transformations, so that they can be efficiently approximated by the residual unit structure.

which the length of a curve is proportional to the computational complexity of the corresponding CNN structure, and (b) realize the geodesic U_CNN^geo that connects the identity operator I with U_CNN. Generally this is difficult, especially when the curvature of the Riemannian manifold is unknown. This is the same as finding the optimal realization of an algorithm with a given universal logic gate set. For the geometry of quantum computation, the curvature is almost negative, so that it is difficult to achieve a general operator[10]. If the manifold of the fundamental law, quantum computation, has a negative curvature, it will not be surprising if the geometry of deep learning has a similar property. This point seems to be partially confirmed by the recent work [23], where a manifold with a curvature growing exponentially with the depth of a CNN can be constructed. If a CNN is applied to such an input data manifold with an exponentially large negative curvature, then the optimization will be difficult to converge.
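The block-permuted classification task proposed in the first bullet above can be generated with a few lines of code. The sketch below is a hypothetical illustration of such a task generator; the block size and image size are arbitrary assumptions.

```python
import numpy as np

def block_permute(image, block=8, seed=0):
    """Cut an image into block x block tiles and permute the tiles globally.
    Such images are produced by a global (non-local) operation, so the argument
    in the text suggests that local CONV-ReLU-Pooling operators should struggle
    to classify them efficiently."""
    H, W = image.shape[:2]
    assert H % block == 0 and W % block == 0
    tiles = [image[i:i + block, j:j + block]
             for i in range(0, H, block) for j in range(0, W, block)]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(tiles))
    cols = W // block
    out = np.zeros_like(image)
    for k, p in enumerate(perm):
        r, c = divmod(k, cols)
        out[r * block:(r + 1) * block, c * block:(c + 1) * block] = tiles[p]
    return out

# Example: permute the 8x8 blocks of a random 32x32 "image".
img = np.random.default_rng(1).random((32, 32))
scrambled = block_permute(img)
print(scrambled.shape)
```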

B. Geometry of residual networks

ResNet is a very deep network structure that shows superior performance compared with normal CNNs[24][25]. Given the geometry of CNNs, it is straightforward to draw the geometric picture of ResNets.

Similar to CNNs, ResNets also aim to find a curve U_ResNet(t), t ∈ [0, 1], to reach U_ResNet. The difference is that U_ResNet(t) consists of a large number of short line segments corresponding to simple operations U_ResNet(lδ), l = 1, 2, ..., L, with Lδ = 1. It is easy to see that when L is big enough, U_ResNet(lδ) is near the identity operation I, so that the first-order approximation of U_ResNet(lδ) gives U_ResNet(lδ) ≈ I + H_ResNet(lδ), where H_ResNet(lδ) is a weak local nonlinear transformation. This gives the structure of the ResNet, i.e., each layer of the ResNet is the addition of the original and the residual information.
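The following toy sketch (our illustration, not an actual ResNet implementation) composes many near-identity maps of the form x + ε·tanh(Wx + b) and records the resulting trajectory; each step stays small, which is the discrete picture of the smooth curve U_ResNet(t) described above.

```python
import numpy as np

def residual_layer(x, W, b, eps=0.05):
    """One near-identity step U(l*delta)(x) ~ x + eps * H(x),
    with H a weak nonlinear (tanh) perturbation."""
    return x + eps * np.tanh(W @ x + b)

def resnet_curve(x0, weights, biases):
    """Compose many near-identity layers: the discrete 'curve' acting on x0."""
    x = x0
    trajectory = [x]
    for W, b in zip(weights, biases):
        x = residual_layer(x, W, b)
        trajectory.append(x)
    return np.array(trajectory)

rng = np.random.default_rng(0)
L, d = 50, 4                      # 50 short segments in a 4-dimensional toy state space
Ws = rng.normal(scale=0.5, size=(L, d, d))
bs = rng.normal(scale=0.1, size=(L, d))
traj = resnet_curve(rng.normal(size=d), Ws, bs)
print(np.linalg.norm(np.diff(traj, axis=0), axis=1).max())  # each step stays small
```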

The reason that ResNets work better lies in the following facts:

• The ResNet achieves a roughly uniform distribution of the complexity of the problem along the curve U_ResNet(t).

Such a structure may lead to a smooth and stable information flow along the network and help to improve the convergence of the optimization procedure.

• The fact that U_ResNet(lδ) ≈ I reduces the search space of H_ResNet(lδ). For example, if the space T_ResNet is U(2^n), then H_ResNet lies in the Lie algebra u(2^n); searching in this linear space near the identity is simpler than searching over the whole group U(2^n). This may improve the performance of the optimization procedure.

We can also use this framework to analyze the properties of ResNets observed in [25]. In [25] the ResNet is formulated in the general form

y_l = h(x_l) + F(x_l, W_l)    (4)
x_{l+1} = f(y_l)    (5)

where x_l and x_{l+1} are the input and output of the l-th level and F is a residual function. In the original ResNet, h(x_l) = x_l and f = ReLU.
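A minimal numpy sketch of the general residual unit (4)-(5) is given below, showing the original form (identity shortcut h, after-addition ReLU f) and a simplified pre-activation variant in which f is the identity and the ReLU is moved inside F. Batch normalization and the exact layer ordering of [25] are omitted; this is an illustration, not the reference implementation.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)
identity = lambda z: z

def residual_unit(x, W1, W2, h=identity, f=relu, pre_activation=False):
    """General residual unit of (4)-(5): x_{l+1} = f(h(x_l) + F(x_l, W_l)).
    Original unit: h = identity, f = ReLU, F = W2 @ relu(W1 @ x).
    Simplified pre-activation unit: f = identity, ReLU moved inside F."""
    if pre_activation:
        F = W2 @ relu(W1 @ relu(x))
        f = identity
    else:
        F = W2 @ relu(W1 @ x)
    return f(h(x) + F)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1, W2 = rng.normal(scale=0.1, size=(2, d, d))
print(residual_unit(x, W1, W2))                        # original form
print(residual_unit(x, W1, W2, pre_activation=True))   # identity after-addition mapping
```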

The main results of [25] are:

• If both the shortcut connection h(x_l) and the after-addition activation f(y_l) are identity mappings, the signal can be directly propagated in both the forward and backward directions. Training in general becomes easier when the architecture is closer to these conditions.

• When h(x_l) is chosen far from the identity mapping, for example as a multiplicative manipulation (scaling, gating, 1×1 convolutions or dropout), the information propagation is hampered and this leads to optimization problems.

• Pre-activation, i.e., putting the ReLU inside F and setting f to the identity operation, improves the performance. But the impact of f = ReLU is less severe when the ResNet has fewer layers.

• The success of using ReLU in ResNets also confirms our conclusion that ReLU is a proper weak nonlinear function for deep networks, while the sigmoid is too strong for deep networks.

From the geometric point of view, the observations above are quite natural.

For a deep ResNet, U_ResNet(lδ) = I + H_ResNet(lδ); this is exactly the case where both f and h are identity mappings. If h is far from the identity mapping, then the difference between the identity mapping and h needs to be compensated by F; this makes F more complex and leads to optimization problems.

Pre-activation also aims to set f to the identity mapping, with the weak nonlinear ReLU absorbed into F. If the ResNet has fewer layers, then U_ResNet(lδ) is a little farther from the identity mapping, so that U_ResNet(lδ) = I + H_ResNet(lδ) ≈ ReLU(I + H′_ResNet) is still valid. This explains the observation that pre-activation is less crucial in a shallower ResNet.

The geometric picture easily confirms that the best ResNet structure is the one that replicates U_ResNet(lδ) = I + H_ResNet(lδ). Another related observation is that the highway network[26] usually performs worse than the ResNet, since it introduces unnecessary complexity by adding extra nonlinear transformations to the system, so that the structure of the


Fig. 6. The geometry of recursive neural networks. (a) The recursive neural network architecture which parses images and natural language sentences[27]. (b) The fixed operation of a recursive neural network[27]. (c) The Lie group exponential mapping can be taken as the geometric picture of a recursive neural network. (d) The recursive neural network can optimize both the representation mapping and a component of the Lie algebra of the Lie group that can reach the destination operation.

highway network does not match the natural structure given by the first-order approximation.

C. Geometry of recursive neural networks

Recursive neural networks are commonly used in natural scene image parsing and natural language processing tasks.

Taking the recursive neural network for structure prediction described in [27][28] as an example, the goal is to learn a function f : X → Y, where Y is the set of all possible binary parse trees. An input x ∈ X consists of (a) a set of activation vectors which represent input elements such as image segments or words of a sentence, and (b) a symmetric adjacency matrix defining which elements can be merged.

The recursive neural network accomplishes this task by learning two sub-mappings: (a) a mapping from the words of a natural language or the segments of an image to a semantic representation space; combining this map with the adjacency matrix, the original natural input data is represented in a space X; and (b) a fixed rule to recursively parse the input data in X to generate a parse tree.
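The fixed recursive rule and the greedy construction of the parse tree can be sketched as follows. This is a simplified illustration of the structure prediction of [27], assuming a tanh composition function and a linear scoring vector; it is not the authors' implementation.

```python
import numpy as np

def merge(c1, c2, W, b):
    """Fixed recursive rule: parent vector computed from two children."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def greedy_parse(leaves, W, b, w_score):
    """Greedily build a binary parse tree by repeatedly merging the adjacent
    pair whose parent vector receives the highest score."""
    nodes = list(leaves)
    tree = list(range(len(leaves)))          # leaf indices as initial subtrees
    while len(nodes) > 1:
        parents = [merge(nodes[i], nodes[i + 1], W, b) for i in range(len(nodes) - 1)]
        scores = [w_score @ p for p in parents]
        i = int(np.argmax(scores))
        nodes[i:i + 2] = [parents[i]]        # replace the pair by its parent vector
        tree[i:i + 2] = [(tree[i], tree[i + 1])]
    return nodes[0], tree[0]

rng = np.random.default_rng(0)
d, n_words = 4, 5
W = rng.normal(scale=0.5, size=(d, 2 * d))
b = np.zeros(d)
w_score = rng.normal(size=d)
leaves = [rng.normal(size=d) for _ in range(n_words)]   # word/segment embeddings
root, tree = greedy_parse(leaves, W, b, w_score)
print(tree)        # e.g. ((0, 1), ((2, 3), 4)), the predicted binary parse
```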

Geometrically we are interested in the following aspects:

• What is the corresponding geometric picture of the fixed recursive rule?

• Why can both the recursive rule and the representation mapping be learned during training?

It is easy to see that, given the representation mapping and the adjacency matrix, the recursive operation is nothing but a transformation on X. The fixed operation rule reminds us of its similarity with the Lie group exponential mapping. Though we cannot say X is a Lie group, conceptually the recursive neural network can be understood as a discretized Lie group exponential mapping. Given any input in X, the transformation between X and Y is achieved by this Lie group exponential mapping.

In CNNs and ResNets, the goal is to find a free curve with a fixed length, which is neither a geodesic nor a Lie group exponential mapping, so the complexity, or the number of degrees of freedom, of the curve is much higher. In recursive neural networks, the exponential-map-like curve is much simpler; this is why we have the capability to optimize both the representation mapping and the exponential mapping simultaneously during training.

We can also observe that the recursive neural network does not emphasize local operations as CNNs or ResNets do. This means that the recursive neural network deals with a class of problems that is similar to the template matching problem, and its analogy with the SVF framework of template matching confirms this point.

D. Geometry of recurrent neural networks

Another category of neural networks is the recurrent neural networks[29]. The key feature of an RNN is that it allows us to process sequential data and exploit the dependencies among the data. RNNs are widely used in language-related tasks such as language modeling, text generation, machine translation, speech recognition and image caption generation. Commonly used RNN structures include bi-directional RNNs, LSTM networks and deep RNNs.

Given a sequence of input data x = {x_1, x_2, ..., x_T}, a standard RNN computes the hidden vector sequence h = {h_1, h_2, ..., h_T} and the output sequence y = {y_1, y_2, ..., y_T} for every time step t = 1, 2, ..., T as:

h_t = f(W_ih x_t + W_hh h_{t−1} + b_h)    (6)
y_t = W_ho h_t + b_o    (7)

where W_ih, W_hh, W_ho and b_h, b_o are the weight matrices and the bias vectors, and f is the activation function of the hidden layer.
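A minimal forward pass implementing (6)-(7) is sketched below (our illustration; the initial hidden state is assumed to be zero and the weights are random).

```python
import numpy as np

def rnn_forward(xs, Wih, Whh, Who, bh, bo, f=np.tanh):
    """Standard RNN of (6)-(7): h_t = f(Wih x_t + Whh h_{t-1} + b_h), y_t = Who h_t + b_o."""
    h = np.zeros(Whh.shape[0])                 # h_0 = 0 (assumption)
    hs, ys = [], []
    for x in xs:
        h = f(Wih @ x + Whh @ h + bh)
        hs.append(h)
        ys.append(Who @ h + bo)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 7
xs = rng.normal(size=(T, d_in))
h_seq, y_seq = rnn_forward(
    xs,
    Wih=rng.normal(scale=0.3, size=(d_h, d_in)),
    Whh=rng.normal(scale=0.3, size=(d_h, d_h)),
    Who=rng.normal(scale=0.3, size=(d_out, d_h)),
    bh=np.zeros(d_h), bo=np.zeros(d_out))
print(h_seq.shape, y_seq.shape)   # the hidden "trajectory" and the output string
```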

Here we are not going to give a complete overview of the different variations of RNNs. Instead we will try to understand the features of RNNs from a geometric point of view.

Let us first go back to the geometry of CNNs, where a CNN is a curve on T_CNN that reaches a point U_CNN. Each given data point on V_CNN is transformed by U_CNN to another point on V_CNN. Geometrically, a CNN accomplishes a point-to-point mapping with a simple curve as its trajectory on V_CNN.

RNNs are more complex than CNNs. We can check the geometric picture of RNNs in two different ways.

The first picture is to regard the sequential input data as a series of points on V_RNN; then the geometric picture of an RNN is a mapping of a string, a series of interconnected points x = {x_1, x_2, ..., x_T}, to a destination string y = {y_1, y_2, ..., y_T} with an intermediate trajectory h = {h_1, h_2, ..., h_T}. This is not a collection of independent parallel trajectories (x_1, h_1, y_1), (x_2, h_2, y_2), ..., (x_T, h_T, y_T), since they are coupled by the recursive structure of the RNN through the hidden states h_t in (6).

Another picture of RNNs is to regard the sequential data, the string in the first picture, as a single data point; then the


trajectory of the input data is a curve in a higher-dimensional space V_RNN. Accordingly, T_RNN is the space of transformations on V_RNN. This is similar to the geometric picture of CNNs.

The difference between these two pictures is that in the first picture, the transformation working on the trajectory of each point of the string can be any transformation in T_RNN, and the trajectories of different points are coupled with each other. In the second picture, the transformation working on the data point, the complete sequence x = {x_1, x_2, ..., x_T}, consists of local operations in T_RNN, which only act on each individual datum x_i, i = 1, 2, ..., T, and on pairs of successive data x_i, x_{i+1}, again through h_t as in (6). So the second picture is somewhat trivial, since in it the RNN is only a special case of the CNN. From now on we will focus on the first picture and see how it can help us to understand RNN-related topics.

For a standard RNN, only earlier trajectories influence later trajectories, so that a later input datum feels the force of earlier data. For bi-directional RNNs, the force is bi-directional. For deep bi-directional RNNs, the trajectory is a complex sheet consisting of multiple interconnected trajectories.

This string-sheet picture reminds us of their counterparts in physics. A deep CNN corresponds to the world-line of a particle, which can achieve human vision functions. A deep RNN corresponds to the world-sheet of a string, which can be used to solve linguistic problems. It is natural to ask: what is the corresponding structure of the trajectory of a membrane, and what kind of brain function can be achieved by such a structure? Just as RNNs interconnect points into a string, we can also interconnect multiple strings to build a membrane. We can go even further and interconnect multiple membranes to generate higher-dimensional membranes. Building such structures and investigating their potential functionalities may lead to interesting results.

We now show that the LSTM, the stacked LSTM, the attention mechanism of RNNs and the grid LSTM are in fact all special cases of this membrane structure (a minimal numerical sketch of the coupled-string structure of the LSTM is given after this list).

• LSTM
The structure of the standard LSTM is shown in Fig. 8. It is straightforward to see that it is the evolution of two parallel but coupled strings, the hidden states h and the memory m. Compared with normal RNNs, where there is only a single information passage, it is possible that the existence of the parallel information passages in the LSTM makes the LSTM capable of modeling long-range dependencies.

• Stacked LSTM
The stacked LSTM is in fact a deep LSTM whose geometric picture is a cascade of double strings, as in Fig. 8.

• Grid LSTM
The grid LSTM[30] is an extension of the LSTM to higher dimensions. Accordingly, the strings of the standard LSTM are replaced by higher-dimensional membranes, as given in [30], where a 2d deep grid LSTM can be understood as multiple stacked 2d membranes. So if the deep LSTM is a cascade of coupled double strings, then the deep 2d grid LSTM is a cascade of membranes (Fig. 10). Similar geometric pictures can be constructed for higher-dimensional and deep grid LSTM systems.

• Attention mechanism
The attention mechanism is becoming popular in the decoders of machine translation and image caption generation systems. The key feature is that during text generation, an alignment or attention model (a feed-forward network in [31] or a multilayer perceptron network in [32]) is used to generate a time-varying attention vector indicating which part of the input data should receive more attention. This structure can be understood as a coupling between the normal decoder network (an RNN or an LSTM) and the attention network, as shown in Fig. 9.
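The coupled-string picture of the LSTM mentioned in the list above can be read directly from a single LSTM step: the hidden string h and the memory string m evolve together, each feeding the other. The sketch below is a minimal standard LSTM cell with the gate biases omitted; it is an illustration, not a full implementation.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, m, W):
    """One step of a standard LSTM cell: two coupled state 'strings',
    the hidden state h and the memory m, evolve together."""
    z = W @ np.concatenate([x, h])
    d = h.size
    i, f, o, g = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d]), np.tanh(z[3*d:])
    m_new = f * m + i * g            # memory string
    h_new = o * np.tanh(m_new)       # hidden string, coupled to the memory
    return h_new, m_new

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 6
W = rng.normal(scale=0.3, size=(4 * d_h, d_in + d_h))   # all gates in one matrix, biases omitted
h = np.zeros(d_h); m = np.zeros(d_h)
for x in rng.normal(size=(T, d_in)):
    h, m = lstm_step(x, h, m, W)
print(h, m)     # the two coupled trajectories after T steps
```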

It needs to be noted that there is an essential difference between the LSTM systems and the attention mechanism. The LSTM systems are couplings of homogeneous subsystems, while the attention mechanism is in fact a coupling of two heterogeneous subsystems, i.e., one decoder RNN/LSTM network and one attention model with a different structure. Other similar systems, such as the recurrent models of visual attention[33] and the neural Turing machines[34][35], also fit this picture of coupling heterogeneous subnetworks.

It can be expected that more complex and powerful systems can be constructed by such a mechanism. It will be interesting to see how such a mechanism may help us to understand key components of human intelligence such as memory, imagination and creativity.

Similar to the case of CNNs, we can also examine the metric and the curvature of the manifold of the coupled networks. Roughly, we are now working with the direct product of the manifolds of the subsystems, with the tangent spaces of the subsystems as orthogonal subspaces of the tangent space of the complete manifold. This is very similar to the semi-direct product of groups in the metamorphosis framework of the template matching system[18]. It can be expected that the structure of this manifold is much more complex than in the relatively simple CNN case. But if the curvature of each subsystem's manifold is almost negative, as in quantum computation, then the convergence of the training of the coupled system will naturally be more difficult, just as indicated in the case of neural Turing machines.

E. Geometry of GANs

Generative adversarial networks (GANs) are a framework for unsupervised learning. The key idea of GANs is to train a generative model that can approximate the real data distribution. This is achieved by the competition between a generative model G and a discriminative model D. In the standard configuration of GANs, both G and D are CNN-like networks.

We now give a geometric description of GANs and try to answer the following questions:

• Why is it difficult to train GANs?

• Why can refined GANs, including LAPGANs, GRANs and infoGANs, perform better than standard GANs?


Fig. 7. The geometry of RNNs. (a) The structure of the unrolled standard RNN. (b) The geometric picture of the RNN is given by the trajectory of a string from the initial green string to the red destination string; this is the first way to understand the geometry of RNNs. (c) The second way to understand the geometry of RNNs is to take the input sequence as a single data point; then the structure of the RNN is a curve, the same as for CNNs, now realized by local operations on the data. (d) A deep RNN network with its corresponding geometric picture as a sheet swept by a string, shown in (e), or a long curve, shown in (f). The arrows indicate information flows.

Fig. 8. The geometry of LSTM networks. (a) The standard LSTM network, where m and h are the memory and hidden states of the system. (b) The geometry of the LSTM network is given by two coupled strings, the yellow hidden-state string and the blue memory string. (c) The structure of a stacked or deep LSTM network, whose geometric picture is a cascade of coupled double strings as shown in (d). The arrows indicate information flows.

Since both G and D are realized by CNNs, the geometric picture of GANs can be built on the geometry of CNNs. From the geometric model of CNNs, it's easy to give the geometric picture of the standard DCGANs as a two-piece curve in TCNN, as shown in Fig. 11.

The two curve pieces represent the generator G and the discriminator D respectively. The goal of the training procedure is to find the two transformations that minmax the cost function of GANs. GANs differ from CNNs in the following aspects:

• The goal of CNNs is to find a curve UCNN(t) that can reach a transformation UCNN from I. For GANs, we need to find two transformations UD and UG−1 by constructing two curves to reach them from I simultaneously.

• The target transformation of D is relatively simple, but the destination transformation of G is much more flexible than UCNN. This can be easily understood since we set more constraints on UCNN by labeling the training data in CNN-based object classification systems. An easy example to illustrate this point is shown in Fig. 12. Both CNNs and the inverse of G aim to transform the training data into a Euclidean space so that the training data are properly clustered. For CNNs, the labeling of training data results in an output clustering pattern that is relatively unique. But for the unsupervised learning of the generator of GANs, there is a set of equivalent clustering patterns, which may compete with each other during the training procedure.

Fig. 9. The geometry of the attention mechanism. We take the language translation system in [32] as an example. The encoder of this system is a bi-directional RNN network which generates encoded information hi from the input series xi. The decoder is a coupling of an RNN network (the light blue string) and an alignment model (the brown string), which exchange the state information si of the RNN and the attention information ai from the alignment model, and which generates the output yi from the encoded data hi. This is a coupling of two heterogeneous networks. The arrows also indicate information flows.

Fig. 10. The geometry of the 2d deep grid LSTM. The 2d deep grid LSTM (left) has a geometric picture of multiple stacked membranes, where each membrane is constructed by coupling multiple LSTM strings in 2d space.

• The information pathways are also different in GANs and CNNs, as shown in Fig. 11. Firstly, in the design of GANs, the basic idea is to learn a representation of the training data with G−1 so that the generated examples from G have the same distribution as the training data, i.e. the generated data can fool the discriminator D. So the information flow in this design starts from the training data, goes through the reverse generator G−1, returns back through G and passes the discriminator D. Only at this point can the output of D send error information back to guide the update of the structure of G. During the network training, the information flow for D is similar to that of a normal CNN. But the optimal G is defined by the cost function of GANs, which needs an information passage through both G and D to set constraints on G, which is longer than in the CNN case.

• The geometric picture of GANs is to find a two-segment curve connecting the input space of G to the output space of D, which is required to pass through the neighbourhood of the training data in a proper way. But we can only use the information from the output of D to judge whether the curve fulfills our requirement. So any loss of information at the output of D, due to either the mismatch between the training of G and D or an improper cost function that fails to measure the distance between the curve and the training data, will deteriorate the convergence of GANs.

Given the above observations, the reason that GANs are difficult to train is straightforward.
• Compared with CNNs, GANs work on the same Riemannian manifold TCNN, which very likely has negative sectional curvature, yet GANs have to optimize two curves and a more flexible target transformation G on this manifold.
• The information pathways are longer in GANs than in CNNs. Also, the information pathways in the training procedure do not coincide with the information pathways in the design idea of GANs. This mismatch may deteriorate the system performance, since the information flow in the G−1 direction is lost in the training procedure.

To eliminate the difficulties of GANs, there exist the following possible strategies:
• To decrease the flexibility of the generator transformation UG by adding more constraints to UG.
• To improve the structures of the curves of G and D. From the analysis of CNNs, we know this means changing the metric of the manifold TCNN or, equivalently, rearranging the lengths of the line segments in the curves of G and D.
• To eliminate the information loss at the output of D so that it is more reliable to judge how the two-segment curve of the GAN passes the neighbourhood of the training data. This is essentially what the WGAN does.
• To change the information pathways during the training procedure. This may improve the efficiency of the information flow during the training so that a better convergence may become possible.

Now we explain how LAPGANs, GRANs and infoGANs improve the performance of GANs by utilizing the strategies given above.
• InfoGAN
InfoGANs aim to decrease the flexibility of the transformation UG−1, or equivalently UG, by introducing extra constraints on UG. The original GAN cost function is based on the sample-based expectation of the network outputs.

Fig. 11. The geometry of GANs. The generator G and discriminator D of a GAN correspond to two curves connecting the identity operation I with the transformations UD and UG−1 respectively. During the training procedure, the information flows (the feed-forward and the back-propagation of information) of D and G are shown by the blue and red dashed lines. In the design of GANs, the information flow is shown by the black dashed line. The unsupervised learning of the generator G of GANs leads to a much more flexible target transformation UG−1 compared with the target transformation UCNN in CNNs.

Fig. 12. The comparison of the target transformations of CNNs and GANs. We consider a simple two-cluster case where the training data are images of cats and birds. Direct clustering and interpolation on the image manifold (left) is difficult. Both the CNN and the GAN map the training data into a 2-dimensional Euclidean space to achieve a proper data clustering (right). For the CNN, the output clustering pattern is unique. But for the GAN, we have multiple equivalent patterns that are all optima of the cost function of GAN systems.


Fig. 13. Comparison of the hierarchical structures in LAPGANs, GRANs and infoGANs. Left column: the training data space; central column: the structure of the different GAN systems; right column: the representation space at the input of the generators. LAPGANs: the hierarchical structure is built on the frequency bands of all the training data, and a corresponding structure in the representation space is also constructed by LAPGAN's pyramid structure. GRANs: the hierarchical structure is a kind of self-similarity or recursive structure on the training data space, encoded in the RNN-based generator. infoGANs: the hierarchical structure is based on the clustering of the training data, where the clusters are indicated by the latent variables ({ci}) in the generator's input data space, so that this cluster-based structure is kept by the generator and can be verified by the mutual information term used in infoGANs. In each system, a hierarchical correspondence between the training data space and the representation space is built, which effectively introduces more constraints on the transformation UG so that the convergence of the training procedure can be improved.

On the contrary, infoGANs enhance the constraints on UG by setting constraints on expectations over clusters of sample data. This means that the clustering pattern in the representation space of the training data, indicated by the latent variables in the input data space of the generator G, should be kept by the generator, so that in the hidden states of the discriminator (as a kind of decoder of the generator that reveals the clustering pattern) the same clustering pattern can be observed. This is exactly why the infoGAN adds a mutual information term, which is used to check whether the clustering pattern is kept, to its cost function (a minimal sketch of such a term is given after this list). These cluster-based constraints in fact construct a hierarchical structure in both the training data space and the representation space. Asking the generator G to keep this hierarchical structure effectively reduces the flexibility of UG and improves the convergence stability.

• LAPGAN
The key feature of LAPGANs is to decompose the image space into orthogonal subspaces with different frequency bands and to run different GANs sequentially on the subspaces, with higher frequency bands conditioned on the results of lower frequency bands. From our geometrical point of view, the decomposition of the input data space results in a corresponding decomposition of the transformation space TGAN into multiple subspaces T^k_GAN, where k = 1, 2, ..., K with K the number of bands of the pyramid. So the pyramid-based LAPGAN essentially constructs the curves UD(t) and UG−1(t) in TGAN by a set of curves in the subspaces T^k_GAN using simpler local operations (a minimal pyramid decomposition sketch is given after this list). This is similar to achieving a displacement in 3-dimensional space by three displacements in the x, y and z directions.
The reason such a structure can improve the convergence of LAPGANs can be understood from different points of view.
Firstly, LAPGANs replace a complex curve UG(t) by a set of simpler curves U^k_G(t) in each orthogonal subspace T^k_GAN and then optimize those curves sequentially. So the local operations constructing the curve U^k_G(t) are simpler and the curve U^k_G(t) is also shorter than UG(t). From our above analysis of CNNs and ResNets, simpler local operations introduce a flatter transformation manifold or, equivalently, a curve consisting of shorter line segments. This improves the convergence property of each curve U^k_G. That is to say, a curve UG(t) with complex local operations is replaced by a series of shorter curves {U^k_G} with simpler local operations in the subspaces {T^k_GAN}.
Secondly, in LAPGANs a set of generators {Gk} and discriminators {Dk} are trained, one pair per subband of the training data. Obviously each pair of Gk and Dk determines a representation pattern of one data subband. These representations are also consistent since, during the training procedure, the generator Gk at a higher frequency band is conditioned on the lower-band image. This means LAPGANs replace the original cost function on the expectation over the complete data set by a set of cost functions on the expectation over each frequency band of the input data. This in fact constructs a hierarchical structure on the training data, i.e. the pyramid, and sets a set of consistent constraints on this hierarchical structure. So essentially we add more constraints to the generator transformation UG.

• GRAN
GRANs improve the performance of GANs by an RNN-based generator, so that the final generated data is constructed from the sequential output of the RNN network. We already know that an RNN essentially decomposes a single input data point of a normal CNN network into a data sequence, and therefore the trajectory of a data point in CNNs becomes a sheet swept by a string in RNNs. From this point of view, the operations of an RNN are all simpler local operations on the complete input data. What's more, similar to LAPGANs, GRANs also exploit a hierarchical structure of the system, which is encoded in the RNN structure. But this hierarchical structure is not as explicit as in LAPGANs. In other words, LAPGANs exploit the structure in the frequency domain and GRANs exploit a kind of self-similarity-like structure of the input data.
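As a rough illustration of the mutual-information term mentioned above (only a sketch of the idea, not infoGAN's exact implementation), the term can be realized as a variational bound: a hypothetical auxiliary head Q on the discriminator predicts the categorical latent code c that was fed to the generator, and the average log-likelihood it assigns to the true codes is added (with a weight) to the objective.

```python
import numpy as np

def info_term(q_logits, c_true):
    """Monte-Carlo estimate of the variational mutual-information bound:
    the log-likelihood that the auxiliary head Q assigns to the latent
    codes actually used by the generator (higher is better)."""
    # q_logits: (N, K) unnormalized scores, c_true: (N,) integer codes in [0, K).
    logp = q_logits - np.log(np.sum(np.exp(q_logits), axis=1, keepdims=True))
    return np.mean(logp[np.arange(len(c_true)), c_true])

# Toy usage: 3 samples, 4 possible categorical codes.
q_logits = np.array([[2.0, 0.1, 0.1, 0.1],
                     [0.1, 1.5, 0.1, 0.1],
                     [0.1, 0.1, 0.1, 3.0]])
c_true = np.array([0, 1, 3])
print(info_term(q_logits, c_true))   # added (with a weight) to the generator objective
```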
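The frequency-band decomposition used by LAPGANs can be illustrated with a minimal Laplacian-pyramid sketch; the blur/downsample scheme below (simple 2x2 block averaging and nearest-neighbour upsampling) is a crude stand-in for the actual filtering used in the LAPGAN paper, but it shows how the data space splits into a set of band subspaces plus a coarse residual.

```python
import numpy as np

def downsample(img):
    """2x2 block averaging (assumes even height and width)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour upsampling by a factor of 2."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels):
    """Decompose an image into high-frequency bands plus a low-pass residual.
    Each band lives in (approximately) its own subspace, which is the
    decomposition of the data space exploited by LAPGANs."""
    bands = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        bands.append(current - upsample(low))   # band-pass detail at this scale
        current = low
    bands.append(current)                        # coarsest low-pass residual
    return bands

# Toy usage: reconstruct the image by adding the bands back from coarse to fine.
img = np.random.default_rng(1).normal(size=(16, 16))
bands = laplacian_pyramid(img, levels=3)
recon = bands[-1]
for band in reversed(bands[:-1]):
    recon = upsample(recon) + band
assert np.allclose(recon, img)
```

Each GAN in the LAPGAN cascade then models one such band, conditioned on the coarser levels, which is the "set of simpler curves in subspaces" described above.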

As a summary, LAPGANs, GRANs and infoGANs all improve the performance by constructing a hierarchical structure in the system and adding more constraints on UG. Their difference lies in the way they construct the hierarchy. What's more, LAPGANs and GRANs can potentially further improve the performance by changing the structure of the curve UG(t), since they use only simpler local operations compared with standard GANs.

Another possible way to improve the performance is to change the information pathway. For example, in the training of G and D, the training data information is fed to D directly but to G only indirectly. Feeding the training data directly to G in the inverse direction (the information pathway given by the black dashed arrow in Fig. 11) can potentially further explore the structure of the representation space z.

F. Geometry of equilibrium propagation

Equilibrium propagation[36] is a new framework for machine learning, where the prediction and the objective function are defined implicitly through an energy function of the data and the parameters of the model, rather than explicitly as in a feedforward net. The energy function F(θ, β, s, v) is defined to model all interactions within the system and the interactions of the system with the external world, where θ is the parameter to be learned, β is a parameter controlling the level of influence of the external world, v is the state of the external world (input and expected output) and s is the state of the system.

The cost function is defined by
$$C_{\delta\beta}(\theta, s, v) := \delta^{T}\cdot\frac{\partial F}{\partial\beta}\big(\theta, \beta, s^{\beta}_{\theta,v}, v\big) \qquad (8)$$
where δ is a directional vector in the space of β, so the cost function is the directional derivative of the function β ↦ F(θ, β, s, v) at the point β in the direction δ.

For fixed θ, β and v, a local minimum $s^{\beta}_{\theta,v}$ of F, which corresponds to the prediction of the model, is given by
$$\frac{\partial F}{\partial s}\big(\theta, \beta, s^{\beta}_{\theta,v}, v\big) = 0 \qquad (9)$$

Then the equilibrium propagation is a two-phase procedure as follows:

1) Run a 0-phase until the system settles to a 0-fixed point $s^{\beta}_{\theta,v}$ and collect the statistics $\frac{\partial F}{\partial\theta}(\theta, \beta, s^{\beta}_{\theta,v}, v)$.

2) Run a ξ-phase, for some small ξ ≠ 0, until the system settles to a ξ-fixed point $s^{\beta+\xi\delta}_{\theta,v}$ and collect the statistics $\frac{\partial F}{\partial\theta}(\theta, \beta, s^{\beta+\xi\delta}_{\theta,v}, v)$.

3) Update the parameter θ by
$$\Delta\theta \propto -\frac{1}{\xi}\left(\frac{\partial F}{\partial\theta}\big(\theta, \beta, s^{\beta+\xi\delta}_{\theta,v}, v\big) - \frac{\partial F}{\partial\theta}\big(\theta, \beta, s^{\beta}_{\theta,v}, v\big)\right) \qquad (10)$$
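To make the two-phase procedure concrete, here is a minimal numpy sketch on a deliberately simple quadratic energy (a toy chosen for clarity, not the Hopfield-style energy used in [36]): F(θ, β, s, v) = ½||s||² − s·(Wx) + ½β||s − y||², with θ = W and v = (x, y). The state relaxes to a fixed point by gradient descent on F, and the parameter update follows Eq. (10); the names `relax`, `dF_dW`, etc. are ours.

```python
import numpy as np

def relax(W, x, y, beta, steps=500, lr=0.1):
    """Relax the state s to a minimum of
    F(W, beta, s) = 0.5*||s||^2 - s.(Wx) + 0.5*beta*||s - y||^2
    by gradient descent on s."""
    s = np.zeros_like(y)
    for _ in range(steps):
        dF_ds = s - W @ x + beta * (s - y)
        s = s - lr * dF_ds
    return s

def dF_dW(W, x, s):
    # Partial derivative of F with respect to W at a fixed state s.
    return -np.outer(s, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x, y = rng.normal(size=3), rng.normal(size=2)
xi, lr_theta = 1e-3, 0.5

# 0-phase (free phase): relax with beta = 0.
s_free = relax(W, x, y, beta=0.0)
# xi-phase (nudged phase): relax with beta = xi.
s_nudged = relax(W, x, y, beta=xi)
# Equilibrium propagation update, Eq. (10).
delta_W = -(lr_theta / xi) * (dF_dW(W, x, s_nudged) - dF_dW(W, x, s_free))

# Sanity check: for this toy energy the EP update approaches the gradient
# of the cost C = 0.5*||s_free - y||^2 with respect to W as xi -> 0.
grad_C = np.outer(W @ x - y, x)
print(np.allclose(delta_W, -lr_theta * grad_C, atol=1e-2))
```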

An equivalent constrained optimization formulation of the problem is to find
$$\arg\min_{\theta,s}\; C^{\beta}_{\theta,s}(\theta, s, v) \qquad (11)$$
subject to the constraint
$$\frac{\partial F}{\partial s}\big(\theta, \beta, s^{\beta}_{\theta,v}, v\big) = 0. \qquad (12)$$

This leads to the Lagrangian
$$L(\theta, s, \lambda) = \delta\cdot\frac{\partial F}{\partial\beta}\big(\theta, \beta, s^{\beta+\xi}_{\theta,v}, v\big) + \lambda\cdot\frac{\partial F}{\partial s}\big(\theta, \beta, s^{\beta}_{\theta,v}, v\big) \qquad (13)$$

By this formulation we can see that the update of θ in the equilibrium propagation is just one step of gradient descent on L with respect to θ,
$$\Delta\theta \propto -\frac{\partial L}{\partial\theta}(\theta, s^{*}, \lambda^{*}) \qquad (14)$$
where $s^{*}$ and $\lambda^{*}$ are determined by
$$\frac{\partial L}{\partial\lambda}(\theta, s^{*}, \lambda^{*}) = 0 \qquad (15)$$
and
$$\frac{\partial L}{\partial s}(\theta, s^{*}, \lambda^{*}) = 0 \qquad (16)$$

It’s claimed that the ξ-phase can be understood as perform-ing the back-propagation of errors. Because when a small ξδon β is introduced, the energy function F induces a new’external force’ acting on the output units of the system todrive the output to output target. This perturbation at the outputlayer will propagates backward across the hidden layers of thenetwork so that it can be thought of a back-propagation.

Here we will show that the geodesic shooting algorithm[20][19] in the diffeomorphic template matching is in fact an example of the equilibrium propagation. We will use it to give a geometrically intuitive picture of the equilibrium propagation and to show how the back-propagation is achieved in the ξ-phase.

The geodesic shooting formulates the diffeomorphic template matching as finding the optimal matching g(t), with $\dot{g}(t) = u(t)\circ g(t)$, between two images $I_0$ and $I_1$. Under the smooth transformation g(t), we define the transformed source image as $I(t) = I_0\circ g(t)$. The energy function is given by
$$F = \int_{0}^{1}\langle u(t), u(t)\rangle_{v}\, dt + \beta\,\|I_1 - I_0\circ g_1\|^{2}_{L^2} \qquad (17)$$

In the language of the equilibrium propagation, v corresponds to $I_0$ and $I_1$ as the input and output data, and the state of the system s is given by $(I(t), g(t), u(t))$. The cost function is $C^{\beta}_{\theta,s}(\theta, s, v) = \delta\,\|I_1 - I_0\circ g_1\|^{2}_{L^2}$. The prediction of the state $s^{*}$ is obtained from $\frac{\partial F}{\partial s} = 0$, which leads to the Euler-Poincaré equations
$$\dot{I}(t) = -\nabla I(t)\cdot u(t) \qquad (18)$$
$$\dot{P}(t) = -\nabla\cdot\big(P(t)\,u(t)\big) \qquad (19)$$
$$Lu(t) = -P(t)\,\nabla I(t) \qquad (20)$$
$$\hat{I}(1) = -(I(1) - I_1) \qquad (21)$$
where Lu(t) is the momentum map and $\hat{I}(t)$ is the adjoint variable of I(t). It's easy to see that the optimal state $s^{*}$ is completely determined by u(0), which can be regarded as the parameter θ to be optimized.

Taking the Euler-Poincaré equations as the constraint, the variation of (13) results in


$$\dot{I}(t) = -\nabla I(t)\cdot u(t) \qquad (22)$$
$$\dot{P}(t) = -\nabla\cdot\big(P(t)\,u(t)\big) \qquad (23)$$
$$Lu(t) = -P(t)\,\nabla I(t) \qquad (24)$$
$$\dot{\hat{I}}(t) = -\nabla\cdot\big(\hat{I}(t)\,u(t)\big) - \nabla\cdot\big((K\ast\hat{u}(t))\,P(t)\big) \qquad (25)$$
$$\dot{\hat{P}}(t) = -\nabla\hat{P}(t)\cdot u(t) + (K\ast\hat{I}(t))\cdot\nabla I(t) \qquad (26)$$
$$K\ast\hat{u}(t) = -u(t) - K\ast\big(\hat{I}(t)\,\nabla I(t) - \nabla\hat{P}(t)\,P(t)\big) \qquad (27)$$
$$\hat{I}(1) = -(I(1) - I_1) \qquad (28)$$
$$\hat{P}(1) = 0 \qquad (29)$$
$$-\hat{P}(0) = \nabla_{P(0)}\,C^{\beta}_{\theta,s}(\theta, s, v) \qquad (30)$$

where P(t) = Lu(t) is the momentum map of u(t). In the geodesic shooting framework, the above equations include three components:
• Shooting. The first three equations determine I(t), P(t) given the initial momentum P(0) and $I(0) = I_0$.
• Adjoint advection. The next five equations determine $\hat{I}(t), \hat{P}(t)$ by solving the equations in the reverse time direction, given the boundary conditions on $\hat{I}(1), \hat{P}(1)$.
• Gradient. The last equation gives the gradient of the cost function $C^{\beta}_{\theta,s}(\theta, s, v)$ with respect to the initial momentum P(0); updating P(0) amounts to updating the parameter θ = u(0).

It’s very easy to see that the three components correspondexactly to the three step in the equilibrium propagation. Theshooting and advection phases correspond to the 0-phase andξ-phase. The gradient ∇P (0)Cβθ,s(θ, s, v) is equivalent to thegradient descent ∆θ ∝ −∂L∂θ (θ, s∗, λ∗).

This example gives an intuitive picture of the equilibrium propagation as follows:
• The 0-phase shoots a geodesic starting from a given point with a given initial velocity vector.
• The ξ-phase propagates the error information (the distance between the end point of the geodesic I(1) and the expected destination $I_1$) in the backward direction.
• The update of the parameter θ is based on the information $-\hat{P}(0)$ obtained by the back-propagation procedure.

Of course, the equilibrium propagation is a general framework for deep learning, and we cannot give a concrete geometric description of it due to its flexibility. Here we just show that in some special cases the equilibrium propagation does have a geometric picture, in which the back-propagation of error information can be directly observed.

Another observation is that the equilibrium propagation is not only a framework for training the so-called implicitly defined deep learning systems as described in [36]; in fact it is also potentially a framework to optimize the structure of explicitly defined deep learning systems.

To see this point, take a deep CNN as an example, where the prediction is explicitly defined by the structure of the CNN. We can define an energy as
$$F(\theta, \beta, s, v) = E_{comp}(\theta, \beta, v) + E_{CNN}(\theta, \beta, s, v) \qquad (31)$$
where the parameter θ to be learned includes the parameters of the structure of the CNN, for example the depth and the number and size of the layers, besides the normal parameters of the CNN. $E_{CNN}(\theta, \beta, s, v)$ is then the cost function for a CNN with a given structure defined by θ, and $E_{comp}(\theta, \beta, v)$ describes the complexity of the CNN, i.e. the number of operations of the feed-forward network structure defined by θ and β. $E_{comp}(\theta, \beta, v)$ does not depend on s since the state s of the CNN is determined by θ and v.
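As a very loose illustration of such a combined energy (a toy sketch under the assumption that the "structure" is a single discrete parameter, here the degree of a polynomial model, rather than a full CNN configuration), the structure term penalizes complexity while the task term measures the fit of the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
v = np.sin(2.0 * x) + 0.05 * rng.normal(size=x.shape)   # training data (input, target)

def energy(degree, lam=0.02):
    """Toy analogue of F = E_comp + E_CNN: a complexity term proportional
    to the number of parameters plus the fitting error of the trained model."""
    coeffs = np.polyfit(x, v, degree)        # 'train' the model of this structure
    fit_error = np.mean((np.polyval(coeffs, x) - v) ** 2)
    return lam * (degree + 1) + fit_error

# 'Structure search': pick the degree that minimizes the combined energy.
degrees = range(0, 10)
best = min(degrees, key=energy)
print(best, energy(best))
```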

If we run the equilibrium propagation on this system, the system prediction $s^{*}$ can be achieved by the current gradient-descent-based CNN training method, i.e. training a CNN with a fixed configuration. Then the optimization over θ is in fact a global optimization framework to find the optimal CNN structure defined by $\theta^{*}$. Geometrically, this is to find a Riemannian manifold $T^{\theta^{*}}_{CNN}$ with its metric $g^{\theta^{*}}_{CNN}$ defined by $\theta^{*}$, so that the optimal CNN realizes a geodesic $U^{geo}_{CNN}(t)$, which gives a minimal system complexity, i.e. the most efficient CNN structure that can achieve the task.

Of course, the discrete parameters of the structure of the CNN cannot be updated by the gradient descent strategy. This is the same situation as in the reinforcement learning neural Turing machines[34] if we regard the discrete structure parameters as the actions in the neural Turing machines. So theoretically we can also train the system with reinforcement learning to find the optimal structure parameters, while the back-propagation in [34] is replaced by the equilibrium propagation. In practice this framework might not work due to its complexity, just as the neural Turing machine can only achieve simple algorithmic tasks.

IV. CONCLUDING REMARKS

This paper is based on the belief that geometry plays a central role in understanding physical systems. Based on a brief overview of the geometric structures in two fields, quantum computation and diffeomorphic computational anatomy, we showed that deep learning systems hold a similar geometric picture and that the properties of deep learning systems can be understood from this geometric perspective.

We summarize our observations on different deep learning systems as follows:

Convolutional neural networks
• The structure of a CNN defines a manifold, the transformation space TCNN, and the Riemannian metric on it.
• The goal of a CNN is to find a transformation UCNN on this Riemannian manifold, which is achieved by a curve UCNN(t) defined by the local operations of the CNN structure.
• A CNN does not find the geodesic but a curve with a given structure that reaches a point on the Riemannian manifold. A deep CNN works only when the curve it defines is longer than the geodesic between the identity operator I and UCNN. A too shallow network will never work efficiently.
• The convergence property of a CNN depends on both the curvature of the Riemannian manifold and the complexity distribution along the curve. The linear complexity defined by the configuration of kernels and the nonlinear complexity defined by the activation function should be balanced. A too deep network with an improper complexity distribution may also fail.

• Generally, finding the optimal CNN structure is difficult.
• How the structure of a CNN influences the curvature of the manifold and the performance of the network needs to be investigated.
• An alternative geometric structure is built on the data space VCNN, where a CNN works by disentangling a highly curved input manifold into a flatter output manifold.

Residual networks
• Residual networks are a special case of CNNs, where the curve to reach the target transformation consists of a large number of short line segments.
• Each line segment corresponds to a near-identity transformation on the manifold, which can be achieved by a first-order approximation. The optimal ResNet structure is determined by this first-order approximation.
• The superior performance of ResNets comes from the balanced complexity distribution along the curve and a potentially smaller parameter space defined by the ResNet structure.

Recursive neural networks
• Recursive neural networks have a structure that can be understood as a Lie group exponential mapping if the manifold of transformations is regarded as a Lie group.
• The relatively simple structure of the recursive neural network makes it possible to find the representation mapping and the recursive operation during the training procedure.

Recurrent neural networks
• Recurrent neural networks can be understood as a sheet swept by a string in the data space VRNN.
• A stacked or deep LSTM network realizes a cascaded structure of pairs of coupled strings.
• The grid LSTM network extends the coupled strings into weaved membranes.
• The attention mechanism and the neural Turing machine are coupled structures of heterogeneous subnetworks.
Generative adversarial networks
• Generative adversarial networks are the simultaneous optimization of two curves, leading to UG and UD, in TGAN, where each curve has a similar structure to that of CNNs. The unsupervised learning of the target transformation UG is much more flexible than the supervised learning of UCNN. These features explain the difficulty of training GANs.

• LAPGANs achieve a better performance by both exploring the frequency-domain structure of the training data and constructing the curve UG by simpler local operations on subspaces of the training data.

• GRANs work in a similar way to LAPGANs. The difference is that GRANs construct a self-similarity-like structure which is encoded in the RNN structure of G. Also, GRANs apply only simpler local operations in TGAN.

• InfoGANs reduce the flexibility of UG by setting extra constraints on clusters of data samples.

Equilibrium propagation
• The geodesic shooting algorithm of the template matching is an example of the equilibrium propagation. It provides an intuitive picture of how the equilibrium propagation can fill the gap between the implicit framework of optimizing a cost function and the back-propagation of information in the explicit system.
• Theoretically, the equilibrium propagation can be used as a framework for finding the optimal deep learning system.

REFERENCES

[1] H. W. Lin and M. Tegmark, "Why does deep and cheap learning work so well?," arXiv:1608.08225, 2016.
[2] A. Ashtekar and T. A. Schilling, "Geometrical formulation of quantum mechanics," arXiv:gr-qc/9706069, 1997.
[3] H. Heydari, "Geometric formulation of quantum mechanics," arXiv:1503.00238, 2015.
[4] O. Andersson and H. Heydari, "Geometry of quantum evolution for mixed quantum states," Physica Scripta, arXiv:1312.3360v1, 2014.
[5] P. Levay, "The geometry of entanglement: metrics, connections and the geometric phase," Journal of Physics A: Mathematical and General, arXiv:quant-ph/0306115v1, 2004.
[6] P. Levay, "The twistor geometry of three-qubit entanglement," arXiv:quant-ph/0403060v1, 2003.
[7] M. Van Raamsdonk, "Building up spacetime with quantum entanglement," International Journal of Modern Physics D, arXiv:1005.3035v1, 2010.
[8] D. Stanford and L. Susskind, "Complexity and shock wave geometries," arXiv:1406.2678v2, 2014.
[9] L. Susskind and Y. Zhao, "Switchbacks and the bridge to nowhere," arXiv:1408.2823v1, 2014.
[10] M. A. Nielsen, M. R. Dowling, M. Gu, and A. C. Doherty, "Quantum computation as geometry," Science, vol. 311, p. 1133, 2006.
[11] M. R. Dowling and M. A. Nielsen, "The geometry of quantum computation," Quantum Information and Computation, vol. 8, no. 10, pp. 861–899, 2008.
[12] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, pp. 139–157, 2005.
[13] M. Hernandez, M. Bossa, and S. Olmos, "Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector flows," International Journal of Computer Vision, vol. 85, pp. 291–306, 2009.
[14] M. Lorenzi and X. Pennec, "Geodesics, parallel transport and one-parameter subgroups for diffeomorphic image registration," International Journal of Computer Vision, pp. 12:1–17, 2012.
[15] S. Durrleman, X. Pennec, A. Trouvé, G. Gerig, and N. Ayache, "Spatiotemporal atlas estimation for developmental delay detection in longitudinal datasets," in MICCAI, pp. 297–304, 2009.
[16] M. Bruveris, F. Gay-Balmaz, D. D. Holm, and T. Ratiu, "The momentum map representation of images," Journal of Nonlinear Science, vol. 21, no. 1, pp. 115–150, 2011.
[17] M. Bruveris and D. Holm, "Geometry of image registration: The diffeomorphism group and momentum maps," Lecture Notes, 2013.
[18] D. D. Holm, A. Trouvé, and L. Younes, "The Euler-Poincaré theory of metamorphosis," Quarterly of Applied Mathematics, vol. 67, pp. 661–685, 2009.
[19] F. X. Vialard, L. Risser, D. Rueckert, and D. D. Holm, "Diffeomorphic atlas estimation using geodesic shooting on volumetric images," Annals of the BMVA, vol. 5, pp. 1–12, 2012.
[20] F. X. Vialard, L. Risser, D. Rueckert, and C. J. Cotter, "Diffeomorphic 3D image registration via geodesic shooting using an efficient adjoint calculation," International Journal of Computer Vision, vol. 97, pp. 229–241, 2012.
[21] F. X. Vialard, L. Risser, D. D. Holm, and D. Rueckert, "Diffeomorphic atlas estimation using Karcher mean and geodesic shooting on volumetric images," in Medical Image Understanding and Analysis, 2011.
[22] M. M. Postnikov, Geometry VI: Riemannian Geometry, Encyclopaedia of Mathematical Sciences. Springer, 2001.
[23] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, "Exponential expressivity in deep neural networks through transient chaos," 2016.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," arXiv:1603.05027, 2016.
[26] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," arXiv:1505.00387, 2015.
[27] R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning, "Parsing natural scenes and natural language with recursive neural networks," in Proceedings of the 28th International Conference on Machine Learning, 2011.
[28] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," 2013.
[29] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," arXiv:1506.00019, 2015.
[30] N. Kalchbrenner, I. Danihelka, and A. Graves, "Grid long short-term memory," arXiv:1507.01526, 2015.
[31] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," 2015.
[32] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473, 2014.
[33] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," pp. 2204–2212, 2014.
[34] A. Graves, G. Wayne, and I. Danihelka, "Neural Turing machines," arXiv:1410.5401, 2014.
[35] W. Zaremba and I. Sutskever, "Reinforcement learning neural Turing machines - revised," arXiv:1505.00521, 2015.
[36] B. Scellier and Y. Bengio, "Equilibrium propagation: Bridging the gap between energy-based models and backpropagation," arXiv:1602.05179, 2016.