
Deep neural networks, Tensor networks, and Renormalization group

Submitted By:

HAFIZ ARSLAN HASHIM

2016−12−0005

Supervised By: DR. BABAR AHMED QURESHI

MS Thesis

MAY 2018

Department of Physics

SYED BABAR ALI SCHOOL OF SCIENCE AND ENGINEERING

LAHORE UNIVERSITY OF MANAGEMENT SCIENCES


ABSTRACT

Neural networks are one of the most successful systems that have been used to accomplish machine learning tasks, in particular in the context of unsupervised learning and pattern/feature recognition. Deep neural algorithms consist of multiple layers of operational nodes where the output from each layer acts as an input for the next, deeper layer. Deep neural networks (DNNs) have been used in physical applications as well, in particular for finding ground states of complicated many-body systems. Despite their success in practical machine learning applications, there is little theoretical understanding as to why these algorithms work so efficiently. A physical-process-based understanding of neural networks will not only allow for better algorithms for many-body physical systems but will also help in the construction of better DNN algorithms for specific tasks.

During the MS thesis work, we have explored various models that can connect DNNs to physical systems, with the hope of gaining insight both into the workings of DNNs and into the corresponding physical systems. Recently, it has been proposed that DNNs may be related to the variational renormalization group of Kadanoff, and a map between the two was established. We studied this relationship between restricted Boltzmann machines (RBMs) and the renormalization group for spin systems with many examples. Another avenue which we have investigated is the relationship of DNNs to tensor network (TN) models. The correspondence between TNs and DNNs allows quantifying the expressibility of DNNs for industrial datasets as well as for quantum states. Our main goal is to study this possibility and understand how renormalization and entanglement emerge in this context.


DEDICATION AND ACKNOWLEDGEMENTS

First of all, I would like to thank Allah Almighty for giving me the strength to do this thesis. I would also like to say that this work would not have been possible without the guidance of my supervisor, the support of my family, and the help of my friends.

I am very grateful to my supervisor, Dr. Babar Ahmed Qureshi. At the time of enrolling for the thesis, he appreciated my interest in a particular topic and fully supported and guided me throughout its duration. Without him, I would at the very least have had to change the topic of my MS thesis. Because of his guidance and patience, I was able to complete my research work under his supervision.

Moreover, I am thankful to all of my friends at LUMS, Lahore, and Rahim Yar Khan, especially Hassaan Wasalat, Fawad Masood, Junaid Saif Khan, Asif Nawaz, Hassaan Ahmed, Waqar Ahmed, Usman Rasheed, Yasir Iqbal, Yasir Abbass, and Shania, for all their support and encouragement. I would also like to express my gratitude to the staff of my department, especially Mr. Arshad Maral, for his support.

Finally, I would like to gratefully acknowledge the prayers and encouragement of my family and parents throughout my studies.


AUTHOR’S DECLARATION

LAHORE UNIVERSITY OF MANAGEMENT SCIENCES

Department of Physics

CERTIFICATE

I hereby recommend that the thesis prepared under my supervision by Hafiz Arslan Hashim on Deep neural networks, Tensor networks, and Renormalization group be accepted in partial fulfilment of the requirements for the MS degree.

Supervisor: Dr. Babar Ahmed Qureshi        __________________________

Co-supervisor: Dr. Adam Zaman Chaudhry     __________________________

Recommendation of the Thesis Defense Committee:

Dr. Ata ul Haq            Name ____________  Signature ____________  Date ____________

Dr. Ammar Ahmed Khan      Name ____________  Signature ____________  Date ____________


TABLE OF CONTENTS

List of Figures

1 Introduction
  1.1 Restricted Boltzmann machine
  1.2 Deep neural network in renormalization group perspective
  1.3 Correspondence between RBM and Tensor Networks (MPS)

2 Machine Learning Basics
  2.1 Learning Algorithms
    2.1.1 The Task, T
    2.1.2 The Performance measure, P
    2.1.3 The Experience, E
  2.2 Linear regression
  2.3 Gradient descent algorithm
    2.3.1 Stochastic Gradient Descent
  2.4 Maximum Likelihood Estimation
    2.4.1 Conditional Log-Likelihood
  2.5 Logistic Regression or Classification
  2.6 Overfitting, Underfitting, and Regularization
  2.7 Neural Networks
  2.8 Deep Feed Forward networks
  2.9 Unsupervised Learning for Deep Neural Networks
  2.10 Energy-based models and Restricted Boltzmann Machine
    2.10.1 Energy based models
    2.10.2 Hidden Variables
    2.10.3 Conditional Models
    2.10.4 Boltzmann Machine
    2.10.5 Restricted Boltzmann Machine
    2.10.6 Gibbs Sampling in RBMs
  2.11 Deep Belief Networks

3 Tensor Networks
  3.1 Necessity of Tensor Network?
  3.2 Theory of Tensor Network
    3.2.1 Tensors and tensor networks in tensor network notation
    3.2.2 Wave function as a set of small tensors
    3.2.3 Entanglement entropy and Area-law
    3.2.4 Proven instances and violations of Area-law
    3.2.5 Entanglement spectra
  3.3 Matrix Product States (MPS)
    3.3.1 Some properties
    3.3.2 Singular value decomposition
    3.3.3 MPS construction
    3.3.4 Gauge degrees of freedom
  3.4 1-D Projected Entangled Pair States (PEPS)
  3.5 Examples of MPS

4 Mapping between Variational RG and RBM
  4.1 Renormalization group (RG)
    4.1.1 1D Ising model
    4.1.2 2D Ising model
  4.2 Variational RG
  4.3 Overview of RBM
  4.4 Mapping between Variational RG and RBM
  4.5 Examples
    4.5.1 Ising model in 1D
    4.5.2 Ising model in 2D

5 Correspondence between restricted Boltzmann machine and Tensor network states
  5.1 Transformation of an RBM to TNS
    5.1.1 Direct transformation from RBM to MPS
    5.1.2 Optimum MPS representation of an RBM
    5.1.3 Inference of RBM to MPS mapping
  5.2 Representation of TNS as RBM: sufficient and necessary conditions
    5.2.1 Examples
  5.3 Implication of the RBM-TNS correspondence
    5.3.1 RBM optimization by using tensor network methods
    5.3.2 Unsupervised learning in entanglement perspective
    5.3.3 Entanglement: a measure of effectiveness of deep learning as compared to shallow ones

A MPS Examples in Mathematica
  A.1 W State
  A.2 GHZ State

B Renormalization Group Example and Code description
  B.1 1D Ising model
  B.2 Training DBN for the 2D Ising model

Bibliography

LIST OF FIGURES

1.1 The structure of the restricted Boltzmann machine.

1.2 Transformation of RBM to MPS: (a) graphical notation of the RBM defined by Eq. 1.2; (b) the matrix product state (MPS) in graphical notation. Here the dangling links correspond to the physical variables v_i and A^(i) is a three-index tensor. The thickness of the horizontal link between tensors shows the virtual bond dimension.

2.1 Gradient descent: initialize w randomly, take the gradient at this point, and update the next point according to the gradient. If the gradient is negative, as shown here, the updated value of w will be greater by the factor α∂wJ; if the gradient is positive, the next value of w will be smaller by the same factor. The GD algorithm takes small steps in this way and reaches the minimum.

2.2 Sigmoid function: this function outputs a smooth value between 0 and 1.

2.3 Overfitting and underfitting: (a) underfitting occurs because the data is more complex than a linear model; (b) a cubic model is appropriate for the provided data; (c) overfitting, where a polynomial of degree 5 is used as a model, which is more complex than the given data.

2.4 Effect of λ on the model: (a) the λ value is too large, which drives w → 0; (b) a moderate value of λ is used; (c) when λ → 0, w becomes very large and the model will be more complex than the data.

2.5 Simple models of a neuron: (a) perceptron, which takes weighted binary input and outputs a binary value; (b) sigmoid neuron, which also takes weighted input but outputs a smooth real value between 0 and 1.

2.6 Deep feed-forward neural network: the network consists of l layers and each layer has s_n units.

2.7 Boltzmann machine: an undirected graphical model.

2.8 Restricted Boltzmann machine (RBM): no intra-layer connections, but hidden and visible units can interact with each other.

2.9 Softplus function.

2.10 Markov chain: starting with an example v_t sampled from the empirical probability distribution, sample the hidden layer given the visible layer and then sample the visible layer given the hidden layer; this process goes on and on.

2.11 A Deep Belief Network defined as a generative model: the generative path is from top to bottom with distributions P, while the Q distributions extract multiple features from the input and construct an abstract representation. The top two layers define an RBM.

3.1 Tensor network representation of a state: the tensor is the elementary building block of a tensor network, and a tensor network is a representation of a quantum state (we use the graphical notation for tensor networks, which will become obvious in the coming sections).

3.2 Tensor network diagram examples: (a) Matrix Product State (MPS) for 4 sites with open boundary conditions (OBC); (b) Projected Entangled Pair State (PEPS) for a 4×4 lattice with OBC.

3.3 Area law: the entanglement entropy S of the reduced system A scales with the boundary of the system, not with its volume.

3.4 The physical states lie in a small manifold of Hilbert space: the set of states which obey the area law is an exponentially small corner of the gigantic Hilbert space.

3.5 Tensor representation by diagram: here we use a superscript for the physical index, as shown in (d).

3.6 Tensor graphical notation: these tensor diagrams correspond to Eqs. 3.1, 3.2, 3.3, and 3.4; notice that in (b) and (d) the bond between tensors B and C is thick, which corresponds to a higher bond dimension compared to the other links between the tensors. In practice, we can split and merge any number of indices; in this case, the two indices y, z are merged. There are two and three open indices in diagrams (a) and (b) respectively, while diagrams (c) and (d) have no open indices.

3.7 1D lattice with PBC: trace of the product of 8 matrices, or a lattice of 8 particles with PBC.

3.8 TN representation of a wave function: (a) an MPS, (b) a PEPS, and (c) some tensor network which fulfils the demand.

3.9 Area law in PEPS: reduced 2×2 state |A(α)⟩ and state |B(α)⟩ from a 4×4 PEPS. Each broken bond has dimension D and contributes log D to the entanglement entropy.

3.10 By increasing the bond dimension D of a TN state one can explore a larger region of Hilbert space.

3.11 (a) Infinite-sized MPS with 1- and 2-site unit cells. (b) An efficient way to contract a finite-sized MPS, which can also be applied to an infinite-sized MPS with any boundary condition.

3.12 Singular value decomposition (SVD), Eq. 3.13, in TN notation: here I ≡ {i1, i2, ..., in} and J ≡ {j1, j2, ..., jn}.

3.13 Pictorial presentation of the SVD performed in Eq. 3.16.

3.14 Complete construction of an MPS by SVD.

3.15 AKLT state: each physical site carries a spin-1, which is replaced by two spin-1/2 degrees of freedom (called auxiliary spins). Each right auxiliary spin-1/2 on a site i is entangled with the left spin-1/2 at site i+1. A linear projection operator defined on the auxiliary spins maps them to the physical spins.

4.1 The RG decimation scheme: every spin has 4 nearest neighbours and we sum over half of the spins. The resulting lattice is the same as the original but rotated by 45°.

4.2 In the coarse-graining process, removing degrees of freedom results in a high degree of connectivity. Yellow lines show the nearest-neighbour couplings v1v2 + v2v3 + v3v4 + v4v1, green lines represent the next-nearest-neighbour couplings v1v3 + v2v4, and blue connections depict the four-spin product v1v2v3v4.

4.3 RG flow diagram for the 2D Ising model: there are three fixed points, two stable (K = 0, ∞) and one unstable (Kc). Kc is the phase transition point.

4.4 Block spin transformation: (a) 2×2 blocks are defined on the physical spins v_i to be marginalized; (b) the effective spins h_i after marginalizing the physical degrees of freedom; (c) a side view of the RG procedure, where repeated application of RG transformations produces a series of ensembles one over the other.

4.5 RG and deep learning aspects of the 1D Ising model: (a) a coarse-graining process by the renormalization transformation of the ferromagnetic 1D Ising model. After every RG iteration, half of the spins are marginalized and the lattice spacing doubles. At each level the system is replaced by a new system with relatively fewer degrees of freedom and new couplings K. Through the RG flow equation, the couplings at the previous layer provide the couplings for the next layer. (b) The RG coarse-graining can also be performed by a deep learning architecture where the weights/parameters between the n-th and (n−1)-th hidden layers are given by K^(n−1).

4.6 Deep neural network for the 2D Ising model: (a) a four-layered DBN is constructed with layer sizes of 1600, 400, 100, and 25 spins. The network is trained on samples generated from the 2D Ising model with J = 0.405. (b) The effective receptive field (ERF) of the top layer, which encodes the effect of the input spins on a particular spin in the given layer. Each image is 40×40 and depicts the ERF of a particular spin in the top layer. (c) The ERF gets larger as we move from the bottom to the top layer of the network, which is consistent with successive block spin transformations. (d) Similarly, the ERF for the third layer with 100 spins. (e) Three samples generated from the trained network.

5.1 Correspondence between RBM and TNS: (a) RBM representation as an undirected graphical model as defined by Eq. 5.1. The blue circles denote the visible units v and the gray circles labelled h denote the hidden units; they interact with each other through links drawn as solid lines. (b) MPS described by Eq. 5.5. Each dark blue dot represents a 3-index tensor A^(i). From now on we use hollow circles to denote RBM units and filled ones to indicate tensors. Undirected lines in the RBM represent the link weights, while lines in the MPS denote the bond indices and the thickness of a bond expresses the bond dimension. RBM and TNS are both used to represent complex multi-variable functions, and both can describe any function with arbitrary precision given unlimited resources (unlimited hidden variables or bond dimensions). With limited resources, however, they represent two overlapping but independent regions.

5.2 Stepwise mapping of an RBM to an MPS: (a) an MPS description of the RBM given in Fig. 5.1(a). The light blue circles express the diagonal tensors Γ(i)v at the visible units defined by Eq. 5.2, and the gray circles denote Γ(j)h at the hidden units as defined by Eq. 5.3. The orange diamonds express the matrices M(ij) described in Eq. 5.4. (b) The RBM is divided into nv pieces; for each long-range link, an identity tensor (red ellipse) is inserted to subdivide M(ij) into two matrices. (c) An MPS is obtained from the RBM by summing up all the hidden units belonging to each piece in (b). The number of cuts (links) made by the dashed vertical line equals the bond degrees of freedom of the MPS. (d) The matrix M(41) corresponding to the long-range connection is broken into a product of two matrices M1, M2, represented by the light pink diamonds. The red ellipse shows the product of two identity matrices.

5.3 The optimum MPS representation of an RBM: (a)-(e) depict the stepwise construction. The set Xg is denoted by a light yellow rectangle/triangle and the set Yg by a dark blue rectangle. The set Zg, which provides the conditional independence of Xg and Yg, is represented as a light green rectangle. When the set Zg is given, the RBM function, interpreted as a probability, can be written as a product of two functions, one depending on Xg and the other on Yg. The variables in Zg are defined as the virtual bond of the MPS. The light gray lines show the connections included in the previous tensor. The connections being considered in the current step are denoted by dotted lines; these are the Gt of Algorithm 5.1.2. (f) The resulting MPS.

5.4 (a) The RBM after summing out the entire set of hidden units. (b) The curved lines represent the connections between visible units through hidden units. The whole system is split into two parts Xg and Yg, and the second is further split as Yg = Y1 ∪ Y2, where Y1 contains the visible variables that are directly connected to Xg; an alternative way is also expressed. (c) The MPS provided by this method has a smaller bond dimension compared to Fig. 5.2.

5.5 TNS to RBM transformation: graphical representation of (a) Eq. 5.12 and (b) Eqs. 5.13 and 5.14.

5.6 The matrix A defined in Eq. 5.19 has a special form, represented by the dashed box. The blue dots denote identity matrices and the square boxes represent the left and right matrices. To obtain the RBM parameters we apply the MPS-to-RBM transformation according to Eq. 5.20.

5.7 The RBM architecture used to represent the cluster state. It has local connections; each hidden unit is connected to three visible units.

5.8 Effectiveness of a deep network compared to a shallow one: (a) an RBM and (b) a DBM, both with the same number of visible units nv (blue circles), nh = 3nv hidden units, and 9nv connections. The approach discussed in Sec. 5.1 can be used to represent both architectures as an MPS, with bond dimension D = 2^4 for the DBM and D = 2^2 for the RBM. The dashed rectangle depicts the minimum number of visible units that must be fixed in order to cut the system into two subsystems. The bond dimension shows that the DBM can encode more entanglement than the RBM when an equal number of hidden units and parameters is provided.

B.1 Tracing out even degrees of freedom.

B.2 RG flow in coupling space, depicting the stable and unstable fixed points.

B.3 RG flow: (a) coupling space in a different domain, from 0 to 1; (b) in the presence of an external magnetic field h. The arrows show the flow direction and the blue lines between the two limits (K = 0, ∞) depict the flow; these end up on the vertical h axis where K = 0. The "×" signs on the vertical axis at K = 0 represent the fixed points.

CHAPTER 1

INTRODUCTION

The human brain is jam-packed with neurons, and these neurons are connected together by links. A neuron is the fundamental unit of information processing in the brain. A network of these neurons receives input in the form of signals and generates an output after the signals pass through its middle, or so-called hidden, layers. The brain learns by changing the connection strengths between neurons. It is good at recognizing and identifying things but struggles with tasks like multiplying two matrices or performing complex integrals. Conventional computers, in contrast, are good at carrying out large numbers of calculations, such as multiplying huge sets of numbers, but are very weak at recognition. It turns out that the ability to recognize can be put into computers by mimicking the human brain. The set of methods and techniques in machine learning that mimic the brain is called artificial neural networks (ANNs), sometimes simply called neural networks. An ANN consists of layers of neurons, from a few to many. In general, neural networks with three or fewer layers (input, hidden, and output) are called shallow neural networks, while neural networks with more than three layers are called deep neural networks (DNNs). DNNs are more powerful than shallow networks and have produced groundbreaking advances in automated translation, speech and image recognition, and more. The difference between shallow and deep networks is more than just the number of layers. To make a shallow network learn, one has to do feature engineering: transforming the input in such a way as to obtain an accurate input-to-output mapping. This increases human labor and makes some tasks almost impossible. Deep networks, on the other hand, find a good representation of the raw data by themselves, which enables them to learn better and perform well on tasks that shallow networks cannot handle. This remarkable strength of deep networks is not free of cost: deep networks often require billions of parameters to learn, which makes them computationally expensive compared to shallow networks. But today's computers are good enough to simulate deep networks with many layers.

Despite the remarkable success of DNNs, there is very little theoretical understanding of how they work. Why are DNNs better than shallow networks? What is the underlying mechanism that allows them to find a good representation of data? Recently, Carleo and Troyer [1] used an ANN as a representation of the quantum state of a many-body system. They showed that the restricted Boltzmann machine, a generative artificial neural network model commonly known as an RBM, can be used as a variational wave function for quantum many-body systems at or away from equilibrium. Dong-Ling Deng [2] showed that an RBM can be used to construct topological states. He also studied the quantum entanglement properties of RBMs and proved that short- and long-range RBM states follow the area law and the volume law respectively [3]. Giacomo Torlai and Roger G. Melko [4] showed that one can study the whole thermodynamics of a statistical system using an RBM. Many other efforts have been made in this direction. These developments lead us to some important questions about the representational power of ANNs in the physics context. Is the standard variational wave function approximation less expressive than an RBM? Can we use an RBM to study systems at criticality? In this study, we address these questions in detail and try to understand the underlying working of DNNs and their expressive power for deep learning tasks as well as for quantum states.

The renormalization group (RG) is mathematical machinery that allows systematic investigation of the changes of a physical system as observed at different length scales. We will see how the coarse-graining process in real-space RG is related to DNN algorithms. To be more specific, we will study an exact mapping between variational RG and the RBM [5]. Tensor network methods are widely used for quantum many-body systems; the relationship between RBMs and tensor networks will also be examined closely, and we will try to quantify the expressive power of the RBM from a quantum information perspective [6].

1.1 Restricted Boltzmann machine

The neural network used in this regard is the restricted Boltzmann machine, in short RBM. The RBM was developed by the machine learning community with motivation from statistical physics. It is a generative stochastic neural network, used to automatically detect the inherent patterns in unlabelled data by reconstructing its input. Instead of outputting the right answer, it constantly tries to learn to reconstruct data that is as close as possible to the data it is being trained on. An RBM can learn its input's probability distribution in an unsupervised or supervised manner; therefore, labelled data is not required to train it. This property makes RBMs a good fit for tasks like feature learning, topic modelling, dimensionality reduction, classification, and recommender systems, and one can create DNNs by training them layer by layer in a greedy manner. The RBM is a building block that can be used to create DNNs, for instance deep belief networks (DBNs). It contains one hidden layer and one visible layer; connections exist only between the two layers, in other words there are no intra-layer connections, as shown in Fig. 1.1. Due to this restriction we call it a restricted Boltzmann machine. These connections between units form a bipartite graph, i.e. the pair of layers has symmetric connections. That means we take a sample and update the states of the hidden layer, and then we reconstruct the visible units from the updated hidden units. We go both ways to update the weights instead of one way. The RBM is a member of the family of energy-based models; the energy of this network is given as

E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{i,j} h_j .    (1.1)

As with the Boltzmann distribution in statistical physics, configurations with lower energy have higher probability. RBM learning corresponds to finding an energy function that assigns low energies to the desired configurations and higher energies to inaccurate ones. The goal is to minimize the cross-entropy loss, i.e. to maximize the product of the probabilities assigned to the training set. The energy of a pair of Boolean vectors v and h is given in Eq. 1.1: v corresponds to the visible units, h corresponds to the hidden units, and a, b are the biases of v and h respectively. The system starts by producing random values and over time settles toward thermal equilibrium, producing data whose distribution is close to that of the original dataset. This does not mean that the system eventually settles into the lowest possible energy configuration, but rather that the probability distribution over all configurations settles down. RBMs are two-layered neural networks that form the building blocks of deep belief networks (DBNs): an RBM extracts higher-level features from low-level features, and training RBMs incrementally and stacking them creates a DBN.
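
To make the bookkeeping of Eq. 1.1 concrete, here is a minimal Python/NumPy sketch (not from the thesis; the layer sizes and parameter values are invented) that evaluates the energy of one joint configuration (v, h).

    import numpy as np

    rng = np.random.default_rng(0)
    n_v, n_h = 6, 4                      # hypothetical numbers of visible/hidden units
    a = rng.normal(size=n_v)             # visible biases a_i
    b = rng.normal(size=n_h)             # hidden biases b_j
    W = rng.normal(size=(n_v, n_h))      # couplings W_{i,j}

    def rbm_energy(v, h):
        # Energy of a joint configuration, Eq. 1.1: E = -a.v - b.h - v.W.h
        return -a @ v - b @ h - v @ W @ h

    v = rng.integers(0, 2, size=n_v)     # a random binary visible configuration
    h = rng.integers(0, 2, size=n_h)     # a random binary hidden configuration
    print(rbm_energy(v, h))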


Figure 1.1: The structure of the restricted Boltzmann machine.

1.2 Deep neural network in renormalization group perspective

The renormalization group (RG) allows theoretical understanding of divergent problems in statistical physics and quantum field theory. In statistical physics, the "real-space" renormalization group makes it possible to deal with phase transitions and produces a neat table of critical exponents. Other types of renormalization group procedures exist, but the idea is common to all RG studies: re-express the parameters {K} which define the problem in terms of other (perhaps simpler) parameters {K'}, while keeping constant those physical aspects of the problem which are of interest. This may be achieved through some kind of iterative coarse-graining of short-distance degrees of freedom while studying the physics at long distances. The goal is to extract the relevant features which describe the system at larger length scales while marginalizing over short-distance irrelevant features. In the RG procedure, one starts with the fine-grained system, marginalizes fluctuations at the microscopic level, and proceeds towards the macroscopic level to obtain a coarse-grained system. During this sequence, the features needed to describe the behaviour of the system at the macroscopic level are dubbed relevant operators, while irrelevant operators have a diminishing effect on the system at the macroscopic scale.
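
As a small illustration of such an iterative coarse-graining (the 1D Ising case is worked out in detail in Chapter 4 and Appendix B), the following Python sketch iterates the standard zero-field 1D Ising decimation relation K' = ½ ln cosh(2K); the starting coupling is arbitrary, and the flow runs towards the trivial fixed point K = 0.

    import numpy as np

    def decimate(K):
        # One RG step for the zero-field 1D Ising chain: sum out every other spin
        return 0.5 * np.log(np.cosh(2.0 * K))

    K = 1.0                      # arbitrary starting coupling
    for step in range(8):
        print(step, K)
        K = decimate(K)          # the coupling flows towards the stable fixed point K = 0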

For most problems, it is very difficult to carry out the renormalization group exactly, so the physics community has developed many variational schemes to approximate the RG procedure. One such variational scheme, of the real-space class, was introduced by Kadanoff to carry out RG on a spin system. Kadanoff's RG scheme introduces unknown coupling parameters which couple the coarse-grained system to the fine-grained system. The mapping is carried out by defining auxiliary or hidden spins and coupling them to the physical spins through these unknown parameters. The free energy F_h of the coarse-grained system, which depends on the coupling parameters, is determined from the coupled system by marginalizing or integrating out the physical spins. The coupling parameters are chosen (variationally) in such a way that the difference between the free energies, ∆F = F_h − F_v, of the hidden spin system and the physical spin system becomes minimal. Having ∆F ≈ 0 ensures that the long-distance physical features are encoded in the coarse-grained system. The mathematical structure of the procedure carried out in this way defines an RG transformation which maps the physical spin system into the hidden spin system while preserving the long-distance physics. The next round of renormalization is carried out with the hidden spins serving as the input.

One perspective for understanding the success of DNNs could be the renormalization group (RG). A DNN can be viewed as an iterative coarse-graining process, where each successive layer in the network learns a higher-level, more abstract feature representation of the data. As already said, the output from the previous layer becomes the input for the next layer. The input layer receives low-level features in the form of raw data, and each subsequent layer learns relevant and more abstract features. This process continues until the raw data becomes an abstract concept. Through this feature extraction mechanism, DNNs learn relevant features and at the same time ignore irrelevant ones.

This suggests that DNN algorithms may be applying an RG-like scheme to learn relevant features from data. The ANN model we use to show the relation between RG and DNNs is the RBM. We will see that the RBM has an exact mapping to variational RG [5].

1.3 Correspondence between RBM and Tensor Networks (MPS)

Solving the quantum many-body problem is a challenging task, as the Hilbert space corresponding to the system grows exponentially. The exponential complexity is the result of the non-trivial correlations encoded in the system. To tackle such a difficult problem, variational approaches and sampling methods are usually adopted. Sampling methods include Monte Carlo, while the variational approach includes many examples, from the straightforward mean-field approximation to more involved methods such as those based on tensor network states and, as a recent development, neural network states [1]. An efficient description of the desired quantum many-body state is the first and crucial step of the variational method. A description of a quantum many-body state is efficient if the number of parameters needed to specify the state grows only polynomially with the number of particles or degrees of freedom in the system. After finding an efficient description, one can determine the variational parameters by combining it with powerful learning methods, such as gradient descent.


Figure 1.2: Transformation of RBM to MPS: (a) graphical notation of the RBM defined by Eq. 1.2; (b) the matrix product state (MPS) in graphical notation. Here the dangling links correspond to the physical variables v_i and A^(i) is a three-index tensor. The thickness of the horizontal link between tensors shows the virtual bond dimension.

ANNs are renowned for representing complex probability distributions or complex correlations, and they have recently found prominent success in many AI applications through the popularity of deep learning methods. Numerical results show that an RBM, trained by a reinforcement learning technique, produces quality solutions for a wide variety of many-body models [1]. The quantum state in the (efficient) RBM representation can be written (without normalization factor), referring to Fig. 1.1 and with bias vectors a, b for the visible and hidden layers respectively, as

\Psi_{RBM}(v) = \sum_h e^{-E(v,h)} = \prod_i e^{a_i v_i} \prod_j \left( 1 + e^{b_j + \sum_i v_i W_{ij}} \right) ,    (1.2)

where the energy E(v,h) of the network is given in Eq. 1.1.
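
As a sanity check of Eq. 1.2, the following self-contained Python/NumPy sketch (with invented RBM parameters, not taken from the thesis) verifies numerically that summing e^{-E(v,h)} over all hidden configurations reproduces the closed-form product over hidden units.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    n_v, n_h = 6, 4                          # hypothetical sizes
    a, b = rng.normal(size=n_v), rng.normal(size=n_h)
    W = rng.normal(size=(n_v, n_h))
    v = rng.integers(0, 2, size=n_v)         # one binary visible configuration

    # Brute force: sum e^{-E(v,h)} over all 2^{n_h} hidden configurations
    brute = sum(np.exp(a @ v + b @ h + v @ W @ h)
                for h in map(np.array, product([0, 1], repeat=n_h)))

    # Closed form of Eq. 1.2: prod_i e^{a_i v_i} * prod_j (1 + e^{b_j + sum_i v_i W_ij})
    closed = np.exp(a @ v) * np.prod(1.0 + np.exp(b + v @ W))

    assert np.isclose(brute, closed)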

On the other hand, tensor network (TN) methods provide a parametrization of a wave function that represents a quantum state, exponential in its degrees of freedom, with polynomial resources. Fig. 1.2(b) shows the simplest TN state: the matrix product state (MPS). The MPS-parametrized representation of a wave function of n_v physical spins or variables is given as

\Psi_{RBM}(v) = \mathrm{Tr} \prod_i A^{(i)}[v_i] .    (1.3)


Here A^(i) is a tensor with three indices, as depicted in Fig. 1.2(b). The physical variables v_i are represented by vertical dangling bonds. For a specific value of v_i, A^(i)[v_i] is a matrix. The matrix dimension, commonly known as the virtual bond dimension of the MPS, is depicted by the thickness of the bond between two tensors in Fig. 1.2(b). After specifying the values of the v_i, the rest of the process is just contracting all virtual bonds; taking the trace of the resulting matrix gives a number. This number represents the coefficient (probability amplitude) of the specified physical state. An MPS can enhance its capability to represent complex multivariable functions by increasing the bond dimension.
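
The contraction just described can be written in a few lines; the following Python/NumPy sketch (with random tensors and a hypothetical bond dimension, purely for illustration) evaluates the amplitude of Eq. 1.3 for one configuration of binary physical variables.

    import numpy as np

    rng = np.random.default_rng(1)
    n_v, d, D = 5, 2, 3                          # sites, physical dim, virtual bond dimension (hypothetical)
    # One tensor per site, indexed as A[i][v_i] -> a D x D matrix
    A = [rng.normal(size=(d, D, D)) for _ in range(n_v)]

    def mps_amplitude(v):
        # Coefficient of the configuration v: Tr( A(1)[v_1] A(2)[v_2] ... ), Eq. 1.3
        M = np.eye(D)
        for i, vi in enumerate(v):
            M = M @ A[i][vi]                     # contract the virtual bonds one site at a time
        return np.trace(M)

    print(mps_amplitude([0, 1, 1, 0, 1]))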

Over the previous few decades, a concrete understanding of TNs in both the theoretical and numerical regimes has been established, with many applications. We can also interpret the effectiveness of TNs: it is based on the entanglement entropy (EE) area law, namely that the EE scales with the size of the boundary separating two subsystems. Low-energy states of gapped Hamiltonians follow this area law. As a consequence, the number of degrees of freedom required to describe the physical states (in which we are interested) is exponentially smaller than the entire set of degrees of freedom. TN methods were developed to describe such comparatively low-entanglement states and have achieved exceptional success in the past few years.

RBMs and TNs are similar in their mathematical structure, in particular when expressed in the graphical language shown in Fig. 1.2(a) and (b). The working of DNN methods consists of finding a good representation or extracting features, a tiny fraction of the data compared to the input; examples include PCA, autoencoders, etc. This motivates us to look for a guiding principle from the viewpoint of quantum information: a principle that can be used to quantify the representational power of ANNs for deep learning as well as for physics problems. We will also see that an RBM can be represented as a TN and vice versa; in fact, they are equivalent [6].

The main resources used for this MS thesis are the following: the 2nd chapter is based on the book "Deep Learning" [7] and the paper "Learning Deep Architectures for AI" [8]; the primary resources for the 3rd chapter are the tensor network papers [9, 10]; and the 4th and 5th chapters are based on the papers [5] and [6] respectively.

CHAPTER 2

MACHINE LEARNING BASICS

Deep learning is a particular type of machine learning, so the basic principles of machine learning are crucial to understanding deep learning well. This chapter is devoted to explaining the basic building blocks of machine learning and a particular type of deep learning model, namely energy-based models.

2.1 Learning Algorithms

A machine learning algorithm can be defined loosely in just five words, "using data to answer questions", or, better, we can split the definition into two parts, "using data" and "answer questions". These two pieces broadly outline the two sides of a machine learning algorithm, both equally important. Using data is what we refer to as training, and answering questions is referred to as making predictions or inference. A conventional definition of a learning algorithm is provided by Tom Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." For example, if we are asked to write a learning algorithm to classify emails as spam or not spam, what will be the task T, the performance measure P, and the experience E? In this problem, classifying emails as spam/not spam is the task T, E will be watching classified examples labelled by us as spam/not spam, and P can be the number of emails correctly classified by the learning algorithm. In the upcoming sections we describe E, P, and T with examples.


2.1.1 The Task, T

There are a few AI tasks which can be hand-coded or explicitly programmed, such as finding the shortest distance between two points. Some tasks are very difficult to solve because they require intelligent behaviour and there is no way to perform them by explicit programming. Machine learning not only enables us to solve such problems but also increases our understanding of the principles and basis of intelligence.

The task is somewhat different from learning: developing the ability to do a task is another way to define learning. The "task" is simply a job. For example, if we want a helicopter to fly itself, then flying is the task. We can make the helicopter learn to fly, but it is very difficult to directly write a program (one that specifies the process of flying step by step) for flying.

A task describes the way a machine learning system should process an example: a set of features, provided to the machine learning system for processing, which quantitatively characterize some event or object. We use a vector x ∈ ℝⁿ to denote an example, and x_i represents a feature of an example. A few machine learning tasks are listed below:

• Classification: Classification is simply the process of taking some kind of input x and mapping it to one of k discrete labels, like true/false or 0/1. In a classification task, the learning algorithm is required to produce a function f : ℝⁿ → {1, ..., k}, i.e. a classifier which takes an input and outputs a discrete label. There are different kinds of classification tasks; for example, a binary classification task is predicting whether a tumour is malignant or benign, where each example is a feature vector x consisting of the patient's age, tumour size, etc., and the output is 0/1 or a probability distribution over these two classes. Another example is object recognition, where the input vector is an image and the output is a label which identifies the object in the image.

• Regression: In this machine learning task, a computer program learns to predict a real output y given some input x, where y is continuous. To solve this type of task, the learning algorithm is required to output a function f : ℝⁿ → ℝ. An example of a regression task is a housing-price prediction model, where we are given a set of examples {x, y} and the computer program predicts a real output on some unseen inputs.

• Density or probability estimation: In this task, the density estimation problem, the learning algorithm needs to learn a function p_model : ℝⁿ → ℝ, where p_model is interpreted as a probability mass function (if x is discrete) or a probability density function (if x is continuous) on the example space. It is a bit tricky to define a performance measure for such a task; we will explain this in an upcoming section. The learning algorithm has to learn the underlying structure of the data: it should know where the example space is dense and where it is sparse. Density estimation captures the distribution just by seeing examples. In this type of task we want to approximate the true/empirical probability distribution p_true and hope that p_model ≈ p_true. After learning, we can generate new examples, which also belong to the given example space, by sampling from the approximated distribution.

There are many other tasks that can be defined as machine learning tasks, for example denoising, synthesis and sampling, anomaly detection, imputation of missing values, classification with missing values, etc.

2.1.2 The Performance measure, P

After defining our task T, we need to figure out how to measure the performance P of the learning algorithm; that is, we need a metric. The performance measure depends on the task being carried out by the learning algorithm. For example, in classification problems, accuracy is a commonly used metric. The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points. This begs the question, though, of which data we use to compute the accuracy. What we are really interested in is how well our model will perform on new, unseen data examples. We could compute the accuracy on the data used during the fitting process of the classifier. However, since that data was used to train it, the classifier's performance would not be indicative of how well it can generalize to unseen data. For this reason, it is the usual convention to split the data into two sets, a training set and a test set. We train or fit the classifier on the training set, make predictions on the labelled test set, and compare these predictions with the known labels; this comparison defines the accuracy of the classifier. We can also define the error rate as a performance metric and refer to it as the 0-1 loss: the loss is 0 if an example is correctly classified, and 1 otherwise.
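
As a sketch of this convention (the dataset and the nearest-centroid "classifier" below are invented and only stand in for a real learning algorithm), the following Python/NumPy code holds out a test set and reports the accuracy, i.e. one minus the average 0-1 loss on unseen examples.

    import numpy as np

    rng = np.random.default_rng(2)
    # Made-up binary classification data: two Gaussian blobs
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # Random train/test split
    idx = rng.permutation(len(y))
    train, test = idx[:70], idx[70:]

    # "Training": store the class centroids; "prediction": pick the nearest centroid
    centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((X[test][:, None, :] - centroids) ** 2).sum(axis=2), axis=1)

    accuracy = (pred == y[test]).mean()       # 1 - average 0-1 loss on unseen examples
    print(accuracy)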

Defining a performance metric for a particular task is not straightforward; in fact, it is a very difficult problem in itself. For instance, in the density estimation task we know the quantity we would like to measure, but measuring it is infeasible. It makes no sense to use metrics like accuracy and error rate for a probability estimation task; we need some other type of performance measure which outputs a continuous value for each data example. The common practice is to report the average log-probability the model assigns to the examples. Computing the actual probability distribution, in which every point in the space is assigned a probability value, is often intractable.

2.1.3 The Experience, E

After defining the task T and the performance measure P, let us drill into the experience E of the machine learning process. Usually, there are two major types of machine learning algorithms, supervised and unsupervised. This categorization is based on what sort of experience a machine learning algorithm is allowed to have during the learning process. Almost all learning algorithms experience a whole dataset, i.e. a collection of many, many examples. Many datasets, such as the Iris dataset [11], are publicly available. Another example is the Adult dataset, which consists of 48,842 examples, each with 14 features like age, occupation, capital-gain, capital-loss, etc.

• Unsupervised: learning algorithms learn properties of the structure of the dataset by experiencing the whole dataset. In deep learning, our goal is to approximate the true probability distribution from which the dataset was generated; one example of such a task is density estimation. Clustering is an unsupervised task in which the learning system divides the dataset into clusters, or groups of similar examples. The k-means algorithm is an example of a clustering algorithm.

• Supervised: learning algorithms require datasets with labelled examples, meaning each example has a label or target value, included as a feature, which we need to predict. For example, the MNIST dataset [12] is widely used for hand-written digit recognition systems. It contains 60,000 hand-written digit examples as 28×28 gray-scale pictures with true labels.

The difference between the two types is clear: supervised learning algorithms have supervision in the form of target values, while unsupervised learning algorithms have no supervision. In the former case, the learning algorithm finds optimum parameters that give us a good mapping from input x to output y. In the latter case, the learning algorithm aims to learn the underlying probability distribution of the data and works with an unlabelled dataset.

Although we have defined learning algorithms in two categories, the distinction is not sharp, because supervised tasks can be learned by unsupervised learning techniques and vice versa. Still, it often helps to roughly categorize learning algorithms. In practice, regression and classification tasks are considered supervised, while density estimation is considered unsupervised.

There are other categories of learning algorithms as well. For example, in semi-supervised learning we have a partially labelled dataset, i.e. some examples are labelled and the remaining ones are not. Another example is reinforcement learning, in which the learning system does not experience a fixed whole dataset but instead learns through a feedback mechanism.

There are various ways to represent a dataset; a common one is the design matrix. In this representation of the data we have rows and columns: each row corresponds to an example and each column corresponds to a feature. In a design matrix, all examples have the same size, i.e. the same number of features, but this is not always the case. For instance, if a dataset consists of pictures and each picture has different dimensions, then we have to find another way to represent the data.
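
As a minimal illustration (with invented numbers, not from the thesis), a design matrix is simply a 2-D array with one example per row and one feature per column:

    import numpy as np

    # 3 examples (rows) x 4 features (columns): e.g. age, height, weight, income (invented values)
    X = np.array([
        [25, 170.0, 68.0, 30000.0],
        [40, 165.0, 72.0, 52000.0],
        [31, 180.0, 80.0, 41000.0],
    ])
    print(X.shape)        # (n_examples, n_features) = (3, 4)
    print(X[1])           # the feature vector of the second example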

2.2 Linear regression

In the linear regression problem, the goal or "task T" is to develop a system which takes an example in the form of a vector x ∈ ℝⁿ as input and predicts a scalar y ∈ ℝ as output. In linear regression, as the name implies, our model is a linear function of the input. Let h_w be our model, which predicts the value that should be the output for a given example:

h_w(x) = w^T x = \sum_{i=1}^{n} w_i x_i ,    (2.1)

where w ∈ ℝⁿ is a vector of parameters. In the above equation, each weight in the weighted sum corresponds to the effect of a particular feature, say x_i, on the output y. So, mathematically, learning consists of finding optimum parameters w such that the error between the value predicted by our model h_w and the true value y is small. The performance measure P can be the mean square error on the test set. It should be evaluated on unseen examples; for that purpose we define a design matrix X^(test) with m_test examples whose target values are y^(test). The idea is that we train the model on the training dataset {X^(train), Y^(train)}, and the test dataset {X^(test), Y^(test)} tells us whether the model has learned good parameters for generalization:

J(w) = \frac{1}{2 m_{test}} \sum_i \left( h^{(test)}_i - y^{(test)}_i \right)^2 ,    (2.2)


where J is our mean square error function and m_test is the size of the test dataset.

To find the optimum parameters we need to design a learning algorithm that learns, or improves, the parameters by reducing J through experience of the training dataset. One obvious way of doing this is simply to minimize J by setting ∇_w J = 0 on the training dataset. One can solve this gradient equation analytically and obtain the normal equation:

w = (X^T X)^{-1} X^T Y .    (2.3)

The above solution is nice and elegant but it has limitations: the inverse (X^T X)^{-1} must exist and the feature space should not be too large. The normal-equation solution Eq. 2.3 is preferred when we have a small dataset or feature space; otherwise we need some other way to find the optimum parameters, such as gradient-based optimization. That algorithm is the topic of the next section.
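As a concrete illustration, the following minimal Python/NumPy sketch (with a made-up synthetic dataset; it is not the code used in this thesis) evaluates the normal equation Eq. 2.3:

import numpy as np

# Hypothetical synthetic regression data: y = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

# Normal equation, Eq. 2.3 (np.linalg.solve is used instead of an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # approximately [3, -2]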

2.3 Gradient descent algorithm

Optimization techniques are central to machine learning algorithms. Optimization is defined as maximizing or minimizing some function, and the function to be optimized in a learning algorithm is J(w). Almost every learning algorithm uses some sort of optimization method to find optimum parameters. Most of them use the gradient descent (GD) algorithm, in which the derivative is used to move towards a minimum of the parameter surface. The algorithm consists of just two steps: take the derivative of the error function J(w) and modify the parameters accordingly, as shown in Fig. 2.1. Let us take a concrete example and consider a linear regression model with parameters w. The GD algorithm updates the parameters as

w_j = w_j - \alpha \frac{\partial J(w_0, w_1, \cdots, w_N)}{\partial w_j}; \qquad \forall j \in \{0, 1, 2, \cdots, N\},    (2.4)

where α is the learning rate. All the w_j should be updated simultaneously. The GD algorithm does not guarantee reaching the global minimum; it can get stuck in a local minimum. In that case, one can add additional terms to the GD algorithm, such as a momentum term.
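The update rule Eq. 2.4 for linear regression can be sketched as follows (Python/NumPy, with illustrative synthetic data rather than the thesis code):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # m = 200 examples, n = 3 features
y = X @ np.array([1.0, -2.0, 0.5])       # targets generated by a known linear model

w = np.zeros(3)                          # initialization of the parameters
alpha, m = 0.1, len(y)                   # learning rate and dataset size

for step in range(500):
    grad = X.T @ (X @ w - y) / m         # dJ/dw for the mean-square-error cost
    w = w - alpha * grad                 # simultaneous update of all w_j (Eq. 2.4)

print(w)                                 # converges towards [1, -2, 0.5]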

2.3.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a very important extension of the gradient descent algorithm; almost all of deep learning is powered by it. A big dataset is necessary for good generalization, but at the same time it is computationally expensive.


Figure 2.1: Gradient descent: initialize w randomly and compute the gradient at this point; the next point is then updated according to the gradient. If the gradient is negative, as shown here, the updated value of w will be larger by the amount α ∂_w J. If the gradient is positive, the next value of w will be smaller by the same amount. The GD algorithm takes small steps in this way and reaches the minimum.

The idea is to decompose the cost function into a sum over training examples of a per-example cost function. For instance, the negative conditional log-likelihood of the data can be expressed as

J(w) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, w),    (2.5)

where the per-example loss L is defined as L(x, y, w) = -\log P(y | x, w). For optimization, GD needs to compute

\frac{\partial J(w)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(x^{(i)}, y^{(i)}, w)}{\partial w}.    (2.6)

The computational cost increases linearly with the size of the data, i.e. it is O(m). Notice that the gradient, after this decomposition, is an expectation of per-example gradients, so the expectation can be estimated from a small set of examples. In particular, on each step of training we uniformly sample a small minibatch B = {x^{(1)}, \cdots, x^{(m')}}, where m' is much smaller than the total training size. The gradient estimate is computed as

q = \frac{1}{m'} \sum_{i=1}^{m'} \frac{\partial L(x^{(i)}, y^{(i)}, w)}{\partial w},    (2.7)

and the parameter update as

w = w - \alpha\, q,

where α is the learning rate.

In general, the gradient descent algorithm is slow and unreliable and can get stuck in local minima. But it is still useful because it provides a sufficiently low test error in a


reasonable amount of time. SGD converges quickly, and the cost of a single update is independent of the training size m, so it provides a training cost per update that is constant as a function of m.
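A minimal sketch of the minibatch estimate q of Eq. 2.7 and the update w ← w − αq (Python/NumPy; the dataset and batch size are illustrative assumptions, not thesis code):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=10000)

w = np.zeros(3)
alpha, m_prime = 0.05, 32                         # learning rate and minibatch size m'

for step in range(2000):
    idx = rng.integers(0, len(y), size=m_prime)   # uniformly sampled minibatch B
    Xb, yb = X[idx], y[idx]
    q = Xb.T @ (Xb @ w - yb) / m_prime            # gradient estimate, Eq. 2.7
    w = w - alpha * q                             # parameter update

print(w)   # close to [1, -2, 0.5]; the cost per update never depends on the full m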

2.4 Maximum Likelihood Estimation

Point estimation is a way to predict some quantity of interest, a parameter or a set of parameters, in a single attempt. Sometimes we estimate a function f(x) (assuming there exists a relation between input x and output y such as y = f(x) + ε) instead of the parameters of a function.

Besides these two approaches, there is a principle from which we can derive estimators that work well for different models. The most common one is Maximum Likelihood (ML).

Suppose we are given a dataset X of m examples drawn independently from an unknown but true probability distribution P_{data}(x). Assume that P_{model}(x; w) is a family of distributions over the same space, parametrized by w. It estimates the true probability P_{data}(x) by providing a mapping from a configuration x to a real number. The ML estimator for w is defined as

w_{ML} = \arg\max_{w}\, P_{model}(X; w) = \arg\max_{w} \prod_{i=1}^{m} P_{model}(x^{(i)}; w).    (2.8)

In practice, the product of probabilities is inconvenient for many reasons, such as numerical underflow. Taking the log of the above expression leads to a sum of log probabilities, which is easier to compute and has the same maximizer as the product of probabilities:

w_{ML} = \arg\max_{w} \sum_{i=1}^{m} \log P_{model}(x^{(i)}; w).    (2.9)

The above equation can be transformed into an expectation with respect to the data distribution P_{data} by dividing by m:

w_{ML} = \arg\max_{w}\, \langle \log P_{model}(x; w) \rangle_{x \sim P_{data}}.    (2.10)

One way to think about maximum likelihood estimation is as minimizing the difference between the empirical distribution P_{data} and the model distribution P_{model}, where the difference is measured by the Kullback–Leibler (KL) divergence. This is given by


D_{KL}(P_{data} \,\|\, P_{model}) = \langle \log P_{data}(x) - \log P_{model}(x) \rangle_{x \sim P_{data}}.    (2.11)

Notice that the first term depends only on the data-generating process, not on the model, so we can train the model by just minimizing

-\langle \log P_{model}(x) \rangle_{x \sim P_{data}},    (2.12)

which is the same as maximizing Eq. 2.10. Thus minimizing the negative log-likelihood is equivalent to maximizing the likelihood. We will use this to train the restricted Boltzmann machine.
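As a small numerical illustration (a sketch assuming Gaussian data; not part of the thesis code), the maximum-likelihood estimates of a Gaussian model are the sample mean and standard deviation, and they give a lower negative log-likelihood than other parameter choices:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=5000)     # samples from P_data

def nll(mu, sigma, data):
    # average negative log-likelihood of a Gaussian model P_model(x; mu, sigma)
    return 0.5 * np.log(2 * np.pi * sigma**2) + np.mean((data - mu)**2) / (2 * sigma**2)

mu_ml, sigma_ml = x.mean(), x.std()               # closed-form ML estimates
print(mu_ml, sigma_ml)                            # close to 2.0 and 1.5
print(nll(mu_ml, sigma_ml, x), nll(0.0, 1.0, x))  # the ML parameters give the smaller value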

2.4.1 Conditional Log-Likelihood

We can estimate the conditional probability P(y | x; w) by generalizing maximum likelihood. The conditional probability is fundamental to supervised learning algorithms: P(y | x; w) is interpreted as the probability of y given x, parametrized by w. The estimator is given by

w_{ML} = \arg\max_{w} \sum_{i=1}^{m} \log P_{model}(y^{(i)} | x^{(i)}; w),    (2.13)

where y^{(i)} is the target value of the i-th example.

2.5 Logistic Regression or Classification

The linear regression problem is all about predicting a real value using a linear model. But there are many problems in which one needs to predict discrete values, such as spam/not spam, tumour/not tumour, etc. We cannot simply use the linear regression model with a threshold (output = 1 if h(x) > 0.5, otherwise 0) for such problems. For this classification problem we use the sigmoid function, which provides a smooth value between 0 and 1. The sigmoid function is given as

\sigma(x) = \frac{1}{1 + e^{-x}}.    (2.14)

Consider a dataset X consisting of m examples with target values y, where the task is to predict a binary output. We use the sigmoid function as our model and interpret its output as the probability of the label being 1. For this model the cost function cannot be the same as before, because the mean square error is not a convex function of the parameters in this case. So we define a cost function from the log-likelihood as

J(w) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_w(x^{(i)})\big) \right],    (2.15)

where

h_w(x) = \frac{1}{1 + e^{-w^T x}}.    (2.16)

Figure 2.2: Sigmoid function: this function outputs a smooth value between 0 and 1.

We use the gradient descent algorithm to optimize this cost function. The update rule has exactly the same form as for linear regression (Eq. 2.4), but with a different model h_w. So the algorithm is exactly the same as Eq. 2.4: take the derivative and update the values of the parameters simultaneously.
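A minimal logistic-regression sketch following Eqs. 2.4, 2.15 and 2.16 (Python/NumPy with synthetic labels; variable names are illustrative and not from the thesis code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))               # Eq. 2.14

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float) # binary labels from a linear rule

w = np.zeros(2)
alpha, m = 0.5, len(y)

for step in range(1000):
    h = sigmoid(X @ w)                            # model output, Eq. 2.16
    grad = X.T @ (h - y) / m                      # gradient of the cost Eq. 2.15
    w = w - alpha * grad                          # same update form as Eq. 2.4

print(((sigmoid(X @ w) > 0.5) == y).mean())       # training accuracy close to 1.0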

2.6 Overfitting, Underfitting, and Regularization

The difference between a pure optimization problem and a learning algorithm is the requirement to generalize to new, unseen examples, i.e. the performance on unseen data. In the training process we emphasize reducing the training error, but we care more about the test error. The characteristics of a good machine learning algorithm are its ability to make:

• the training error small;

• the difference between test and training error small.

If a model does not respect the first property, it is said to be underfitting; violation of the second property leads to overfitting, as shown in Fig. 2.3. We can control the fitting ability of a model by shrinking or expanding its hypothesis space: the space of functions which can be solutions. For instance, the set of all linear functions h = b + wx is included in the hypothesis space of linear regression; by increasing the degree of the polynomial we expand the hypothesis space, e.g. h = b + w_1 x + w_2 x^2. So linear models have less capacity compared to quadratic ones.

Figure 2.3: Overfitting and underfitting: (a) underfitting occurs because the data is more complex than a linear model; (b) a cubic model is appropriate for the provided data; and (c) overfitting, where a polynomial of degree 5 is used as a model, which is more complex than the given data.

How can one decide the optimum capacity for a particular model? The principle of parsimony (extended to statistical learning theory by Vapnik [13]) can be used in this regard. It states that "among competing hypotheses that explain known observations equally well, we should choose the simplest one." In practice, we make the capacity of the model sufficiently large and keep regularizing the learning parameters. We modify the error function as

J(w) = \frac{1}{2 m_{train}} \sum_i \left( h^{(train)}_i - y^{(train)}_i \right)^2 + \lambda\, w^T w,    (2.17)

where λ is the regularization constant. It restricts the values of the learning parameters w to an interval and prevents the model from over/underfitting, as shown in Fig. 2.4. The training process consists of minimizing J(w), so if we choose a very large value of λ then w will be very small, and vice versa.

Figure 2.4: Effect of λ on the model: (a) the value of λ is too large, which makes w → 0; (b) a moderate value of λ is used; (c) when λ → 0 then w becomes very large and the model will be more complex than the data.
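The effect of the regularization term in Eq. 2.17 can be sketched with ridge-regularized polynomial regression (Python/NumPy; the data, polynomial degree and λ values are made-up illustrations):

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=30)   # noisy target values

X = np.vander(x, 10, increasing=True)               # degree-9 polynomial features (large capacity)

def ridge_fit(X, y, lam):
    # Closed-form minimizer of a cost of the form Eq. 2.17 (up to a rescaling of lam):
    # (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1e-3, 10.0):        # lam -> 0 tends to overfit, very large lam underfits
    w = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(w))    # larger lam shrinks the learned weights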


2.7 Neural Networks

Biologically inspired neural networks are a programming paradigm which allows a computer to learn from observed data, and deep learning consists of powerful techniques for learning in the neural network domain. To understand neural networks, first look at a single Perceptron. A Perceptron takes a weighted binary input and outputs a binary value. If the weighted sum \sum_{i=1}^{N} w_i x_i is equal to or greater than some value, let us call it the threshold, then the output is 1, otherwise 0, as shown in Fig. 2.5(a). On the other hand, the model which takes a weighted input but outputs a smooth real value between 0 and 1 is called a sigmoid neuron. A sigmoid neuron uses the sigmoid function defined in Eq. 2.14, and its plot is shown in Fig. 2.5(b).

Figure 2.5: Simple neuron models: (a) Perceptron: takes a weighted binary input and outputs a binary value; (b) sigmoid neuron: it also takes a weighted input but outputs a smooth real value between 0 and 1.

A network of Perceptrons can be formed by arranging layers of these basic units. Each unit in a layer takes a weighted input from the previous layer and generates an output which acts as input for the next layer; this network is called a Multi-Layer Perceptron (MLP). Notice that the sigmoid neuron differs from the Perceptron only in its output function, and a network of sigmoid neurons can be constructed in a similar way. In practice, a neural network with sigmoid output functions (activations of the neurons) is more useful than one made of Perceptrons, because many interesting functions which we want to approximate are non-linear. A useful way to think about neural networks is as function approximators. The architecture we have discussed is usually used in a supervised fashion, so target values y are given. Thus, the task of the neural network here is to find the input-to-output mapping y = f(X; W) by finding optimum weights W (here W is a matrix). We will discuss these networks in the next section.

In the case of unsupervised learning, different network architectures are used, for example sigmoid belief networks and Deep Belief Networks (DBNs) from the deep


generative architecture domain. After training, these graphical models are used to generate examples similar to the training data, and the training task in these models is to learn a joint probability distribution over the network. Here we will just explain the DBN and its building block, the RBM.

Neural networks consist of many layers; the number of layers is called the depth of the network, and the number of neurons in a layer is called the width of the network. Questions such as how to choose the depth and width of a network for a particular problem are still not well understood theoretically. The selection of these parameters, the learning rate, and a few others (called hyperparameters) can be done by experimentation on the particular data.

2.8 Deep Feed Forward networks

These networks are used to approximate a function f and find a good mapping between input x and output y. The information flows from the input x through the function f, which consists of many layers, to the output y. There is no feedback mechanism in the model; if a feedback mechanism is defined in the network itself, it is called a recurrent neural network. These networks are powerful at learning non-linear functions and have a large capacity. It is observed that the depth of the network is crucial for learning complex and highly non-linear functions. A simple example is the XOR function, which cannot be learned by linear models. Deep feedforward networks consist of many layers of representations. The layers between the input layer and the output layer are called hidden layers h_i (here we use h for hidden layer, not hypothesis). The network can be represented as an acyclic graph, as shown in Fig. 2.6. For example, if the network consists of two layers h^{(1)} and h^{(2)} connected in a chain, then the function approximator is f(x) = h^{(2)}(h^{(1)}(x)).

Figure 2.6: Deep feedforward neural network: the network consists of l layers and each layer has s_n units.


A usual set of equations for these feedforward multi-layer neural networks is the following:

h^k = \mathrm{sigm}(b^k + W^k h^{k-1}),    (2.18)

where x = h^0, b^k is a bias vector, W^k is the weight matrix for layer k, and sigm is the sigmoid function applied element-wise to a given vector. The top layer h^l is used for prediction and is combined with the true target values y into a loss function L(h^l, y), typically convex in b^l + W^l h^{l-1}. The output layer can have a different activation function from the hidden layers, for example the softmax function,

h^l_i = \frac{e^{b^l_i + W^l_i h^{l-1}}}{\sum_j e^{b^l_j + W^l_j h^{l-1}}}.    (2.19)

Here h^l_i is positive and the units in the layer sum to 1, so it can be interpreted as P(Y = i | x), where Y is the target class corresponding to the input vector x; the loss for this output is L(h^l, y) = -\log P(Y = y | x) = -\log h^l_y. Training consists of minimizing this cost, and the total cost with regularization for a network which consists of l layers and N_Q output units is given as

units is given as

J(W)=− 1m

[m∑

i=1

NQ∑q=1

y(i)q log(hW (x(i)))q + (1− y(i)

q ) log(1−hW (x(i)))q

]+ λ

2m

l−1∑p=1

sn∑i=1

sn+1∑j=1

(W pj,i)

2,

(2.20)

where sn is the number of units in n-th layer and λ is regularization parameter.
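The forward pass of Eq. 2.18 with a softmax output layer Eq. 2.19 can be sketched as follows (Python/NumPy with randomly initialized weights, purely for illustration; this is not the thesis implementation):

import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())               # numerically stable version of Eq. 2.19
    return e / e.sum()

rng = np.random.default_rng(6)
sizes = [4, 5, 3]                         # input width, one hidden layer, N_Q = 3 outputs
W = [rng.normal(scale=0.1, size=(sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
b = [np.zeros(sizes[k + 1]) for k in range(len(sizes) - 1)]

def forward(x):
    h = x                                 # h^0 = x
    for k in range(len(W) - 1):
        h = sigm(b[k] + W[k] @ h)         # hidden layers, Eq. 2.18
    return softmax(b[-1] + W[-1] @ h)     # output layer, Eq. 2.19

x = rng.normal(size=4)
p = forward(x)
print(p, p.sum())                         # interpreted as P(Y = i | x); sums to 1
print(-np.log(p[1]))                      # loss -log P(Y = y | x) for target class y = 1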

This cost function can be optimized by an algorithm called backpropagation, which is an efficient way to calculate the derivatives. Usually, training a deep network is a difficult task due to various challenges; for example, randomly initialized weights in gradient-based learning can result in poor generalization because the optimization gets stuck in local minima. Another problem is the vanishing gradient at the deeper layers: the error surface for the deeper layers becomes flat and the gradient approaches zero.

2.9 Unsupervised Learning for Deep Neural Networks

Layer-wise unsupervised learning has been a key component behind the success of learning in deep architectures so far. If the gradient in a deep architecture becomes less useful as it propagates toward the input layer, then it is natural to believe that a mechanism


(for the gradient) established at a single layer could be used to move its parameters in the right direction, so that one can get rid of the dependency on the unreliable gradient update direction provided by supervised learning. Moreover, the advantage of using unsupervised learning algorithms at each level of a deep neural network is that it can be a way to break the task into sub-tasks corresponding to various levels of abstraction, where each level extracts implicit features from the input distribution. The first layer of the deep architecture learns salient features, but due to the limited capacity of the layer these are dubbed low-level features. This layer becomes the input for the next layer; using the same principle, higher-level features are computed by the second layer, and this process goes on until a high-level abstraction emerges from the input. This criterion also keeps learning local to each layer. These strategies encourage us to discuss deep generative models.

2.10 Energy-based models and Restricted Boltzmann Machine

The Restricted Boltzmann machine is an energy based model and the building block

of DBNs. We will discuss here the mathematical concepts needed to understand these

models, including Gibbs sampling and Contrastive Divergence (CD).

2.10.1 Energy based models

Energy-based models borrow concepts from statistical physics. In these models, a number is assigned to every configuration (state) of the system, and that number is called the energy of the state. The learning task here is to adjust the energy function so that it has the desired properties. The energy E is used to define the probability of a particular configuration as

P(v) = \frac{e^{-E(v)}}{Z},    (2.21)

where low-energy configurations have high probability and Z = \sum_v e^{-E(v)} is the partition function. The above expression also shows that the energy acts in the domain of log probability. For any exponential-family distribution it is easy to calculate the conditional probability distribution; we will see this for the RBM.

A particular form of energy, a sum of terms, is used in the product-of-experts formulation. Each term corresponds to an "expert" f_i, with energy E(v) = \sum_i f_i(v):


P(v) \propto \prod_i P_i(v) \propto \prod_i e^{-f_i(v)}.    (2.22)

Here we use v for the input or visible layer. Each expert can be thought of as a detector of improbable structures/configurations of the input v, or equivalently as imposing a constraint on v. For instance, if f_i can take two values, then it has a small value for configurations that satisfy the constraint and a large value for those that do not. We can compute the gradient of log P(v) in Eq. 2.22 by using one instantiation of the CD algorithm.

2.10.2 Hidden Variables

In many tasks of interest an input v consists of many components v_i, but we do not want to observe all of these, or the aim is to increase the capacity of the model; both can be accomplished by introducing hidden variables into the model. With an observed part v and a hidden part h,

P(v, h) = \frac{e^{-E(v,h)}}{Z},    (2.23)

but since we only observe v,

P(v) = \sum_h \frac{e^{-E(v,h)}}{Z}.    (2.24)

This formulation can be restored to the previous one given in Eq. 2.21 by introducing the free energy F(v) = -\log \sum_h e^{-E(v,h)}, so

P(v) = \frac{e^{-F(v)}}{Z},    (2.25)

but here Z = \sum_v e^{-F(v)}. Thus the free energy F combines the energies E of all hidden configurations in the logarithmic domain.

2.10.3 Conditional Models

Learning algorithms whose models contain a partition function are very difficult to train because of the computation over all configurations in each iteration of training. If the task is to predict a class y given an input v, then it is sufficient to consider just the configurations of the output y for each input v. Usually y belongs to a small set of discrete values. The conditional distribution is given as

P(y | v) = \frac{e^{-E(v,y)}}{\sum_{y'} e^{-E(v,y')}}.    (2.26)


In these models the training process, in particular the gradient of the conditional log-likelihood, is easy (in terms of computational cost) to compute. These concepts are used to implement a type of RBM called the Discriminative RBM [14], which we code in our example. These models give rise to a conditional function that picks a y for a provided input, which is the purpose of the application. Indeed, when y takes values in a small set, P(y | v) is always computable because the normalization of the energy function E runs only over the possible values of y.

2.10.4 Boltzmann Machine

The Boltzmann machine is an extension of the Hopfield network. This specific type of energy-based model contains hidden variables, and the introduction of a restriction in this model leads to an efficient and widely used model called the RBM. The energy of the Boltzmann machine is given as follows:

E(v, h) = -a^T v - b^T h - h^T W v - v^T U v - h^T M h.    (2.27)

Here W is the weight matrix between v and h, U contains the connections among the visible units v, and similarly M connects the hidden variables h. The vectors a, b are bias vectors, and U and M are symmetric. The structure of the Boltzmann machine is shown in Fig. 2.7.

Figure 2.7: Boltzmann machine: an undirected graphical model.

This model involves quadratic terms in h, which is why an analytical calculation of the free energy is not possible. A stochastic estimate can be obtained by Markov Chain Monte Carlo (MCMC) sampling. Starting from Eq. 2.24, the gradient of the log-likelihood is

\frac{\partial \log P(v)}{\partial t} = -\sum_h P(h | v) \frac{\partial E(v,h)}{\partial t} + \sum_{v,h} P(v,h) \frac{\partial E(v,h)}{\partial t},    (2.28)


where t ∈ {a_i, b_j, W_{ij}, U_{ij}, M_{ij}}. Notice that the partial derivatives of the energy are easy to compute. So, if there is a procedure to sample from P(h | v) and from P(v, h), then the gradient of the log-likelihood can be estimated stochastically. In 1986 Hinton and others [15] introduced the terminology of positive phase and negative phase. In the sampling process, the positive phase corresponds to sampling h while v is clamped to the input vector; the sampling of (v, h) from the model itself is called the negative phase. In practice only approximate sampling can be done, for example by constructing an MCMC through an iterative process. Hinton also introduced an MCMC based on Gibbs sampling. Gibbs sampling of N joint random variables can be done in a series of N sampling sub-steps. Consider a set S = (S_1, \cdots, S_N) of N random variables; the sampling sub-step for these is

S_i \sim P(S_i | S_{-i} = s_{-i}).    (2.29)

Here S_i is the variable that we want to sample and S_{-i} are the remaining N-1 variables. A step of the Markov chain is completed after sampling all N variables. Under some conditions, e.g. aperiodicity and irreducibility, the Markov chain converges to P(S) in the limit of infinitely many steps. Gibbs sampling can be performed on the Boltzmann machine if we denote all the visible and hidden units by s and calculate P(s_i | s_{-i}); as we will see in the next section, this conditional probability distribution is just a sigmoid function that takes input from the neighbouring neurons s_{-i}.

Notice that the computational cost of the gradient is very high, as we need to run two Markov chains for each example. This is the reason for the downfall of the Boltzmann machine in the late 80's. In the last decade, the machine learning community found a way out for training the Boltzmann machine: it turns out that short Markov chains are also useful, and this is the principle of CD.

2.10.5 Restricted Boltzmann Machine

The Restricted Boltzmann Machine (RBM) is fundamental to Deep Belief Networks (DBNs) because a DBN is just a stack of RBMs. The RBM is a special type of Boltzmann machine with the restriction that "no intra-layer connections exist for either the visible or the hidden layer", and it can be trained efficiently. Due to this restriction, the hidden units are independent of each other when v is given, and the visible units are independent of each other when h is given. The two layers, hidden and visible, interact with each other through a weight matrix W, as shown in Fig. 2.8. The energy function becomes simpler under this restriction, as follows:

E(v, h) = -a^T v - b^T h - h^T W v.    (2.30)

Figure 2.8: Restricted Boltzmann machine (RBM): no intra-layer connections, but hidden and visible units can interact with each other.

The probability distribution can be calculated with the same formula given in Eq. 2.21. The purpose of training an RBM is to approximate the probability distribution over the observed variables v. The conditional probability distribution can be expressed as a factorial distribution:

P(h | v) = \frac{e^{a^T v + b^T h + h^T W v}}{\sum_{h'} e^{a^T v + b^T h' + h'^T W v}} = \frac{\prod_i e^{b_i h_i + h_i W_i v}}{\prod_i \sum_{h'_i} e^{b_i h'_i + h'_i W_i v}} = \prod_i P(h_i | v).    (2.31)

In the usual models (where h_i is binary), we end up with the familiar neuron activation function:

P(h_i = 1 | v) = \frac{e^{b_i + W_i v}}{1 + e^{b_i + W_i v}} = \mathrm{sigm}(b_i + W_i v),    (2.32)

where W_i corresponds to the i-th row of W. A similar calculation can be done for the probability of v when h is given, with the analogous result

P(v | h) = \prod_j P(v_j | h),    (2.33)

and for the binary units

P(v_j = 1 | h) = \mathrm{sigm}(a_j + W_{.j}^T h).    (2.34)

Here W_{.j} corresponds to the j-th column of W. Binary units are useful when the task is to approximate binomial distributions, such as for gray-scale pictures. For continuous-valued datasets, Gaussian input units are recommended. The RBM is less efficient for some distributions compared to the general Boltzmann machine, but the RBM can still approximate any discrete distribution when sufficiently many hidden units are provided. Interestingly, it can be shown that adding extra hidden units to an already learned RBM can always improve the results.

Figure 2.9: Softplus function.

We can write the probability distribution in terms of the free energy; in that form the training process is more intuitive. Starting from the marginal distribution,

P(v) = \sum_h P(v, h) = \sum_h \frac{e^{-E(v,h)}}{Z} = \frac{e^{a^T v + \sum_j \mathrm{softplus}(b_j + W_j v)}}{Z} = \frac{e^{-F(v)}}{Z},    (2.35)

where softplus(x) = \log(1 + e^x) is shown in Fig. 2.9 and the free energy is

F(v) = -a^T v - \sum_j \mathrm{softplus}(b_j + W_j v).    (2.36)

From the above equation it is clear that training corresponds to finding rows of W, along with their biases, such that the softplus terms tend to be high. The free energy expression lets us see what the RBM needs to do to give certain inputs high probability, i.e. what it needs to do to make the training set more likely, making P(v) large on the training dataset. To train the RBM we need to compute the gradient of the log-likelihood, Eq. 2.28.
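A small sketch of the free energy Eq. 2.36 (Python/NumPy with random parameters, for illustration only):

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))            # log(1 + e^x)

def free_energy(v, W, a, b):
    # Eq. 2.36: F(v) = -a^T v - sum_j softplus(b_j + W_j v)
    return -a @ v - softplus(b + W @ v).sum()

rng = np.random.default_rng(7)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_hid, n_vis))
a, b = np.zeros(n_vis), np.zeros(n_hid)

v = rng.integers(0, 2, size=n_vis).astype(float)
# log P(v) = -F(v) - log Z; the intractable log Z is the same for every v,
# so differences in free energy compare the probabilities of different inputs.
print(free_energy(v, W, a, b))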

2.10.6 Gibbs Sampling in RBMs

Sampling from an RBM is useful in the learning process, as it is used to estimate the gradient of the log-likelihood, and the generated samples can be used to inspect whether the RBM has captured the underlying data distribution or not. As we will see, sampling from an RBM also allows us to sample from a DBN.

Figure 2.10: Markov chain: starting with an example v_t sampled from the empirical probability distribution, sample the hidden layer given the visible layer and then sample the visible layer given the hidden layer; this process goes on and on.

Gibbs sampling in the unrestricted Boltzmann machine is not efficient, because the number of sub-steps in the Gibbs chain is proportional to the number of units in the network. In contrast, for the RBM we have an analytical expression for the positive phase because of the factorization, and the visible and hidden layers can be sampled in two sub-steps of the chain. Starting with an example v_t from the given dataset, sample h_1 with probability P(h_1 | v_t); then repeat this step, but with the hidden layer h_1 given, sampling v_1. This process is shown in Fig. 2.10. Experiments show that starting the Markov chain with an example from the dataset makes the convergence of the chain faster. This also makes sense, as we want the probability distribution of our model to be similar to the empirical probability distribution of the data.

In the Contrastive Divergence (CD) algorithm we use this approach and take the negative sample after just k steps of the Markov chain; the CD algorithm with k = 1 is given in Algorithm 1. As far as the code implementation of the RBM is concerned, we have used the Matlab code given at [18].
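For illustration, one CD-1 parameter update of a binary RBM can be sketched in Python/NumPy as follows (this only mirrors the steps of Algorithm 1 with hypothetical variable names; it is not the Matlab code of [18] that was actually used):

import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v1, W, a, b, alpha, rng):
    # Positive phase: sample h1 ~ P(h | v1), Eq. 2.32
    p_h1 = sigm(b + W @ v1)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    # Negative phase: one Gibbs step, v2 ~ P(v | h1), then P(h2 = 1 | v2)
    p_v2 = sigm(a + W.T @ h1)
    v2 = (rng.random(p_v2.shape) < p_v2).astype(float)
    p_h2 = sigm(b + W @ v2)
    # Parameter updates (lines 19-21 of Algorithm 1)
    W += alpha * (np.outer(h1, v1) - np.outer(p_h2, v2))
    a += alpha * (v1 - v2)
    b += alpha * (h1 - p_h2)
    return W, a, b

rng = np.random.default_rng(8)
n_vis, n_hid = 6, 3
W = rng.normal(scale=0.01, size=(n_hid, n_vis))
a, b = np.zeros(n_vis), np.zeros(n_hid)
v1 = rng.integers(0, 2, size=n_vis).astype(float)
W, a, b = cd1_update(v1, W, a, b, alpha=0.1, rng=rng)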

2.11 Deep Belief Networks

A deep belief network (DBN) is a stack of RBMs which models the joint distribution over the visible layer and many hidden layers [17]. The probability distribution of a DBN which consists of l layers, one visible and l-1 hidden, is given by

P(v, h^1, \cdots, h^l) = \left( \prod_{k=0}^{l-2} P(h^k | h^{k+1}) \right) P(h^{l-1}, h^l),    (2.37)

where v = h^0, P(h^{l-1}, h^l) is the distribution of the RBM at the top, and P(h^k | h^{k+1}) is the conditional probability of h^k given h^{k+1}. The DBN structure is depicted in Fig. 2.11.

Algorithm 1 Parameter update procedure (CD-1) for a binomial RBM.
1: RBMupdate(v_1, α, W, a, b)
2: v_1: an example from the empirical distribution
3: α: learning rate
4: W: weight matrix of size N_hidden × N_visible for the RBM
5: a: bias vector for the visible units
6: b: bias vector for the hidden units
7: Notation: H(h_2 = 1 | v_2) is the vector with elements H(h_{2i} = 1 | v_2)
8: for all hidden units i do
9:    compute H(h_{1i} = 1 | v_1) from sigm(b_i + Σ_j W_{ij} v_{1j})
10:   sample h_{1i} ∈ {0, 1} from H(h_{1i} = 1 | v_1)
11: end for
12: for all visible units j do
13:   compute P(v_{2j} = 1 | h_1) from sigm(a_j + Σ_i W_{ij} h_{1i})
14:   sample v_{2j} ∈ {0, 1} from P(v_{2j} = 1 | h_1)
15: end for
16: for all hidden units i do
17:   compute H(h_{2i} = 1 | v_2) from sigm(b_i + Σ_j W_{ij} v_{2j})
18: end for
19: W ← W + α (h_1 v_1^T − H(h_2 = 1 | v_2) v_2^T)
20: a ← a + α (v_1 − v_2)
21: b ← b + α (h_1 − H(h_2 = 1 | v_2))

The conditional probability distributions and the RBM at the top define the generative model. From now on we use Q for the approximate or exact posterior of this generative model, which is used for inference and training. The posterior Q is equal to the true distribution P at the top, because the top two layers form an RBM, while it is approximate for the other layers. For a code implementation of the DBN, the interested reader can use [18], where a DBN is trained on the MNIST dataset; the code is easy to modify for a particular purpose.
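Greedy layer-wise training of a DBN can be sketched as follows (a Python/NumPy toy sketch in which train_rbm is a hypothetical, very small CD-1 trainer; it only illustrates the stacking idea and is unrelated to the code of [18]):

import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hid, alpha=0.1, epochs=5, rng=None):
    # Tiny CD-1 trainer; data has shape (m, n_vis) with binary entries
    rng = rng if rng is not None else np.random.default_rng(0)
    n_vis = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_hid, n_vis))
    a, b = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        for v1 in data:
            p_h1 = sigm(b + W @ v1)
            h1 = (rng.random(n_hid) < p_h1).astype(float)
            v2 = (rng.random(n_vis) < sigm(a + W.T @ h1)).astype(float)
            p_h2 = sigm(b + W @ v2)
            W += alpha * (np.outer(h1, v1) - np.outer(p_h2, v2))
            a += alpha * (v1 - v2)
            b += alpha * (h1 - p_h2)
    return W, a, b

def train_dbn(data, layer_sizes, rng=None):
    # Greedy layer-wise training: each RBM is trained on the (deterministic)
    # hidden representation produced by the previous one.
    rng = rng if rng is not None else np.random.default_rng(0)
    stack, h = [], data
    for n_hid in layer_sizes:
        W, a, b = train_rbm(h, n_hid, rng=rng)
        stack.append((W, a, b))
        h = sigm(h @ W.T + b)          # upward pass Q(h | v)
    return stack

rng = np.random.default_rng(9)
toy_data = rng.integers(0, 2, size=(50, 8)).astype(float)
dbn = train_dbn(toy_data, layer_sizes=[6, 4], rng=rng)
print([W.shape for W, a, b in dbn])    # [(6, 8), (4, 6)]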


Figure 2.11: A Deep Belief Network is defined as a generative model; the generative path is from top to bottom with distributions P, and the Q distributions extract multiple features from the input and construct an abstract representation. The top two layers define an RBM.


CHAPTER 3

TENSOR NETWORKS

Tensor networks (TNs) are motivated by systems in condensed matter physics, high energy physics, and also quantum chemistry, and have produced several benchmark results in many directions. The TN has become an essential computational tool in the study of quantum many-body systems. It also opens up new directions, such as its relation to the AdS/CFT correspondence in quantum gravity and the holographic principle. Nowadays, TN is a rapidly growing field in which researchers aim to study complex quantum systems from the point of view of their entanglement, trying to understand what entanglement theory can teach us about the structure of these systems and how we can use it to understand these systems better.

In all these fields we have many-particle systems, and these particles interact with each other quantum mechanically. So we will not be able, in a general setting, to describe these systems by simply studying the quantum mechanics of a single particle. We will have to consider all these particles together, the way they interact and build up the joint quantum state. If we really want to address this problem fully, we will have to consider the quantum correlations between particles, essentially the entanglement between these particles. A problem like that, where we have no extra information, is a really daunting task. In a many-particle system, the number of degrees of freedom grows exponentially with the number of particles. It seems there is no structure at all in that, but of course that is not the case. In real matter there is structure: particles which are close by interact strongly, while particles which are distant from each other interact very weakly. So there is some notion of locality and spatial structure. And


it is really this interplay which allows us to say something about how these kinds of systems behave, and leads us to the fact that there is some structure in these systems, in the type of entanglement they display. This makes the entanglement special, despite being complex quantum entanglement, and allows us to study such systems from the point of view of entanglement.

For most of the matter around us we do not really have to think much about the quantum interactions, as entanglement is not that important in most of these systems. Many systems are usually described by what is called mean field theory, and mean field theory basically neglects entanglement; one can think of entanglement as a corrective layer on top of something which is not entangled, or not entangled in a non-trivial way. This approach is vastly successful; it describes most of the systems we have around us. One reason could be that many of the systems around us are at high temperatures compared to the relevant interaction strengths. At high temperature the states become quite mixed and, in that case, entanglement becomes less and less important compared to the mixedness of the state. We are not interested in systems like that. Here we would like to look at systems where the quantum correlations play an essential role, which we cannot understand by neglecting entanglement. We are most interested in the regime of low temperature, and the reason is essentially that in higher-temperature systems the state of the system becomes more mixed and less entangled. If we have a state at infinite temperature, i.e. a maximally mixed state with hardly any entanglement, the state has no structure; everything is completely random. Therefore we would naturally expect that the lower we choose our temperature, the more quantum the system will behave. So, to see the most interesting physics we should look at very low temperatures, or maybe ideally at the ground state.

We will try to focus on a simpler kind of system, which is also closer to quantum information theory but still displays many of the features we are interested in, especially those which make quantum matter special: quantum spin systems on a lattice.

Understanding a many-body system quantum mechanically may be the most challenging and difficult task in many-body physics. For example, phase transition phenomena outside Landau's regime have also proven very hard to grasp, involving new exotic phases of matter. A few examples are the topological phases of matter, where a particular structure of entanglement is spread throughout the system, and quantum spin liquids, phases of matter that occur without breaking any symmetry.

The usual approach to understanding these systems consists of proposing toy models that are supposed to replicate the interactions relevant to the observed physics, for instance the Hubbard model in the specific case of high-temperature superconductors. After proposing such a simple model, and excluding some special cases where these models have an exact solution, one has to rely on numerical techniques to solve the problem.

Figure 3.1: Tensor network representation of a state: the tensor is the elementary building block of a tensor network, and a tensor network is a representation of a quantum state (we use the graphical notation for a tensor network, which will become obvious in the coming sections).

In this chapter we will explain the basic concepts of tensor network methods and mainly focus on the matrix product state (MPS). After defining the MPS, we will discuss some of its properties. Two methods of constructing the MPS will be presented, namely the singular value decomposition and Projected Entangled Pair States (PEPS). In the end, we will discuss some examples for concreteness.

3.1 Necessity of Tensor Network?

One can ask the obvious question of why we need tensor networks at all, given the various numerical methods available to study strongly interacting systems. There is no unique answer to this question; there could be several reasons to discuss the importance of tensor networks, but here we mention four major ones:

• New limitations for classical simulations: All the numerical techniques developed so far have their own limitations. A few of these are: exact diagonalization of the quantum Hamiltonian is limited to very small systems, say 12 or 14 spin-1/2 particles on a regular computer, so studying quantum phase transitions becomes a dream within this method; mean field theory is restricted to incorporating the effect of classical correlations, not quantum correlations; one cannot study frustrated or fermionic quantum spin systems using quantum Monte Carlo due to the sign problem; and the strength of DFT depends on the modelling of the correlations and exchange interactions between electrons. These are just a few examples.

TN methods have limitations as well, but their limitations are very different from those of the existing classical methods: they are set by the structure and amount of entanglement in the state of the quantum system. Within these new boundaries, one can simulate a range of models with a classical computer.

Figure 3.2: Tensor network diagram examples: (a) Matrix Product State (MPS) for 4 sites with open boundary conditions (OBC); and (b) Projected Entangled Pair State (PEPS) for a 4×4 lattice with OBC.

• Graphical language for (many-body) physics: The description of a quantum state is entirely different from the usual one: TN states encode the structure of the quantum entanglement. Instead of considering a messy set of equations, we deal with tensor network diagrams as shown in Fig. 3.2. It has been realized that the diagrammatic approach provides a natural language to describe quantum states of matter, even those which cannot be expressed within Landau's picture, such as topologically-ordered states and quantum spin liquids. This new language for quantum many-body systems provides an intuitive and visual description of a system, as well as new ideas and results.

• Entanglement structure: The usual approach to describing a quantum state does not allow us to visualize the structure and amount of the entanglement among its constituents. The structure of the entanglement is expected to depend on the dimensionality of the system, i.e. it will be different for a 1D system, a 2D system, and so on. It also depends on the state of the system, for instance whether it is at criticality, and on its correlation length. The usual way of representing a quantum state does not allow us to retrieve such properties. So it is nice to have a way of describing quantum states where these properties and this information are clear and easy to access.

Figure 3.3: Area law: the entanglement entropy S of the reduced system A scales with the boundary of the system, not with its volume.

To a certain extent, one can think of a TN state as a quantum state written in a particular entanglement representation. Different states of the system have different representations, and the effective lattice geometry in which the quantum state really lives emerges as a result of the correlations in the network. At this point this is a subtle property, but in practice, because of this fascinating idea, a number of works have proposed that the pattern of entanglement occurring in the quantum state gives rise to a geometry and curvature (and hence gravity). This property of TNs has also established a link between machine learning and holographic geometry: it has been proposed that a geometry which respects holography can appear from deep learning when we train on the entanglement features of a quantum many-body state [19]. In this connection, entanglement is encoded in the neural network (of course motivated by TN). Here we simply note that the language of TN is clearly the right one to follow up such connections.

• Exponentially large Hilbert space ("curse of dimensionality"): This is the major answer to the question of why TNs are a natural description of the quantum many-body state. One of the biggest hurdles to the numerical and theoretical study of the quantum many-body system is the curse of dimensionality, i.e. the Hilbert space of quantum states grows exponentially. In general,


this curse puts limits on the efficient description of states (by efficient we mean that the number of parameters grows polynomially with the number of particles), and their study becomes intractable. For example, the size of the Hilbert space is 2^N for a system of N spins-1/2, so representing a quantum state in the usual way is inefficient. One can imagine the size of the Hilbert space for a system of the order of Avogadro's number of particles.

Fortunately, a number of important Hamiltonians in nature have locality of interactions among the particles in the system. As a consequence of this fact, a few quantum states are much more relevant than the others. To be more specific, one can prove that the ground state or low-energy eigenstates of gapped Hamiltonians (those with a finite difference between the energies of the ground state and the excited states) with local interactions obey the area-law for the entanglement entropy [20], as shown in Fig. 3.3. This is a remarkable property: it says that the entanglement entropy of a reduced system scales with the area of the boundary of the system and not with its volume, as opposed to a state chosen randomly from the Hilbert space, which follows a volume law. So we can use this law as a guide for the low-energy states of realistic Hamiltonians: because they are restricted by locality, they must obey the entanglement area-law. In addition, the size of the manifold containing these states is exponentially small [21], a corner of the huge Hilbert space as depicted in Fig. 3.4. If one aims to study the relevant states from this corner, then it is better to devise a tool which targets this manifold instead of exploring the whole Hilbert space. The good news is that there is a family of TN states that targets this relevant corner of the Hilbert space, hand in hand with renormalization group (RG) methods. That is why it is natural to devise RG methods that keep the relevant degrees of freedom while ignoring the irrelevant ones, and which are for this reason based on TN states.

3.2 Theory of Tensor Network

In this section we will see the TN representation in terms of diagrams and define a TN state mathematically as well as diagrammatically. We will also discuss the computational complexity of TNs and explain how entanglement is related to the bond dimension.


Figure 3.4: Physical states in a small manifold of the Hilbert space: the set of states which obey the area-law is an exponentially small corner of the gigantic Hilbert space.

Figure 3.5: Tensor representation by diagrams: here we use a superscript for the physical index, as shown in (d).

3.2.1 Tensors and tensor networks in tensor network notation

A tensor is defined, for our purposes, as a multidimensional array of complex numbers. The rank of a tensor corresponds to its number of indices. Hence tensors of rank 0, 1 and 2 are a scalar, a vector and a matrix, respectively, as shown in Fig. 3.5. Here we represent a tensor as a bubble, and the number of legs attached to the bubble is the rank of the tensor.

Index contraction is one of the important operations when dealing with TNs. It amounts to summing over all possible values of the repeated indices shared by a collection of tensors. For example, the product of two matrices,

F_{i,j} = \sum_{k=1}^{D} A_{i,k} B_{k,j},    (3.1)

is a sum over the repeated index k, which can take D possible values. The same is true for a collection of tensors, as in

E_{i,j,k} = \sum_{x,y,z=1}^{D} A_{x,i} B_{x,j,k,y,z} C_{y,z},    (3.2)

where the repeated indices x, y, z could in general take different ranges of values; for simplicity we have given all of them D values. In Eq. 3.1 the result of the contraction is a tensor with the open indices i and j, the indices left behind after the contraction. Usually the outcome of a contraction is a tensor. A tensor network (TN) is a collection of tensors with some contraction pattern, where contraction over all of its indices results in a scalar and contraction over only some of its indices outputs a tensor. For example,

F = \sum_{k=1}^{D} A_k B_k,    (3.3)

E = \sum_{x,y,z=1}^{D} A_x B_{x,y,z} C_{y,z},    (3.4)

where F and E in Eq. 3.3 and Eq. 3.4 are complex numbers resulting from the contraction of all indices. In practice we will see that any number of indices can be split and merged without changing the values associated with a particular index. The graphical notation for the tensor networks in Eqs. 3.1, 3.2, 3.3, and 3.4 is given in Fig. 3.6.

Figure 3.6: Tensor graphical notation: these tensor diagrams correspond to Eqs. 3.1, 3.2, 3.3 and 3.4. Notice that in (b) and (d) the bond between tensors B and C is thick, which corresponds to a higher bond dimension compared to the other links between the tensors. In practice we can split and merge any number of indices; in this case the two indices y, z are merged. There are two and three open indices in diagrams (a) and (b) respectively, while diagrams (c) and (d) have no open indices.
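The contractions in Eqs. 3.1–3.4 can be written directly with numpy.einsum, which sums over the repeated indices (a short illustrative sketch with random tensors of bond dimension D = 3):

import numpy as np

D = 3
rng = np.random.default_rng(10)

# Eq. 3.1: matrix product, contraction over the repeated index k
A2, B2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
F_ij = np.einsum('ik,kj->ij', A2, B2)            # same as A2 @ B2

# Eq. 3.2: three tensors contracted over x, y, z; open indices i, j, k remain
A = rng.normal(size=(D, D))                      # A_{x,i}
B = rng.normal(size=(D, D, D, D, D))             # B_{x,j,k,y,z}
C = rng.normal(size=(D, D))                      # C_{y,z}
E_ijk = np.einsum('xi,xjkyz,yz->ijk', A, B, C)

# Eqs. 3.3 and 3.4: contracting all indices gives a scalar
a_vec, b_vec = rng.normal(size=D), rng.normal(size=D)
F = np.einsum('k,k->', a_vec, b_vec)
B3 = rng.normal(size=(D, D, D))                  # B_{x,y,z}
E = np.einsum('x,xyz,yz->', a_vec, B3, C)
print(F_ij.shape, E_ijk.shape, F, E)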

Manipulating a TN by using diagrams is quite easy compared to manipulating a complicated set of equations. For example, a 1D lattice of 8 particles with periodic boundary conditions (PBC) is shown in TN notation in Fig. 3.7, which in mathematical language is the trace of a product of 8 matrices. One can compare the language of TN diagrams to Feynman diagrams in QFT: intuitive, allowing visualization, easy to use, and many properties are prominent in the diagram, such as the cyclic property or PBC of a lattice. Having described the graphical notation of TNs, we will use it in the rest of the chapter.


Figure 3.7: 1D lattice with PBC: the trace of a product of 8 matrices, or a lattice of 8 particles with PBC.

One important question should be mentioned here, which is also a criterion for evaluating whether a TN representation of a state is efficient: is there a way to contract the TN efficiently? The importance of this question lies in the fact that if we cannot contract a TN efficiently, then that TN is worth little. Also, there is no unique way to contract a TN: whether one starts from the centre of the network or uses the left corner of the network as a starting point, the end result will be the same, but the number of operations required depends on the order in which the network indices are contracted. An efficient TN has at least one order in which the network contraction is optimal.
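The dependence of the cost on the contraction order can already be seen for a chain of three matrices (an illustrative sketch; the dimensions are chosen to make the difference obvious):

import numpy as np

rng = np.random.default_rng(11)
A = rng.normal(size=(2, 1000))        # 2 x 1000
B = rng.normal(size=(1000, 1000))     # 1000 x 1000
C = rng.normal(size=(1000, 1000))     # 1000 x 1000

# Both contraction orders give the same 2 x 1000 result ...
R1 = (A @ B) @ C
R2 = A @ (B @ C)
print(np.allclose(R1, R2))            # True

# ... but the operation counts differ enormously:
# (A B) C : 2*1000*1000 + 2*1000*1000      ~ 4 * 10^6 multiplications
# A (B C) : 1000*1000*1000 + 2*1000*1000   ~ 10^9 multiplications
# For a large TN, choosing a good contraction order is therefore essential.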

3.2.2 Wave function as a set of small tensors

Let us represent a quantum state in the TN notation. Consider a quantum system of N particles, where each particle has p levels in some individual basis |i_r⟩, with r = 1, 2, ..., N labelling the particles and i_r = 1, 2, ..., p. For instance, p = 2 for a spin-1/2 particle. The wave function |Ψ⟩ of this system can be written as

|\Psi\rangle = \sum_{i_1, i_2, \ldots, i_N} C_{i_1, i_2, \ldots, i_N} |i_1\rangle \otimes |i_2\rangle \otimes \cdots \otimes |i_N\rangle,    (3.5)

where C_{i_1, i_2, \ldots, i_N} contains p^N complex values. The quantum many-body state |Ψ⟩ is built from the tensor product of the individual states of each particle.

The big set of coefficients C_{i_1, i_2, \ldots, i_N} can be thought of as a big tensor with p^N entries, because its rank is N, with indices i_1, i_2, ..., i_N, and each index can take p values. So the number of coefficients needed to describe the wave function is exponential in the system size; the so-called "curse of dimensionality" strikes again. One of the main purposes of the TN approach is to tackle this problem and provide a way to represent quantum states efficiently. In particular, the number of coefficients required to specify a quantum state should be polynomial in the system size N. TN methods meet this challenge by providing a description adapted to the expected entanglement properties of the quantum state. This is attained by replacing the big tensor C by much smaller ones, as shown in Fig. 3.8.

Figure 3.8: TN representation of a wave function: (a) is an MPS, (b) is a PEPS, and (c) is some other tensor network which fulfils the requirement.

The total number of parameters g_{tot} required to specify a quantum state is the sum of the numbers of parameters g(t) of the individual tensors, which in turn depend on the rank of each tensor and on the number of values each of its indices can take:

g_{tot} = \sum_{t=1}^{N_{tens}} g(t) = \sum_{t=1}^{N_{tens}} O\!\left( \prod_{a_t=1}^{\mathrm{rank}(t)} D(a_t) \right),    (3.6)

where rank(t) is the number of indices of tensor t and D(a_t) is the number of values corresponding to index a_t. Let us define D_t as the maximum over the D(a_t) for a given tensor. Then

g_{tot} = \sum_{t=1}^{N_{tens}} O\big(D_t^{\,\mathrm{rank}(t)}\big) = O\big(\mathrm{poly}(N)\,\mathrm{poly}(D)\big),    (3.7)

where D is the maximum of the D_t, and we have assumed that the number of indices of each tensor is bounded by some constant.

One example, given in Fig. 3.8(a), is the MPS with PBC, which will be discussed in detail. In an MPS the number of parameters is just O(N p D^2), if we take the open indices of the MPS to run over p values and the remaining ones over up to D values. Of course, there are still p^N coefficients (corresponding to a rank-N tensor) after contracting the TN. But here the magic comes in: in the TN description these p^N coefficients are not independent; in fact, they are obtained by the contraction of a specific TN and hence have a structure. This structure is a result of the extra degrees of freedom required to "glue" the small tensors together into a TN. These degrees of freedom have a crucial physical meaning: the structure of quantum entanglement in the state |Ψ⟩ is represented by them, and the number of values they can take is a quantitative measure of the entanglement or quantum correlations in |Ψ⟩. These degrees of freedom, or indices, are called bonds, and the number of values a particular bond can take is referred to as the bond dimension. The bond dimension D of a TN is defined as the maximum of all the bond dimensions in the network. In the next section we will discuss the relationship between bonds and entanglement in a TN representation.

Figure 3.9: Area-law in PEPS: reduced states |A(α)⟩ (2×2) and |B(α)⟩ from a 4×4 PEPS. Each broken bond has dimension D and contributes log D to the entanglement entropy.
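As a sketch of how a TN replaces the big tensor C by small ones, the following Python/NumPy code builds a random MPS with open boundary conditions for a few spins-1/2 and contracts it back into the full p^N-component wave function (feasible here only because N is tiny; the point is that the MPS parameter count scales as O(N p D^2) rather than p^N):

import numpy as np

N, p, D = 6, 2, 4                     # sites, physical dimension, bond dimension
rng = np.random.default_rng(12)

# One rank-3 tensor per site with shape (D_left, p, D_right); D = 1 at the ends (OBC)
dims = [1] + [D] * (N - 1) + [1]
mps = [rng.normal(size=(dims[k], p, dims[k + 1])) for k in range(N)]
print(sum(t.size for t in mps), p ** N)     # MPS parameters vs. p^N coefficients

# Contract the chain back into the full coefficient tensor C_{i1...iN}
C = mps[0]                                  # shape (1, p, D)
for A in mps[1:]:
    C = np.tensordot(C, A, axes=([-1], [0]))   # join right bond with the next left bond
psi = C.reshape(p ** N)                     # drop the two trivial boundary bonds
psi = psi / np.linalg.norm(psi)             # normalized wave function |Psi>
print(psi.shape)                            # (64,)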

3.2.3 Entanglement entropy and Area-law

Let us consider an example to understand how the entanglement in a TN representation is related to the bond dimension. Suppose we are given a state in TN representation as shown in Fig. 3.9: a PEPS with bond dimension D. Let us estimate the entanglement entropy (EE) of a region of linear size L, and label collectively as α all the indices crossing the boundary between the states |A(α)⟩ and |B(α)⟩. Obviously, if each index that crosses the boundary can take up to D different values, then α takes D^{4L} values. The reduced density matrix for system A can be written as

\rho_A = \sum_{\alpha, \alpha'} X_{\alpha, \alpha'} |A(\alpha)\rangle \langle A(\alpha')|,    (3.8)

where X_{\alpha,\alpha'} \equiv \langle B(\alpha') | B(\alpha) \rangle. The rank of the reduced density matrix is at most


D^{4L}, whether we consider subsystem A or B. Further, the EE of the reduced system A is given by

S(\rho_A) = -\mathrm{tr}\,(\rho_A \log \rho_A).   (3.9)

S(\rho_A) is upper bounded by the logarithm of the rank of \rho_A, so the EE in this case satisfies

S(\rho_A) \le 4L \log D,   (3.10)

which is an upper bound of area-law type for the EE. We can also deduce from this equation that every broken bond index contributes at most log D to the total entropy.

Let us discuss the area-law for different possible states. If the given state is just a product state, then its bond dimension is D = 1 and the EE shows no entanglement, hence S = 0. This is a general result for a TN: if the bond dimension is trivial, there is no entanglement. The second case is D > 1; such a state already obeys the area-law, and increasing D only changes the multiplicative factor. Therefore, to change the scaling of the entropy, the structure or geometric pattern of the TN has to change, because L depends on the geometry. The conclusion is that the entanglement depends on the bond dimension D and on how the bonds are connected together, which defines a geometric pattern. That is why different families of TN states with the same D have entirely different entanglement properties. Notice that by fixing D to a value greater than one we can achieve both computational efficiency and quantum correlations. A larger region of the Hilbert space can be explored by increasing the bond dimension, as shown in Fig. 3.10.

The entropy given in Eq. 3.9 is called the von Neumann entanglement entropy. There are other entropies that quantify entanglement; the whole family of EEs is called the Renyi entropies,

S_\alpha(\rho_A) = \frac{1}{1-\alpha} \log \mathrm{tr}\,(\rho_A^{\alpha}),   (3.11)

for \alpha > 0. For \alpha \to 1, this formula reduces to the von Neumann entropy.

3.2.4 Proven instances and violations of Area-law

One can characterize the systems that obey the area-law by the general statement that most systems with gapped Hamiltonians follow the area-law while gapless systems do not. There are examples of gapless Hamiltonians that follow the area-law, but the statement is true in general. For instance, free bosons obey the area-law even at criticality, and so do some gapless models in more than one dimension. For gapped free fermionic models, where the Hamiltonian is a quadratic polynomial in creation and annihilation operators, the area-law holds for arbitrary lattice systems in any dimension.


Figure 3.10: By increasing the bond dimension D of a TN state one can explore a larger region of the Hilbert space.

More importantly, MPS and PEPS also obey the area-law. As discussed above, the efficient representation of a quantum state by TN methods is due to this area-law. There are many other examples, but in general any gapped Hamiltonian with a unique ground state respects the area-law.

Systems at criticality, i.e. gapless systems, do not respect the area-law. There are correlations in the system at every length scale, so the decay of correlations is not exponential and the area-law is violated. But the corrections to the area-law are small. Conformal Field Theory (CFT) predicts that

S(\rho_A) = \frac{c}{3} \log\frac{l}{a} + C,   (3.12)

where l is the length of the region A, a is the lattice spacing, c is the central charge, and C > 0 is a constant [22]. So, in fact, the entropy scales with the size of the region rather than with its boundary,

S(\rho_A) = O\big(\log(|A|)\big),   (3.13)

where |A| is the size of region A; it diverges logarithmically with the system size. Free fermionic systems at a critical point also violate the area-law. The precise scaling of the entropy for higher-dimensional critical systems is still not fully understood.

3.2.5 Entanglement spectra

Often the entire spectrum of \rho_A is more useful than the single number provided by the entanglement entropy. In fact, more information is revealed if the entanglement Hamiltonian is considered, and the information provided by the Renyi entropies is equivalent to that given by the entanglement spectrum of the reduced system \rho_A. Given a state \rho_A, the entanglement Hamiltonian H_A is defined through


\rho_A = e^{-H_A}.   (3.14)

Important information about a system is revealed by the entire entanglement spectrum; for example, one can extract the universality class of the system. The entanglement Hamiltonian is receiving notable attention in the context of boundary theories and topological systems.

3.3 Matrix Product States (MPS)

After establishing the notation and some key concepts, we will discuss some fundamental TNs for quantum many-body systems with strong interactions among their parts. We discuss one-dimensional systems first.

An MPS is a natural (and efficient) choice for representing the low-energy states of 1D quantum systems, more precisely of physically realistic quantum systems. In this section we first discuss some properties of MPS; after that we construct the MPS using the singular value decomposition, and motivate and define this TN in two different ways. Then some analytical examples of this TN, and the complexity it tackles, will be discussed. At the end, some simple characteristics of MPS and of operators in MPS form will be explained.

3.3.1 Some properties

• MPS are dense: Any quantum state can be represented by an MPS; by increasing the bond dimension D one explores a larger region of the many-body Hilbert space. All states in the Hilbert space can be covered by MPS, but the required bond dimension D grows exponentially with the system size. Nevertheless, an MPS can represent the low-energy states of a gapped local Hamiltonian to arbitrary accuracy with a finite value of D [23]. For critical systems, D diverges polynomially with the system size. This is illustrated in Fig. 3.10.

• 1D translational symmetry and thermodynamic limit: In general, a finite-sized MPS is not itself translationally symmetric because all the tensors can differ from each other. To take the thermodynamic limit one can choose a unit cell in the array of tensors. If all the tensors are the same, the MPS is translationally invariant under a shift by any number of tensors; the unit cell could also consist of two or three tensors. The idea is shown pictorially in Fig. 3.11(a).


Figure 3.11: (a) Infinite-sized MPS with 1- and 2-site unit cells. (b) An efficient way to contract a finite-sized MPS, which can also be applied to an infinite-sized MPS with any boundary condition.

• 1D area-law: MPS respect the area-law of entanglement entropy for 1D systems. In particular, the entanglement entropy of a reduced system is bounded by a constant, i.e. S(L) = O(log D). Intuitively, to cut out a reduced system we only need to break two bonds, independent of the size of the reduced system. Strictly, S(L) ∼ const for L ≫ 1, which is the property of ground states of gapped 1D Hamiltonians.

• Exponential decay of correlations: MPS always have a finite correlation length, which means that correlations decay exponentially in an MPS. That is why they are not suited for critical systems and cannot reproduce the scale-invariant properties of systems showing power-law decay of correlations.

• Efficient contraction of expectation values: The scalar product of finite-sized MPS can be contracted in time O(NpD^3), and an infinite-sized MPS requires O(pD^3). The contraction strategy is shown in Fig. 3.11(b), and a minimal sketch of this sweep is given below. The same kind of manipulation can be used to calculate expectation values of operators.
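The following is a minimal sketch (not code from the thesis) of this contraction for an open-boundary MPS; the tensor shape convention (D_left, p, D_right) and the example sizes N, p, D are illustrative assumptions.

import numpy as np

def mps_norm_squared(tensors):
    """Contract <psi|psi> by sweeping left to right, as in Fig. 3.11(b)."""
    env = np.ones((1, 1))                               # boundary environment
    for A in tensors:
        # env has shape (ket bond, bra bond); contract with A and conj(A) site by site
        tmp = np.einsum('ab,aps->bps', env, A)          # cost O(p D^3)
        env = np.einsum('bps,bpt->st', tmp, A.conj())   # cost O(p D^3)
    return env[0, 0].real

# Example: a random 10-site MPS with p = 2 and D = 8.
rng = np.random.default_rng(0)
p, D, N = 2, 8, 10
tensors = [rng.normal(size=(1 if i == 0 else D, p, 1 if i == N - 1 else D))
           for i in range(N)]
print(mps_norm_squared(tensors))

Each site costs O(pD^3), giving the quoted O(NpD^3) total for the scalar product.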

3.3.2 Singular value decomposition

One of the main purposes of TN states for quantum many-body systems is to represent states that reside in the physically relevant corner of a huge Hilbert space. These are low-entanglement states. The main idea behind this statement is finding a low-rank approximation of a big matrix, and this task can be achieved by the singular value decomposition (SVD). The SVD of a high-rank tensor C_{i_1,i_2,\dots,i_n;\,j_1,j_2,\dots,j_m} is given by

C_{i_1,i_2,\dots,i_n;\,j_1,j_2,\dots,j_m} = \sum_\alpha U_{i_1,i_2,\dots,i_n,\alpha}\, S_{\alpha,\alpha}\, V^\dagger_{j_1,j_2,\dots,j_m,\alpha},   (3.15)


where U and V are unitary matrices and S is a diagonal matrix containing non-negative values, called the singular values. U and V are not unique, but S is always unique for a given matrix. The TN notation of this decomposition is shown in Fig. 3.12. The SVD is also a way to calculate the EE between two systems, and the singular matrix S is used for this purpose.

Figure 3.12: Singular value decomposition (SVD) of Eq. 3.15 in TN notation: here I ≡ {i_1, i_2, ..., i_n} and J ≡ {j_1, j_2, ..., j_m}.

As already said, a tensor is just a container that holds a certain set of values and allows access to them by indexing. There is no harm in splitting and merging these indices, because it does not matter how the values are arranged.
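As an illustration of how the SVD exposes the entanglement across a cut (Eqs. 3.9, 3.11 and 3.15), here is a minimal sketch, not taken from the thesis; the random state, the system size and the value of alpha are arbitrary choices.

import numpy as np

def entanglement_entropies(psi, dim_A, dim_B, alpha=2.0):
    C = psi.reshape(dim_A, dim_B)              # group the indices I;J into a matrix
    s = np.linalg.svd(C, compute_uv=False)     # singular values
    lam = s**2 / np.sum(s**2)                  # Schmidt weights = eigenvalues of rho_A
    lam = lam[lam > 1e-12]
    S_vn = -np.sum(lam * np.log(lam))          # von Neumann entropy, Eq. 3.9
    S_renyi = np.log(np.sum(lam**alpha)) / (1.0 - alpha)   # Renyi entropy, Eq. 3.11
    return S_vn, S_renyi

# 8 qubits, cut into 4 + 4
psi = np.random.default_rng(1).normal(size=2**8) + 0j
psi /= np.linalg.norm(psi)
print(entanglement_entropies(psi, 2**4, 2**4))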

3.3.3 MPS construction

Suppose we are given a general N-site state

|\Psi\rangle = \sum_{i_1,i_2,\dots,i_N}^{p} \Psi_{i_1,i_2,\dots,i_N}\, |i_1\rangle \otimes |i_2\rangle \otimes \dots \otimes |i_N\rangle,

in which each site is a p-dimensional system. The state is completely specified by knowledge of the complex coefficients C \equiv \Psi_{i_1,\dots,i_N}. After separating the first index of C from the rest and applying the SVD, we get the decomposition

|\Psi\rangle = \sum_{j} \lambda_j \, |L_j\rangle |R_j\rangle ,   (3.16)

where \{|L_j\rangle\} and \{|R_j\rangle\} are orthonormal bases and the \lambda_j are the Schmidt eigenvalues or weights.

In TN notation this process is shown in Fig. 3.13.

Figure 3.13: Pictorial representation of the SVD performed in Eq. 3.16.


Figure 3.14: Complete construction of an MPS by SVD.

The Renyi entropy is given in Eq. 3.11; note that the EE is determined entirely by the non-zero Schmidt weights. These weights form the singular matrix, which consists of the singular values of the decomposition. Hence the singular values represent the entanglement structure across the applied cut.

We can apply consecutive SVDs along each cut, thus splitting the whole tensor into small local tensors X. The singular matrices λ quantify the entanglement across each cut, as shown in Fig. 3.14. Now contract the singular matrix λ^{[i]} into the local tensor X^{[i]}; consequently we obtain a general form of a quantum many-body state |Ψ⟩ as

[Eq. (3.17): the graphical MPS form of |Ψ⟩ obtained above; see Fig. 3.14]

The graphical representation given in Eq. 3.17 is a matrix product state (MPS). This construction is generic and contains the same number of parameters, arranged in a much more complicated fashion. In fact, the usefulness of this construction is not obvious yet, but it will become clear in a moment.

However, states with short-range entanglement across a cut have only a few non-zero Schmidt weights; let us call D the number of non-zero Schmidt weights. For these states, the MPS form enables us to truncate the λ matrix. To be more specific, any state that respects an area-law such that S_0 < log c, for some constant c, across any bipartition can be expressed exactly using an MPS with just O(dNc^2) parameters, where d is the local physical dimension. For many relevant states, a von Neumann entropy S_1 = O(1) is enough to ensure an arbitrarily good approximation by an MPS of poly(N) bond dimension.
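A minimal sketch of this successive-SVD construction (not the thesis code) is given below; the tensor shape convention, the truncation threshold and the example sizes are assumptions made for illustration.

import numpy as np

def state_to_mps(psi, N, p, Dmax):
    """Split a full state vector into open-boundary MPS tensors A[i] of shape
    (D_left, p, D_right), keeping at most Dmax Schmidt values per cut."""
    tensors = []
    rest = psi.reshape(1, -1)                  # (bond, remaining physical legs)
    for site in range(N - 1):
        Dl = rest.shape[0]
        rest = rest.reshape(Dl * p, -1)
        U, s, Vh = np.linalg.svd(rest, full_matrices=False)
        keep = min(Dmax, int(np.count_nonzero(s > 1e-12)))
        U, s, Vh = U[:, :keep], s[:keep], Vh[:keep, :]
        tensors.append(U.reshape(Dl, p, keep))
        rest = np.diag(s) @ Vh                 # absorb lambda into the remainder
    tensors.append(rest.reshape(rest.shape[0], p, 1))
    return tensors

# Example: a random 8-site state with p = 2, truncated to D = 4.
rng = np.random.default_rng(2)
psi = rng.normal(size=2**8); psi /= np.linalg.norm(psi)
mps = state_to_mps(psi, N=8, p=2, Dmax=4)
print([A.shape for A in mps])

If Dmax is large enough the contraction of the returned tensors reproduces psi exactly; truncating Dmax discards the smallest Schmidt weights across each cut.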


In an MPS all the tensors are rank-3 tensors; the dangling indices are called physical indices while the contracted indices are called bond or virtual indices. Sometimes it is convenient to consider PBC, and to handle periodic states the MPS of Eq. 3.17 is commonly modified to

\big|\Psi[A^{[1]},A^{[2]},\dots,A^{[N]}]\big\rangle = \sum_{i_1,i_2,\dots,i_N} \mathrm{Tr}\big[A^{[1]}_{i_1} A^{[2]}_{i_2} \cdots A^{[N]}_{i_N}\big]\, |i_1,i_2,\dots,i_N\rangle ,   (3.18)

and for translationally invariant states all the A's are the same,

\big|\Psi[A]\big\rangle = \sum_{i_1,i_2,\dots,i_N} \mathrm{Tr}\big[A_{i_1} A_{i_2} \cdots A_{i_N}\big]\, |i_1,i_2,\dots,i_N\rangle .   (3.19)

This representation may look intimidating but it is really simple: one has to specify the physical indices and the rest is just matrix multiplication. This state in TN notation is given as

[Eq. (3.20): the PBC MPS of Eq. 3.19 in graphical TN notation]

Let us discuss the characteristics of the set of A matrices produced by the left-to-right construction. These matrices are left-normalized, i.e.

\sum_i A_i^\dagger A_i = I.   (3.21)

An MPS with only left-normalized matrices is called left-canonical. Clearly, there is nothing special about slicing the huge tensor starting from the left; one can apply the SVD from right to left, and the resulting MPS is called right-canonical, as it satisfies

\sum_i A_i A_i^\dagger = I.   (3.22)

One can also choose the centre of the big tensor as a starting point. All these choices represent the same state, because the MPS representation is not unique. One can get rid of this non-uniqueness by fixing the gauge.

3.3.4 Gauge degrees of freedom

There are various ways to write an MPS for an arbitrary quantum state, and each approach has its own advantages and disadvantages. It should be noticed that the MPS representation is not unique, which implies the existence of gauge degrees of freedom.


Suppose two consecutive sets of matrices B_i and B_{i+1} share a common column/row dimension D. Then the MPS is unchanged, for any invertible D×D matrix X, under the transformation

B_i \to B_i X, \qquad B_{i+1} \to X^{-1} B_{i+1}.   (3.23)

By fixing the gauge, calculations become very simple; all the standard constructions of MPS are special cases of a gauge choice.

Since the matrices can in principle be huge, one needs to restrict their size to some D to make the MPS simulatable on a computer. This can be done without compromising much on the description of a one-dimensional state. Often, in the canonical MPS representation the eigenvalues of the reduced density matrix of the system decrease exponentially. It is then possible to keep only the eigenvectors of the reduced density matrix (Eq. 3.16) whose singular values exceed some threshold; the number D of states kept defines the order of precision in which we are interested.

3.4 1D Projected Entangled Pair States (PEPS)

Another way to construct an MPS is to consider it as a special case of a PEPS. One begins by laying out some entangled pair state |φ⟩, such as a Bell state, on a certain lattice and then applying a linear map P to the pairs,

[Eq. (3.24): graphical definition of the PEPS construction]

where

[Eq. (3.25): graphical form of the entangled pair]

is the particular entangled pair that we chose. It is an easy task to show that this construction leads to an MPS by considering |\varphi\rangle = \sum_{k=0}^{d-1} |kk\rangle. The linear map P can be written as

P = \sum_{i,\mu,\nu} A_{i,\mu,\nu}\, |i\rangle\langle \mu\nu| .   (3.26)


The tensor A is precisely the MPS tensor defined above; applying the projection operator to the auxiliary spins, which form the entangled pairs, results in the exact construction of an MPS:

P^{(1)} \otimes P^{(2)} |\varphi\rangle_{2,3} = \sum_{i_1,i_2,\mu_1,\nu_1,\mu_2,\nu_2,k} A^{(1)}_{i_1,\mu_1,\nu_1} A^{(2)}_{i_2,\mu_2,\nu_2}\, |i_1 i_2\rangle\langle \mu_1\nu_1\mu_2\nu_2|\, (I \otimes |kk\rangle \otimes I)
 = \sum_{i_1,i_2,\mu_1,\nu_1,\nu_2} A^{(1)}_{i_1,\mu_1,\nu_1} A^{(2)}_{i_2,\nu_1,\nu_2}\, |i_1 i_2\rangle\langle \mu_1\nu_2| .   (3.27)

Hence we see that the two definitions are the same; they are exchanged by applying local unitaries to the auxiliary/virtual bond indices of A, or equivalently by using a different maximally entangled pair in the PEPS construction.

3.5 Examples of MPS

Let us take some concrete examples to make MPS less abstract.

AKLT State

The most interesting quantum many-body state in which to study correlations is the AKLT state; it is the ground state of the following Hamiltonian:

H = \sum_i \mathbf{S}_i \cdot \mathbf{S}_{i+1} + \frac{1}{3}\left(\mathbf{S}_i \cdot \mathbf{S}_{i+1}\right)^2 .   (3.28)

Figure 3.15: AKLT state: each physical site carries a spin-1 which is replaced by two spin-1/2 degrees of freedom (called auxiliary spins). Each right auxiliary spin-1/2 on site i is entangled with the left spin-1/2 at site i+1. A linear projection operator defined on the auxiliary spins maps them to the physical spins.

Here the S_i are spin-1 operators. The ground state of this Hamiltonian can be constructed using the PEPS picture, as depicted in Fig. 3.15. Each physical spin-1 is replaced by two spin-1/2 particles; out of the four possible states we keep only the triplet to represent the S = 1 states:

|+\rangle = |\uparrow\uparrow\rangle, \qquad |0\rangle = \frac{|\uparrow\downarrow\rangle + |\downarrow\uparrow\rangle}{\sqrt{2}}, \qquad |-\rangle = |\downarrow\downarrow\rangle .   (3.29)

On consecutive sites, adjacent spin-1/2 particles are paired in a singlet state

\frac{|\uparrow\downarrow\rangle - |\downarrow\uparrow\rangle}{\sqrt{2}} .   (3.30)

This state can be represented by an MPS with bond dimension D = 2. In the description in terms of the 2L auxiliary spin-1/2 particles on a one-dimensional lattice of length L, any state can be written as

|\Psi\rangle = \sum_{\mu\nu} c_{\mu\nu}\, |\mu\nu\rangle ,   (3.31)

where |\mu\rangle = |\mu_1,\dots,\mu_L\rangle and |\nu\rangle = |\nu_1,\dots,\nu_L\rangle label the first and second spin-1/2 particle on every site. A singlet bond between sites k and k+1,

|\sigma^{[k]}\rangle = \sum_{\nu_k,\mu_{k+1}} \sigma_{\nu_k\mu_{k+1}}\, |\nu_k\rangle |\mu_{k+1}\rangle ,   (3.32)

defines a 2×2 matrix σ,

\sigma = \begin{pmatrix} 0 & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & 0 \end{pmatrix} .   (3.33)

The state with singlets on all the bonds is then

|\Psi_\sigma\rangle = \sum_{\mu\nu} \sigma_{\nu_1\mu_2}\sigma_{\nu_2\mu_3}\cdots\sigma_{\nu_{L-1}\mu_L}\sigma_{\nu_L\mu_1}\, |\mu\nu\rangle ,   (3.34)

for PBC. For OBC the first and last spin-1/2 remain unpaired and the factor \sigma_{\nu_L\mu_1} is dropped.

Now we map the auxiliary spin-1/2 pairs |\mu_k\rangle|\nu_k\rangle \in \{|\uparrow\rangle,|\downarrow\rangle\}^{\otimes 2} to the physical spins |i\rangle \in \{|+\rangle,|0\rangle,|-\rangle\}. Following the PEPS construction of the previous section, we need a projection operator that maps the auxiliary Hilbert space to the physical one. We define M_{i\mu\nu}\,|i\rangle\langle\mu\nu| through three 2×2 matrices, one for each value of i,

M_{+} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad M_{0} = \begin{pmatrix} 0 & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 0 \end{pmatrix}, \qquad M_{-} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.   (3.35)

The map onto the spin-1 chain is then given as


\sum_{i's,\mu,\nu} M_{i_1,\mu_1,\nu_1} M_{i_2,\mu_2,\nu_2} \cdots M_{i_L,\mu_L,\nu_L}\, |i_1,i_2,\dots,i_L\rangle\langle\mu\nu| ,   (3.36)

and applying this map to |\Psi_\sigma\rangle results in

|\Psi\rangle = \sum_{i's} \mathrm{Tr}\big[M_{i_1}\sigma M_{i_2}\sigma \cdots M_{i_L}\sigma\big]\, |i_1,i_2,\dots,i_L\rangle .   (3.37)

For further simplification we introduce A_i = M_i\sigma; hence the AKLT state is

|\Psi\rangle = \sum_{i's} \mathrm{Tr}\big[A_{i_1} A_{i_2} \cdots A_{i_L}\big]\, |i_1,i_2,\dots,i_L\rangle .   (3.38)
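As a concrete check of Eqs. 3.33, 3.35 and 3.38, the following minimal sketch (not from the thesis) builds the matrices A_i = M_i σ and evaluates coefficients of the unnormalized AKLT state on a periodic chain; the example configurations are arbitrary.

import numpy as np

sqrt2 = np.sqrt(2.0)
sigma = np.array([[0.0, 1/sqrt2], [-1/sqrt2, 0.0]])       # Eq. 3.33
M = {'+': np.array([[1.0, 0.0], [0.0, 0.0]]),             # Eq. 3.35
     '0': np.array([[0.0, 1/sqrt2], [1/sqrt2, 0.0]]),
     '-': np.array([[0.0, 0.0], [0.0, 1.0]])}
A = {i: M[i] @ sigma for i in M}                          # A_i = M_i sigma, Eq. 3.38

def aklt_coefficient(config):
    """Coefficient <i_1...i_L|Psi> for a PBC chain; config is a string like '+0-0'."""
    prod = np.eye(2)
    for i in config:
        prod = prod @ A[i]
    return np.trace(prod)

# '+-+-' has a non-zero amplitude, while '++++' vanishes (A_+ is nilpotent).
print(aklt_coefficient('+-+-'), aklt_coefficient('++++'))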

Product State

Let

A_0 = \begin{pmatrix} 1 \end{pmatrix}, \qquad A_1 = \begin{pmatrix} 0 \end{pmatrix}.   (3.39)

This set of matrices produces the product state |00\cdots 0\rangle.

W State

To get the W state for N particles we define the matrices A as

A_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad A_1 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},   (3.40)

with periodic boundary conditions,

[Eq. (3.41): graphical MPS form of the W state]

Notice that A_0^2 = A_0, A_0 A_1 = A_1 and A_1^2 = 0; defining a boundary matrix X such that \mathrm{Tr}[A_1 X] = 1 and \mathrm{Tr}[A_0 X] = 0, we obtain the MPS representation of the state

|W\rangle = \sum_{k=1}^{N} |00\cdots 0\, 1_k\, 0 0\cdots 0\rangle .   (3.42)

The Mathematica code is given in Appendix A.1.
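The thesis' own code is the Mathematica listing of Appendix A.1; as an independent illustration, the following Python sketch evaluates the PBC MPS of Eqs. 3.40-3.42 with an assumed boundary matrix X satisfying Tr[A_1 X] = 1 and Tr[A_0 X] = 0, and confirms that only single-excitation configurations survive.

import numpy as np
from itertools import product

A = {0: np.eye(2), 1: np.array([[0.0, 1.0], [0.0, 0.0]])}   # Eq. 3.40
X = np.array([[0.0, 0.0], [1.0, 0.0]])   # an assumed choice meeting both trace conditions

def mps_coefficient(bits):
    prod_ = np.eye(2)
    for b in bits:
        prod_ = prod_ @ A[b]
    return np.trace(prod_ @ X)

N = 4
for bits in product([0, 1], repeat=N):
    c = mps_coefficient(bits)
    if abs(c) > 1e-12:
        print(bits, c)        # only configurations with exactly one '1' appear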


GHZ State

The set of matrices A for the GHZ state is given as

A_0 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad A_1 = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix},   (3.43)

and the equivalent MPS representation follows from the PEPS construction: if we define |\varphi\rangle = |00\rangle + |11\rangle and P = |0\rangle\langle 00| + |1\rangle\langle 11|, then we get

|GHZ\rangle = |00\cdots 0\rangle + |11\cdots 1\rangle .   (3.44)

The Mathematica code is given in Appendix A.2.


CHAPTER 4

MAPPING BETWEEN VARIATIONAL RG AND RBM

This chapter is devoted to explaining the relation and mapping between the renormalization group (RG) and the restricted Boltzmann machine (RBM). We begin with RG and show how theories at different length scales are constructed by the coarse-graining process in RG. In particular, the RG transformation on a block-spin lattice in 1D as well as in 2D will be explained. After that, we give a detailed explanation of the mapping between variational RG and the RBM, and consider the Ising model in 1D and 2D as examples.

4.1 Renormalization group (RG)

The renormalization group is one of the fundamental ideas on which the theoretical structure of Quantum Field Theory (QFT) is based. This suggests that the ideas underlying RG can also be viewed from the point of view of statistical and condensed matter physics.

In its early days, RG was a subject of high energy physics, for instance renormalized Quantum Electrodynamics; later it was realized that RG ideas apply to statistical and condensed matter physics as well. In the high energy domain, RG methods are performed in momentum space and hence called momentum-space RG, whereas real-space RG is used in statistical physics. From now on we only discuss the real-space RG method.

The RG is a framework, a set of ideas, that can tackle problems in which fluctuations occur in a system at all length scales, for example systems at criticality. At the


critical point of a system our previous approach, Mean Field Theory (MFT), breaks down, because in the mean-field model we assume that a particle interacts with all other particles in the system with equal strength. Usually we account for nearest-neighbour interactions, and the treatment becomes complicated once longer-range interactions are included. The RG approach uses a very interesting set of conceptual ideas such as scale-invariance: the singularities that appear at critical points are connected to behaviour of the statistical system that is the same at all length scales, so at criticality the system has correlations at every length scale. That scale-invariance is expressed through the free energy or Hamiltonian of the system, which consists of components that do not depend on the length scale; as a result, the Hamiltonian or free energy of the system does not change under a change of length scale. This "unchanging" response is expressed as being at a fixed point. The repeated application of a step that changes the length scale of the system while keeping its free energy unchanged is called an RG transformation or renormalization.

It is observed that near criticality many different systems have identical quantities describing the critical behaviour (critical exponents). Such systems fall into the same universality class, meaning that systems of entirely different origin behave identically near the critical point and hence have the same critical exponents. For example, the para-ferromagnetic and liquid-gas phase transitions belong to the same universality class.

The point of this presentation, as opposed to the previous one, is that in several important instances one may be interested not only in what these universal properties are, but also in the explicit temperature or phase diagram of the system as a function of external parameters as well as temperature. If we have some idea of how the microscopic degrees of freedom of our chosen system interact, then we should be able to evaluate the partition function (for the Ising model or a similar lattice model) and obtain its singularities. For example, for a particular system one may be interested in the critical temperature where the phase transition occurs, as well as in the phase diagram and the critical behaviour.

No matter what the motivation is, all RG methods provide a set of mathematical equations that define a renormalization flow in some complicated parameter space, and these flows tell us a lot about the physical problem at hand, which is the strength of RG theory. These methods are, however, difficult to control quantitatively as the number of parameters grows. Despite this challenge, RG provides powerful and far-reaching concepts such as scaling and universality.


4.1.1 1D Ising model

Let’s begin with the model that does not show the critical behaviour but exactly solvable.

The 1D model represents the physical spins vi lying on the lattice with some lattice

constant. When external magnetic field not present the Hamiltonian of the system look

like

H =−K∑

ivivi+1. (4.1)

where K is the coupling constant that favours aligned spin configurations. There are various schemes for performing the RG transformation; one is to marginalize over half of the spins, which doubles the lattice constant, with K^{(1)} the coupling of the new coarse-grained system. We can coarse-grain this system again, and that is the next round of RG. Let us denote the set of coupling constants as K^{(0)}, K^{(1)}, \dots, K^{(n)}, which define the interactions between spins after each RG transformation. The output of the RG procedure is an elegant recursion relation between the couplings, and this defines a flow,

\tanh K^{(n)} = \tanh^2 K^{(n-1)}.   (4.2)

Here K = K^{(0)}, and the detailed calculation of the RG transformation for the 1D Ising model is given in Appendix D.
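The flow of Eq. 4.2 can be iterated directly; the following minimal sketch (not the Appendix D calculation) shows that any finite starting coupling flows towards K = 0, the trivial high-temperature fixed point. The starting value and number of steps are arbitrary.

import numpy as np

def rg_flow_1d(K0, steps=8):
    """Iterate tanh K^(n) = tanh^2 K^(n-1), Eq. 4.2."""
    K = K0
    trajectory = [K]
    for _ in range(steps):
        K = np.arctanh(np.tanh(K) ** 2)
        trajectory.append(float(K))
    return trajectory

print(rg_flow_1d(2.0))   # even a strong initial coupling flows towards K = 0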

4.1.2 2D Ising model

Now consider a system that does show critical behaviour. The first step in the RG procedure is decimation, or coarse-graining; there are many renormalization schemes to achieve this, and one possibility is shown in Fig. 4.1. To perform this step, first notice that every spin has four nearest-neighbours; we arrange the partition in such a way that every decimated spin shows up in only one Boltzmann factor:

Z = \sum_{v_1,v_2,\dots} \cdots\, e^{K v_5(v_1+v_2+v_3+v_4)}\, e^{K v_6(v_2+v_3+v_7+v_8)} \cdots .   (4.3)

After summing over every other spin, we have to find a Kadanoff transformation such that the summed partition function looks exactly like the unsummed one. This time the task is not as simple as for the 1D Ising model: we have to include all types of couplings, as shown in Fig. 4.2, otherwise there is no solution. In this case a single K will not work; we have to include all three possible couplings {K_1, K_2, K_3}:


Figure 4.1: The RG decimation scheme: every spin has 4 nearest-neighbours and we sum over half of the spins. The resulting lattice is the same as the original but rotated by 45°.

e^{K v_5(v_1+v_2+v_3+v_4)} + e^{-K v_5(v_1+v_2+v_3+v_4)} = g(K)\, e^{\frac{1}{2}K_1(v_1v_2+v_2v_3+v_3v_4+v_4v_1) + K_2(v_1v_3+v_2v_4) + K_3 v_1v_2v_3v_4} .   (4.4)

After inserting all possible values of (v_1, v_2, v_3, v_4), we obtain four equations with four unknowns. Their solutions are:

K_1 = \frac{1}{4}\ln\cosh(4K),
K_2 = \frac{1}{8}\ln\cosh(4K),
K_3 = \frac{1}{8}\ln\cosh(4K) - \frac{1}{2}\ln\cosh(2K),
g(K) = 2\,[\cosh(2K)]^{1/2}\,[\cosh(4K)]^{1/8}.   (4.5)

Combining Eq. 4.4 with the partially summed partition function produces

Z(K, N) = \sum_{N\ \mathrm{spins}} e^{K\sum'_{i,j} v_i v_j} = [g(K)]^{N/2} \sum_{N/2\ \mathrm{spins}} e^{\,K_1\sum'_{i,j} v_i v_j + K_2\sum''_{l,m} v_l v_m + K_3\sum'''_{p,q,r,t} v_p v_q v_r v_t},   (4.6)

where the second sum runs over the remaining N/2 spins. The single-primed sum is over nearest-neighbours, the double-primed sum over next-nearest-neighbours, and the triple-primed sum over the four-spin products around each square. Notice that we obtain a complicated connectivity (as shown in Fig. 4.2) after removing the degrees of freedom.


Figure 4.2: In the coarse-graining process, as a result of removing degrees of freedom we obtain a high degree of connectivity. Yellow lines show the nearest-neighbour couplings v_1v_2 + v_2v_3 + v_3v_4 + v_4v_1, green lines represent the next-nearest-neighbour couplings v_1v_3 + v_2v_4, and blue connections depict the four-spin product v_1v_2v_3v_4.

This is the major complication in the RG treatment, and it motivates the variational renormalization techniques we will see in the next section.

These complicated couplings do not allow us to express the partition function in the form of a Kadanoff transformation, so we cannot carry out the RG calculation exactly. We need to approximate the couplings somehow to proceed. One possible solution is to neglect K_2 and K_3, but the resulting equation is the same as for the 1D Ising model and predicts no phase transition. For a better approximation, at least K_2 must be included in the calculation. One appropriate choice is a mean-field-like approximation in which the next-nearest couplings are incorporated into the nearest-neighbour ones, K_1\sum'_{i,j} v_i v_j + K_2\sum''_{l,m} v_l v_m \approx K'(K_1,K_2)\sum'_{i,j} v_i v_j. This approximation enables us to write the partially summed Z in the form of the unsummed Z,

Z(K, N) = [g(K)]^{N/2}\, Z[K'(K_1,K_2), N/2].

Let us define the free energy per spin as f(K) = N^{-1}\ln Z(K,N); using the value of g(K), this leads to

f(K') = 2f(K) - \ln\{2[\cosh(2K)]^{1/2}[\cosh(4K)]^{1/8}\}.   (4.7)

Consider the energy when the system is highly ordered, i.e. all spins are aligned. For a 2D square lattice with N/2 spins there are N nearest-neighbour and also N next-nearest-neighbour bonds. So K_1\sum'_{i,j} v_i v_j = NK_1 and K_2\sum''_{l,m} v_l v_m = NK_2 when all spins are parallel. We therefore estimate K' as K' \approx K_1 + K_2. By using Eq. 4.5 we get


K' = \frac{3}{8}\ln\cosh(4K).   (4.8)

So we now have a recursion relation for the renormalization; notice that Eq. 4.8 has a nontrivial fixed point. A finite value K_c exists such that

K_c = \frac{3}{8}\ln\cosh(4K_c).

Indeed,

K_c = 0.50698.
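This fixed point is easy to reproduce numerically. The sketch below (not from the thesis) solves Eq. 4.8 for K_c and follows the flow from slightly below and slightly above K_c, illustrating the instability shown in Fig. 4.3; the bracketing interval and the step counts are arbitrary choices.

import numpy as np
from scipy.optimize import brentq

R = lambda K: 0.375 * np.log(np.cosh(4.0 * K))   # Eq. 4.8

Kc = brentq(lambda K: R(K) - K, 0.1, 1.0)        # nontrivial fixed point
print("K_c =", Kc)                               # approx. 0.50698

for K0 in (Kc - 0.01, Kc + 0.01):                # the flow moves away from K_c
    K, traj = K0, []
    for _ in range(6):
        K = R(K)
        traj.append(round(K, 4))
    print(round(K0, 4), "->", traj)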

Figure 4.3: RG flow diagram for the 2D Ising model: there are three fixed points, two stable (K = 0, ∞) and one unstable (K_c). K_c is the phase transition point.

If the renormalization starts near (but not exactly at) K_c, Fig. 4.3 shows that the iterations move away from K_c. This non-trivial fixed point is unstable and often called a "source"; the other two fixed points are called "sinks". This RG treatment of the 2D Ising model shows that even with rudimentary approximations the RG method is powerful enough to extract qualitative results. Before going to the next section, we summarize a few observations made in this discussion. First is the high degree of connectivity, which produces cooperation between the variables and is a cause of the phase transition; that topology also leads to increasingly complicated interactions as one sums over more and more variables. One can appreciate how complicated the interactions become by considering the integration over five spins in the 2D lattice, as opposed to the 1D case. One cannot generate such a high degree of connectivity in the 1D Ising model, and that model shows no phase transition.

Indeed, if we neglect the complicated interactions the RG process becomes very easy, but it leads to wrong predictions. One way to think about this is that removing degrees of freedom takes us to a higher-dimensional space of couplings, in this case K_1, K_2 and K_3, and the resulting partition function Z depends on all these coupling parameters. One has to consider all imaginable couplings to compute the partition function by the RG process, and that results in a transformation in a higher-dimensional coupling space, i.e. {K'} = R{K}. It is only as an approximation that we restrict the flow diagram to a one-dimensional parameter space. For more detailed calculations see Refs. [24][25][26].


The difficulties discussed above motivate us to consider a numerical technique for performing the RG transformation. We can use variational methods to approximate the exact transformation, and this is the topic of the next section.

4.2 Variational RG

In the previous section, RG theory was explained through two simple examples, and we saw that the major difficulty in an RG transformation is keeping track of all the possible couplings. The variational approach was introduced to tackle this difficulty and make the renormalization feasible.

Consider an ensemble of N spins v = {v_i} sitting on some lattice; each spin can take the two values ±1 and the index i denotes the position of the spin on the lattice. At thermal equilibrium, the fundamental formula of statistical mechanics relates the probability of a particular configuration of the system to its energy:

P(v) = \frac{e^{-H(v)}}{Z},   (4.9)

Z = \mathrm{Tr}_v\, e^{-H(v)} \equiv \sum_{v_1,v_2,\dots,v_N = \pm 1} e^{-H(v)},   (4.10)

where H(v) is the Hamiltonian of the system and Z is the partition function. For convenience, set the temperature T = 1. Usually, as we have seen for the 2D Ising model, all possible Hamiltonians are parameterized by all imaginable couplings K = {K}; for instance, there are three orders of interaction {K_1, K_2, K_3} in Eq. 4.6. The order of the interactions depends on the topology of the system; a more general form of the interactions is

H[v] = -\sum_i K_i v_i - \sum_{i,j} K_{i,j} v_i v_j - \sum_{i,j,k} K_{i,j,k} v_i v_j v_k + \cdots .   (4.11)

Given the partition function Z, the free energy is defined in the usual way as

F^v = -\log Z = -\log\big(\mathrm{Tr}_v\, e^{-H(v)}\big).   (4.12)

The idea on which RG is based is, given a fine-grained description of the system, to find a coarse-grained one by summing over short-distance fluctuations. The new coarse-grained system consists of new and relatively few M < N degrees of freedom h = {h_j}. In this process the new coarse-grained description of the system has a larger characteristic length scale (lattice spacing) than the previous one. One can easily rescale the system so that it has the same characteristic length.


Figure 4.4: Block spin transformation: (a) 2×2 blocks are defined on the physical spins v_i to be marginalized; (b) depicts the effective spins h_i after marginalizing the physical degrees of freedom; (c) a side view of the RG procedure: repeated application of RG transformations produces a series of ensembles, one above the other.

For instance, the block-spin RG transformation defined by Kadanoff is shown in Fig. 4.4, where the "auxiliary" spins h_i represent the state of a 2×2 local block of "physical" spins v_i. Under this renormalization scheme the lattice spacing doubles after each RG step.

In this way, RG allows us to replace an ensemble consisting of N degrees of freedom v with an ensemble composed of fewer degrees of freedom h. The new system has the same form of Z with new coupling parameters K' = {K'}, and the Hamiltonian H^{RG}[h] defined on the new system is parameterized by these couplings; the {K'} include the effect of the couplings in the fine-grained system.

H^{RG}[h] = -\sum_i K'_i h_i - \sum_{i,j} K'_{i,j} h_i h_j - \sum_{i,j,k} K'_{i,j,k} h_i h_j h_k + \cdots ,   (4.13)

where the hidden spins h are coupled to each other through the K'. In simple terms, this renormalization transformation is just a mapping between couplings, K → K'. The exact mapping depends strongly on the chosen RG scheme, and it is often difficult to solve this problem analytically.

Kadanoff introduced a variational approach to this RG scheme. In his proposed variational scheme, the coarse-graining is performed by introducing a projection function T_\chi(v,h), parameterized by variational parameters {\chi}, which encodes the interactions between the hidden spins h and the physical spins v. After coupling v and h, one can sum over the physical/visible spins to obtain a coarse-grained description of the


fine-grained system entirely in terms of h. Naturally, the function T_\chi defines a Hamiltonian for h through

e^{-H^{RG}_\chi[h]} = \mathrm{Tr}_v\, e^{T_\chi(v,h) - H(v)}.   (4.14)

The free energy of the coarse-grained system is defined in a similar way,

F^h_\chi = -\log\big(\mathrm{Tr}_h\, e^{-H^{RG}_\chi(h)}\big).   (4.15)

So far we have not paid attention to the variational parameters \chi of T_\chi. An exact renormalization procedure preserves the physics at large length scales; quantitatively, this means the free energy of the coarse-grained system should equal that of the fine-grained one. So we choose \chi in such a way that \Delta F = F^h_\chi - F^v vanishes. For an exact transformation under any RG scheme,

\mathrm{Tr}_h\, e^{T_\chi(v,h)} = 1.   (4.16)

Notice that the converse holds as well,

\Delta F = 0 \iff \mathrm{Tr}_h\, e^{T_\chi(v,h)} = 1.   (4.17)

Usually it is hard to carry out the exact transformation by choosing the optimum \chi, so various optimization and variational techniques have been proposed to determine \chi by minimizing \Delta F.

4.3 Overview of RBM

Before going into the details of the mapping between the RBM and RG, let us introduce some convenient notation for the RBM. The true probability distribution is P(v) over N visible units, and M is the number of hidden units. The energy function is defined as E(v,h) = a^T v + b^T h + h^T W v. The set of variational parameters is \chi = \{a_i, W_{ij}, b_j\}, and the joint probability distribution P_\chi is

P_\chi(v,h) = \frac{e^{-E(v,h)}}{Z},   (4.18)

and the marginal distributions are given by

P_\chi(v) = \sum_h P_\chi(v,h) = \mathrm{Tr}_h\, P_\chi(v,h),   (4.19)


P_\chi(h) = \sum_v P_\chi(v,h) = \mathrm{Tr}_v\, P_\chi(v,h).   (4.20)

Let us also define "variational" RBM Hamiltonians, for later convenience. For the hidden units,

P_\chi(h) = \frac{e^{-H^{RBM}_\chi[h]}}{Z},   (4.21)

and for the visible units,

P_\chi(v) = \frac{e^{-H^{RBM}_\chi[v]}}{Z}.   (4.22)

The training of the RBM is based on gradient-based maximization of the log-likelihood. A detailed explanation of the training of RBMs, deep neural networks (DNNs) and deep belief networks (DBNs) is given in Chapter 2.

4.4 Mapping between Variational RG and RBM

In an RBM the energy of a configuration of visible and hidden units is given by E(v,h). In variational RG a similar role is played by T_\chi(v,h), which encodes the coupling between visible and hidden units. We claim, and will now show, the following relation between the two:

T(v,h) = -E(v,h) + H[v],   (4.23)

where H[v] is the Hamiltonian that describes the data and encodes the probability distribution P(v) of the data, as in Eq. 4.11. This relation defines the mapping between variational RG and a particular type of DNN building block, the RBM.

Using this relation it is easy to show that H^{RG}_\chi[h], which defines the coarse-grained system in variational RG, also governs the hidden spin variables of the RBM; or, the other way around, the marginal distribution P_\chi(h) over the hidden spin variables of the RBM is the Boltzmann distribution weighted by H^{RG}_\chi[h]. To prove this, let us begin with Eq. 4.14, divide the equation by Z and substitute the claimed relation Eq. 4.23:

\frac{e^{-H^{RG}_\chi[h]}}{Z} = \mathrm{Tr}_v\, \frac{e^{-E(v,h)}}{Z} = P_\chi(h).   (4.24)

Inserting Eq. 4.21 into the above expression, we conclude that

H^{RG}_\chi[h] = H^{RBM}_\chi[h].   (4.25)


Using these results, variational RG can be viewed in the language of probability theory. The projection operator T_\chi(v,h) can be interpreted as the variationally approximated conditional probability of the hidden layer given the visible layer. To obtain an explicit expression, take the exponential of Eq. 4.23 and use the joint and (visible) marginal probability distributions. The desired result is

e^{T(v,h)} = P_\chi(h \mid v)\, e^{H[v] - H^{RBM}_\chi[v]}.   (4.26)

It shows that equality of the variationally approximated Hamiltonian H^{RBM}_\chi[v] of the visible units of the RBM and the Hamiltonian H[v] that describes the data ensures an exact RG step, i.e. \mathrm{Tr}_h\, e^{T_\chi(v,h)} = 1. So H^{RBM}_\chi[v] = H[v] also implies that T(v,h) generates the exact conditional probability distribution and that the two probabilities, the model distribution P_\chi(v) and the true distribution P(v), are equal. In machine learning terms, the gradient of the log-likelihood approaches zero.

An exact RG transformation is approached through various variational approximation schemes; machine learning, on the other hand, approximates it by maximizing the log-likelihood. Thus both RG and deep neural networks provide well-defined variational schemes for the coarse-graining process. Finally, the equivalence of the two approaches does not depend on the specific form of the energy E(v,h), so it also holds for any Boltzmann machine.
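The identity of Eq. 4.25 can be made concrete on a toy model. The sketch below (not from the thesis) enumerates a tiny RBM exactly, computes H^{RG}_χ[h] from Eq. 4.14 with T given by Eq. 4.23, computes H^{RBM}_χ[h] from the marginal of Eqs. 4.20-4.21, and checks that they coincide; the system sizes, the random parameters and the ±1 spin convention are illustrative assumptions.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, M = 4, 3                                     # visible / hidden units (illustrative)
a, b = rng.normal(size=N), rng.normal(size=M)
W = rng.normal(size=(N, M))

def energy(v, h):
    # E(v,h) = a.v + b.h + v^T W h, as in Sec. 4.3 (up to the shape of W)
    return a @ v + b @ h + v @ W @ h

spins = lambda n: [np.array(s) for s in product([-1, 1], repeat=n)]
Z = sum(np.exp(-energy(v, h)) for v in spins(N) for h in spins(M))

for h in spins(M):
    # Eq. 4.14 with T = -E + H[v]:  exp(-H_RG[h]) = Tr_v exp(-E(v,h))
    H_rg = -np.log(sum(np.exp(-energy(v, h)) for v in spins(N)))
    # Eqs. 4.20-4.21:  exp(-H_RBM[h]) = Z * P_chi(h)
    P_h = sum(np.exp(-energy(v, h)) for v in spins(N)) / Z
    H_rbm = -np.log(Z * P_h)
    assert abs(H_rg - H_rbm) < 1e-10
print("Eq. 4.25 verified: H_RG[h] = H_RBM[h] for every hidden configuration")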

4.5 Examples

To illustrate the relation between RG and deep learning, it is a good idea to discuss some simple examples. One of them, the 1D Ising model treated earlier in this chapter (with detailed calculations in Appendix D), we have already explained from the physics point of view; now we consider it from the perspective of deep learning. After that we apply the deep learning variational approach to the 2D Ising model.

4.5.1 Ising model in 1D

For the 1D Ising model, the coarse-graining by RG transformations is shown in Fig. 4.5(a). We consider nearest-neighbour couplings K^{(0)} in the fine-grained system, and K^{(n)} are the couplings of the coarse-grained system after n RG transformations. If we start with a large value of K we flow towards weak coupling, as is clear from the RG flow equation, Eq. 4.2.


Figure 4.5: RG and deep learning view of the 1D Ising model: (a) the coarse-graining by renormalization transformations of the ferromagnetic 1D Ising model. After every RG iteration half of the spins are marginalized and the lattice spacing doubles. At each level the system is replaced by a new system with relatively fewer degrees of freedom and new couplings K. By the RG flow equation, the couplings at one layer determine the couplings of the next layer. (b) The RG coarse-graining can also be performed by a deep architecture, where the weights/parameters between the (n-1)-th and n-th hidden layers are given by K^{(n-1)}.

The RG transformation can be used as a guide for the deep architecture, as shown in Fig. 4.5. The spins in the n-th layer of the DNN can be interpreted as the coarse-grained spins obtained by applying the RG transformation to the (n-1)-th layer. It should be noted that the first two layers of the DNN shown in Fig. 4.5(b) form an "effective" 1D spin chain similar to the actual one. So, decimating every other spin in the actual spin chain is equivalent to marginalizing/summing over spins in the layer below in the DNN. This suggests that each hidden layer of the DNN is also governed by the RG-transformed Hamiltonian, with K^{(1)} the local short-range couplings, and this holds for every layer of the DNN architecture in the coarse-graining process.

This simple example gives some idea of how to construct the DNN, in particular how to choose the depth and the width of the network, and requires no calculation. These two network parameters are called hyper-parameters, and the deep learning community has no clean guiding principle for choosing them; this illustrates the value of the mapping between DNNs and RG. It tells us nothing, however, about the half of the spins which are not connected to the hidden layer.


Figure 4.6: Deep neural network for the 2D Ising model: (a) a four-layer DBN is constructed with layer sizes 1600, 400, 100, and 25 spins. The network is trained on samples generated from the 2D Ising model with J = 0.405. (b) The effective receptive field (ERF), which encodes the effect of the input spins on a particular spin in a given layer, for the top layer; each 40×40 image depicts the ERF of a particular spin in the top layer. (c) The ERF gets larger as we move from the bottom to the top layer of the network, consistent with successive block-spin transformations. (d) Similarly, the ERF for the third layer with 100 spins. (e) Three samples generated from the trained network.

4.5.2 Ising model in 2D

In Sec. 4.1 we discussed the renormalization of the Ising model in two dimensions. Now we apply deep learning techniques to the two-dimensional Ising model with nearest-neighbour ferromagnetic coupling. The model is defined by the Hamiltonian

H[v] = -J \sum_{\langle ij\rangle} v_i v_j.   (4.27)

The two-dimensional Ising model shows a phase transition at J/k_B T = \ln(1+\sqrt{2})/2 \approx 0.4407, and the system can be coarse-grained (Kadanoff block-spin transformation) near this value because of the divergence of the correlation length.

Motivated by the relation between DNNs and variational RG, we trained on data generated from the two-dimensional Ising model at J = 0.405. Using a standard Monte Carlo technique, 50,000 samples were generated and used as input to a four-layer 1600-400-100-25 DBN, Fig. 4.6(a). Moreover, we employed L1 regularization and used the


contrastive divergence method for training. The regularization term drives most of the weights towards zero and prevents the model from overfitting; in practice it enforces local interactions between the visible and hidden spins, Fig. 4.6(b),(d). Had we used a convolutional neural network, locality would be built in by construction.

The resulting DNN architecture is similar to the 2×2 block-spin renormalization procedure, which implies that the DNN is performing a coarse-graining. Notice that the coupling range within a block is similar in each layer, while the effective receptive field grows from the bottom to the upper layers. The features of Kadanoff block-spin renormalization thus emerge from the deep architecture, which implies that the DNN is effectively implementing a block-spin renormalization. Moreover, the DNN can reproduce good-quality samples even though the compression ratio is 64. The coding details are provided in Appendix B.2.
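For completeness, a generic single-spin-flip Metropolis sampler for Eq. 4.27 is sketched below; this is not the code of Appendix B.2, and the lattice size, coupling and sweep counts are illustrative.

import numpy as np

def ising_samples(L=40, J=0.405, n_samples=10, sweeps_between=200, rng=None):
    """Return n_samples spin configurations of the 2D Ising model at T = 1."""
    rng = rng or np.random.default_rng(0)
    spins = rng.choice([-1, 1], size=(L, L))
    samples = []
    for _ in range(n_samples):
        for _ in range(sweeps_between * L * L):          # single-spin-flip updates
            i, j = rng.integers(L, size=2)
            nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j] +
                  spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2.0 * J * spins[i, j] * nn              # cost of flipping spin (i, j)
            if dE <= 0 or rng.random() < np.exp(-dE):
                spins[i, j] *= -1
        samples.append(spins.copy())
    return np.array(samples)

# Small demonstration run (a 40x40 lattice as in Fig. 4.6 simply takes longer).
print(ising_samples(L=16, n_samples=2, sweeps_between=20).shape)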


CHAPTER 5

CORRESPONDENCE BETWEEN RESTRICTED BOLTZMANN MACHINE AND TENSOR NETWORK STATES

In Chapter 2 we discussed machine learning basics and energy-based models from deep learning, more specifically Boltzmann machines and DBNs. The RBM is an important building block of deep learning models and finds a wide range of applications. In Chapter 4 we derived the relation between the RBM and RG. In the current chapter we construct a bridge between tensor network states (TNS) and RBMs. TNS are extensively used for quantum many-body systems. We will study an efficient algorithm to convert an RBM into the widely used TNS, and also the difficulties and conditions encountered if we move back from a TNS to a specific RBM architecture. Moreover, we can determine the expressive power of an RBM on complex datasets by using the concept of entanglement entropy (EE) in TNS. Exploring the TNS and its entanglement content may lead to a guiding principle for designing deep architectures. Conversely, an RBM can describe a quantum many-body system with relatively fewer parameters than a TNS, which enables us to simulate such systems more efficiently on classical computers.

5.1 Transformation of an RBM to TNS

This section is devoted to building a connection between RBMs and TNS. The importance of the TNS representation of an RBM is that it provides an upper limit on the EE the RBM can express; the structural information of the RBM alone is enough to estimate this bound. First we discuss an easy and illustrative method, and then present a more elegant approach that yields optimal bounds on the TNS bond dimensions.


Figure 5.1: Correspondence between RBM and TNS: (a) RBM representation as an undirected graphical model, as defined by Eq. 5.1. The blue circles denote the visible units v and the gray circles labelled h denote the hidden units; they interact with each other through links drawn as solid lines. (b) MPS described by Eq. 5.5; each dark blue dot represents a 3-index tensor A^{(i)}. From now on we use hollow circles to denote RBM units and filled ones to denote tensors; undirected lines in the RBM represent the link weights, while lines in the MPS denote the bond indices, with the thickness of a bond expressing its bond dimension. RBMs and TNS are both used to represent complex multi-variable functions, and both can describe any function with arbitrary precision given unlimited resources (unlimited hidden variables or bond dimensions). With limited resources, however, they represent two overlapping but distinct classes of functions.

Code is provided in [27]. Before going into details, we first mention that we follow the notation used in [1]. In that paper the probability distribution of the visible units is interpreted as a quantum wave function, as follows:

E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j,

where v = \{v_i\} and h = \{h_j\} take the binary values 0 and 1 (the convention required by the product form below and by Eqs. 5.2-5.4), and the parameters \{a_i, b_j, W_{ij}\} are complex numbers. The unnormalized quantum wave function is expressed as

\Psi_{RBM}(v) = \sum_h e^{-E(v,h)} = \prod_i e^{a_i v_i} \prod_j \big(1 + e^{\,b_j + \sum_i v_i W_{ij}}\big).   (5.1)

The common RBM architecture has dense connectivity between visible and hidden units, but here we use sparse connections (as shown in Fig. 5.1) to keep things clear; the approach we use is general and applies to any RBM architecture.
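The wave function of Eq. 5.1 is straightforward to evaluate; the sketch below (not the thesis code, nor the code of Ref. [27]) compares the closed-form product with a brute-force sum over binary hidden units, for arbitrary (here randomly chosen) parameters.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nv, nh = 4, 3                                        # sizes are illustrative
a = rng.normal(size=nv) + 1j * rng.normal(size=nv)   # complex parameters, as in Eq. 5.1
b = rng.normal(size=nh) + 1j * rng.normal(size=nh)
W = 0.3 * (rng.normal(size=(nv, nh)) + 1j * rng.normal(size=(nv, nh)))

def psi_closed_form(v):
    return np.prod(np.exp(a * v)) * np.prod(1.0 + np.exp(b + v @ W))

def psi_brute_force(v):
    energy = lambda h: -(a @ v) - (b @ h) - (v @ W @ h)     # E(v,h) of Sec. 5.1
    return sum(np.exp(-energy(np.array(h))) for h in product([0, 1], repeat=nh))

v = np.array([1, 0, 0, 1])
print(psi_closed_form(v), psi_brute_force(v))        # the two values agree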

5.1.1 Direct transformation from RBM to MPS

Let us start with a simple and intuitive example: transforming the RBM architecture given in Fig. 5.1 to a matrix product state (MPS). The first step of the conversion is to regard the hidden and visible units as the virtual


and physical indices, respectively. To achieve this, we separate out the Boltzmann weights and represent the biases at the individual vertices by \Gamma^{(i)}_v and \Gamma^{(j)}_h. Each connection between unit j of the hidden layer and unit i of the visible layer is encoded by a matrix M^{(ij)}:

\Gamma^{(i)}_v = \mathrm{diag}(1, e^{a_i}),   (5.2)

\Gamma^{(j)}_h = \mathrm{diag}(1, e^{b_j}),   (5.3)

M^{(ij)} = \begin{pmatrix} 1 & 1 \\ 1 & e^{W_{ij}} \end{pmatrix}.   (5.4)

This gives the TNS representation of the RBM of Eq. 5.1; the process is depicted in Fig. 5.2(a). The MPS-parameterized wave function with n_v physical degrees of freedom is given by

\Psi_{MPS}(v) = \mathrm{Tr}\prod_i A^{(i)}.   (5.5)

The second step is to map the TNS of Fig. 5.2(a) onto an MPS. First we divide the graph into n_v pieces, as depicted in Fig. 5.2, where n_v is the number of visible units; each piece contains one visible unit, and the assignment of hidden units to the pieces is arbitrary. After that, contract all the units in each region, which is equivalent to summing over the hidden units, and merge all the external links of each piece into a virtual bond; this yields the MPS shown in Fig. 5.2(c). The width of a virtual bond is determined by the number of links incorporated into that bond.

Notice that long-range connections, those crossing more than one vertical cut, are handled differently. In our current example the only long-range connection is from h_1 to v_4, as shown in Fig. 5.2(b). The matrix M^{(14)} defines this connection; we break it into any two 2×2 matrices satisfying the constraint M^{(14)} = M_1 M_2 and include these in the virtual bonds of the MPS. This represents the long-range connection by two short-range connections defined by M_1 and M_2, as shown in Fig. 5.2(d). M_1 and M_2 are then absorbed into the intermediate local tensors A^{(2)} and A^{(3)}, respectively. An extended connection doubles the bond degrees of freedom of every tensor it passes through. The bond dimension of a particular tensor in the MPS is determined by the number of links n cut by the vertical dotted line, i.e. D = 2^n.
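To make the construction of Eqs. 5.2-5.4 concrete, the following sketch (again not the code of Ref. [27]) builds the Γ and M tensors for a small, randomly parameterized RBM and checks that contracting the hidden indices reproduces Ψ_RBM of Eq. 5.1; for simplicity the parameters are taken real and the units binary.

import numpy as np

rng = np.random.default_rng(1)
nv, nh = 4, 3
a, b = rng.normal(size=nv), rng.normal(size=nh)
W = rng.normal(size=(nv, nh))

Gamma_v = [np.diag([1.0, np.exp(ai)]) for ai in a]          # Eq. 5.2
Gamma_h = [np.diag([1.0, np.exp(bj)]) for bj in b]          # Eq. 5.3
M = [[np.array([[1.0, 1.0], [1.0, np.exp(W[i, j])]]) for j in range(nh)]
     for i in range(nv)]                                    # Eq. 5.4

def psi_from_tensors(v):
    # product of visible vertex tensors ...
    amp = np.prod([Gamma_v[i][v[i], v[i]] for i in range(nv)])
    # ... times, for each hidden unit, the sum over its two values
    for j in range(nh):
        amp *= sum(Gamma_h[j][hj, hj] * np.prod([M[i][j][v[i], hj] for i in range(nv)])
                   for hj in (0, 1))
    return amp

def psi_rbm(v):                                             # Eq. 5.1
    v = np.array(v)
    return np.prod(np.exp(a * v)) * np.prod(1.0 + np.exp(b + v @ W))

v = [1, 0, 1, 1]
print(psi_from_tensors(v), psi_rbm(v))                      # the two values agree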


Figure 5.2: Stepwise mapping of an RBM to an MPS: (a) A TNS description of the RBM given in Fig. 5.1(a). The light blue circles denote the diagonal tensors \Gamma^{(i)}_v at the visible units, defined by Eq. 5.2, and the gray circles denote \Gamma^{(j)}_h at the hidden units, defined by Eq. 5.3. The orange diamonds denote the matrices M^{(ij)} of Eq. 5.4. (b) The RBM is divided into n_v pieces; for each long-range link an identity tensor (red ellipse) is inserted to subdivide M^{(ij)} into two matrices. (c) An MPS is obtained from the RBM by summing over all the hidden units belonging to each piece in (b); the number of links cut by a dashed vertical line determines the bond degrees of freedom of the MPS. (d) The matrix M^{(14)} corresponding to the long-range connection is broken into a product of two matrices M_1, M_2, represented by the light pink diamonds; the red ellipse shows the product of two identity matrices.

It should be highlighted that the mapping obtained with this approach is not unique, because it depends on how the hidden units are assigned to the pieces. Regardless of this geometrical dependence on the hidden units, all the resulting MPS are equivalent. This approach also introduces redundancy (in terms of degrees of freedom) in the local tensors; by performing a canonical transformation on the local tensors A one can obtain a unique form of the MPS [28, 29]. We will discuss the redundancy of the MPS later in this chapter.

5.1.2 Optimum MPS representation of an RBM

The method given in the previous subsection is easy to understand but not optimal. Let us now consider a method that provides an MPS description with the optimum bond degrees of freedom. As already said, the RBM is an undirected probabilistic graphical model. We separate out two sets of variables X_g and Y_g that are conditionally independent given a set of variables Z_g.


Algorithm 2: Direct transformation from RBM to MPS.
1: RBM2MPS1(W, a, b)
2: W: weight matrix (N_hidden × N_visible) of the RBM.
3: a: bias vector for the visible units.
4: b: bias vector for the hidden units.
5: Output: an MPS with local tensors A^{(i)}, i = 1, 2, ..., nv.
6: The RBM is divided into nv slices S_i, i = 1, 2, ..., nv. Each slice S_i contains one visible unit v_i and several hidden units.
7: for all visible units i do
8:     Set T_i = ∅ (the collection of tensors in S_i).
9:     Construct Γ_v^{(i)} using Eq. 5.2.
10:    for all h_j ∈ S_i do
11:        Construct Γ_h^{(j)} using Eq. 5.3.
12:        Add Γ_h^{(j)} to T_i.
13:    end for
14: end for
15: for all non-zero weights do
16:    Construct M^{(ij)} using Eq. 5.4.
17:    Break M^{(ij)} into a product of matrices and place these matrices into the corresponding slices (Fig. 5.2(d)).
18: end for
19: for all visible units i do
20:    A^{(i)} ← sum over all the internal indices of the tensors in T_i.
21: end for
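To make the building blocks concrete, here is a minimal Python sketch (an assumed toy case, not the thesis code) for a single visible-hidden pair, assuming the standard RBM weight Ψ(v) = Σ_h exp(a v + b h + v W h), which Eqs. 5.2-5.4 are taken here to encode; summing the hidden index of Γ_v M Γ_h reproduces the RBM weight.

```python
import numpy as np

# Minimal sketch: building blocks of the RBM-to-MPS map for one visible unit v
# and one hidden unit h, both binary in {0, 1}.
a, b, W = 0.3, -0.7, 1.1                     # assumed RBM parameters

Gamma_v = np.diag([1.0, np.exp(a)])          # diagonal tensor on the visible unit (Eq. 5.2)
Gamma_h = np.diag([1.0, np.exp(b)])          # diagonal tensor on the hidden unit  (Eq. 5.3)
M       = np.array([[1.0, 1.0],
                    [1.0, np.exp(W)]])       # coupling matrix M_{vh} = exp(v*W*h) (Eq. 5.4)

# Summing out the hidden index reproduces the RBM weight for each v.
for v in (0, 1):
    mps_value = sum(Gamma_v[v, v] * M[v, h] * Gamma_h[h, h] for h in (0, 1))
    rbm_value = sum(np.exp(a * v + b * h + v * W * h) for h in (0, 1))
    assert np.isclose(mps_value, rbm_value)
```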

This conditional-independence statement is written as

X_g ⊥ Y_g | Z_g.    (5.6)

To separate X_g and Y_g, we determine the set Z_g with the minimum number of elements that satisfies Eq. 5.6. After the mapping, Z_g serves as the MPS virtual bond. The bond dimension between the two separated sets X_g and Y_g is determined by the size of Z_g, denoted |Z_g|:

D = 2^{|Z_g|}.    (5.7)

The algorithm for the optimal translation via conditional independence is given as Algorithm 3. We begin from the left and progress towards the right, constructing the tensors one by one with the minimum bond (virtual) dimension. The virtual bonds created in this fashion carry the degrees of freedom of hidden or visible units of the RBM.


Figure 5.3: The optimal MPS representation of an RBM. (a)-(e) depict the stepwise construction. The set X_g is denoted by a light yellow rectangle/triangle and the set Y_g by a dark blue rectangle. The set Z_g, which provides the conditional independence of X_g and Y_g, is represented by a light green rectangle. When Z_g is given, the RBM function, interpreted as a probability, can be written as a product of two functions, one depending on X_g and the other on Y_g. The variables in Z_g define the virtual bond of the MPS. The light gray lines show the connections already included in previous tensors. The connections being considered in the current step are drawn as dotted lines and correspond to G_t in Algorithm 3. (f) The resulting MPS.

To demonstrate the translation algorithm, let's apply this method to the RBM given in Fig. 5.1(a). We begin with Fig. 5.3(a), where X_g = {v1} and Y_g = {v2, ..., v6}. It is easy to see that the minimal set Z_g = {h1} or Z_g = {v1} is enough to satisfy the condition in Eq. 5.6. Assume we take Z_g = {v1}; this implies G_t = H_t = ∅, and the initial tensor A^{(1)} of the MPS can be specified as an identity matrix that simply copies the visible unit v1 to its virtual bond on the right.

Moving forward, we include v2 in the set X_g, so X_g = {v1, v2} and the remaining visible variables are Y_g = {v3, ..., v6}, as shown in Fig. 5.3(b). Notice that there are four possible minimal sets Z_g; we choose Z_g = {h1, h2}, for example. All the connections to be merged into the tensor A^{(2)} are shown by dotted lines in Fig. 5.3(b) and collected in G_t in Algorithm 3; here H_t = ∅. The set Z_g = {h1, h2} becomes the right bond of A^{(2)}. The left virtual bond of A^{(2)} and the right virtual bond of A^{(1)} are the same.


Now move to Fig. 5.3(c). The light gray lines represent the interactions that have already been taken into account in the previous tensors. According to Algorithm 3, this is the third iteration, and line 7 gives X_g = {h1, h2} ∪ {v3}. Here we have several options for a minimal set Z_g with size |Z_g| = 2; we choose Z_g = {v3, v4}. The set G_t contains all the connections between Z_g and {h1, h2}. All these couplings are incorporated in the tensor A^{(3)}, whose right virtual bond consists of Z_g. No hidden unit needs to be summed up, so H_t = ∅.

In Fig. 5.3(d), Z_g is composed of v5 and h4. We observe that h3 cannot interact with the whole set Y_g once Z_g is given, so H_t = {h3} and h3 is traced out during the construction of A^{(4)}.

During the construction of the MPS, each interaction in the RBM is considered only once. Hence we end up with an MPS consisting of 6 tensors, whose virtual bond degrees of freedom are labeled in Fig. 5.3(f). Notice that there is no need to run the algorithm numerically to obtain the optimal virtual bond dimension of the resulting MPS. Furthermore, the algorithm also works for any undirected graphical model.

5.1.3 Inference of RBM to MPS mapping

The translation of an RBM into a TNS implies that the expressive power of the RBM can be quantified using tools from the TNS, in particular the concept of EE. To define the EE of the function Ψ of an RBM or MPS, we split the set of visible units into two subsets, X_g and Y_g. The EE between X_g and Y_g is given by

S = -Tr ρ ln ρ,    (5.8)

where ρ is the reduced density matrix

ρ_{v_x v_x'} = Σ_{v_y} Ψ(v_x, v_y) Ψ*(v_x', v_y).    (5.9)

Here v_y runs over the configurations of the visible units in Y_g, while v_x labels the configurations of the visible units in X_g. The EE quantifies the information content of Ψ. For an MPS the EE is easy to calculate, and it is bounded by the bond dimension D as S ≤ ln D. For a more detailed discussion of the EE we refer to Chapter 3.
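The definition in Eqs. 5.8-5.9 is easy to check numerically. The following Python sketch (using an assumed random amplitude purely as a stand-in for an RBM or MPS wave function) computes the reduced density matrix of a bipartition and its entropy:

```python
import numpy as np

# Minimal sketch (assumed toy amplitude): entanglement entropy of a bipartition
# of the visible units, computed from Psi via Eqs. 5.8-5.9.
nx, ny = 3, 3                                # visible units in X_g and Y_g
rng = np.random.default_rng(0)
psi = rng.random(2 ** (nx + ny))             # stand-in for the amplitude Psi(v)
psi /= np.linalg.norm(psi)

# Reshape Psi(v_x, v_y) into a matrix and trace out Y_g (Eq. 5.9).
psi_mat = psi.reshape(2 ** nx, 2 ** ny)
rho = psi_mat @ psi_mat.conj().T             # reduced density matrix on X_g

# S = -Tr(rho ln rho) via the eigenvalues of rho (Eq. 5.8).
w = np.linalg.eigvalsh(rho)
w = w[w > 1e-12]
S = -np.sum(w * np.log(w))
print(S, "<=", min(nx, ny) * np.log(2))      # bounded by min(nx, ny) * ln 2
```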

To evaluate the expressive ability of an RBM, we first need to find its MPS representation with optimal bond dimensions. The bond dimension produced by the direct approach of Fig. 5.2 is higher than what is actually needed. For instance, as shown in Fig. 5.2(c), the bond dimension of the second tensor is D = 8, even though the two visible units at the left are enough for the conditional independence property to hold, and these two variables carry only four degrees of freedom.


Algorithm 3: RBM to MPS mapping with optimal bond dimension.
1: RBM2MPS2(W, a, b)
2: W: weight matrix (N_hidden × N_visible) of the RBM.
3: a: bias vector for the visible units.
4: b: bias vector for the hidden units.
5: Output: an MPS with local tensors A^{(i)}, i = 1, 2, ..., nv.
6: G_s = {(i, j) | W_{ij} ≠ 0} (the whole connected graph).
7: H_s = {j | (i, j) ∈ G_s} (the entire set of hidden units).
8: Z'_g = ∅ (left virtual bond).
9: for all visible units i do
10:    Set G_t = ∅ (connections to be included in A^{(i)}).
11:    Set H_t = ∅ (hidden variables to be summed in A^{(i)}).
12:    X_g = Z'_g ∪ {v_i}.
13:    Y_g = {v_{i+1}, ..., v_{nv}} (the remaining physical variables).
14:    Find the minimum set Z_g from G_s that satisfies Eq. 5.6.
15:    for all h_j ∈ H_s do
16:        if h_j is disconnected from (Y_g \ Z_g) then
17:            Move j from H_s to H_t (h_j will be summed up in A^{(i)}).
18:        end if
19:    end for
20:    for all (p, q) ∈ G_s do
21:        if v_p and h_q belong to X_g ∪ Z_g ∪ H_t then
22:            Move (p, q) from G_s to G_t (the connection between v_p and h_q will be incorporated in A^{(i)}).
23:        end if
24:    end for
25:    A^{(i)}_{Z', Z}[v_i] = Σ_{h_j ∈ H_t} exp( a_i v_i + Σ_{(p,q) ∈ G_t} v_p W_{pq} h_q + Σ_{j ∈ H_t} b_j h_j ).
26:    Z'_g ← Z_g.
27: end for

So here D = 8 is more than what is needed to capture the entanglement. The method given in Sec. 5.1.2 is an improvement: it does not depend on the assignment of the hidden units and it yields the optimal bond dimension.

Fig. 5.4(a) depicts the RBM of Fig. 5.1 after summing out the entire set of hidden variables. The curved lines represent the effective couplings between visible variables mediated by the hidden variables. If we split the set of visible variables into two parts, X_g and Y_g = Y_1 ∪ Y_2, where Y_1 contains the visible variables that have a direct connection to X_g and Y_2 consists of the remaining visible units, then Eq. 5.1 can be written as


Figure 5.4: (a) The RBM after summing out the entire set of hidden units. (b) The curved lines represent the connections between visible units mediated by the hidden units. The whole system is split into two parts, X_g and Y_g, and the second is further split as Y_g = Y_1 ∪ Y_2, where Y_1 contains the visible variables that are directly connected to X_g; the alternative way of splitting is shown as well. (c) The MPS provided by this method has a smaller bond dimension compared to Fig. 5.2.

Ψ_RBM(v) = ψ(v_x, v_{y_1}) φ(v_{y_1}, v_{y_2}).    (5.10)

In this form the RBM wave function factorizes, for fixed v_{y_1}, into a product of a function of the visible variables in X_g and a function of those in Y_2. The EE between the regions X_g and Y_g is therefore bounded by the number of visible units in Y_1, denoted |Y_1|: the dimension of the bond at the separation between X_g and Y_g is D = 2^{|Y_1|}. On the other hand, we can also divide X_g into two sets, X_1 and X_2, and apply the same argument, as shown in Fig. 5.4. Hence the EE between X_g and Y_g is bounded by S_max = min(|X_1|, |Y_1|) ln 2.
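The bound S_max = min(|X_1|, |Y_1|) ln 2 can be read off directly from the RBM connectivity. Below is a minimal Python sketch for an assumed toy weight pattern (the matrix W and the cut position are illustrative choices, not taken from the thesis):

```python
import numpy as np

# Minimal sketch: bound the EE across a cut of the visible units directly from
# the RBM connectivity.  W[j, i] != 0 means hidden unit j couples to visible i.
W = np.array([[1, 1, 0, 0, 0, 0],            # h1 -- v1, v2
              [0, 1, 1, 0, 0, 0],            # h2 -- v2, v3
              [0, 0, 1, 1, 1, 0],            # h3 -- v3, v4, v5
              [0, 0, 0, 0, 1, 1]])           # h4 -- v5, v6
cut = 3                                      # X_g = {v1..v3}, Y_g = {v4..v6}

crossing = [j for j in range(W.shape[0])
            if W[j, :cut].any() and W[j, cut:].any()]   # hidden units spanning the cut
X1 = {i for j in crossing for i in range(cut) if W[j, i]}
Y1 = {i for j in crossing for i in range(cut, W.shape[1]) if W[j, i]}
S_max = min(len(X1), len(Y1)) * np.log(2)
print(f"S_max <= {S_max:.3f}")
```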

The resulting MPS (shown in Fig. 5.4(c)) has a tighter bound on the bond dimension compared to Fig. 5.3(c). An even tighter bound can be attained by finding the minimal set of variables, whether visible or hidden, that factorizes the two functions. Code implementing these mappings is given in [27].

The bond degrees of freedom of the emerging MPS govern the maximum EE between the visible variables. Thus the EE gives a precise measure of the expressiveness of an RBM based solely on its architecture, and estimating these bond dimensions is an easy task. Through canonicalization the MPS bond dimension can be reduced further. Likewise, by arranging the visible units in two dimensions, the RBM can be mapped


to a PEPS. An equivalent method was used in [30] to transform a MERA into a PEPS. This procedure is convenient for RBMs trained on two-dimensional grids (for example, images).

If we denote by m the upper bound on the number of units that have a direct connection across the cut in the RBM, then the upper limit of the EE should scale as

S_max ∼ m V^{(d-1)/d},    (5.11)

where V represents the volume and d is the dimension of the space on which the TNS is defined. Hence an RBM with sparse connectivity respects the area law. In contrast, for an RBM with dense connections the region cut by the bipartition extends over the entire system, the relation between m and V becomes m ∼ V^{1/d}, and thus S_max ∼ V. This implies that a densely connected RBM is capable of representing highly entangled quantum states that violate the area law, with only a polynomial increase in the number of parameters with system size, whereas an MPS or PEPS representation of such a state requires an exponential increase in the number of parameters [31, 32]. This justifies the variational computation of quantum systems using RBM functions [1]. We will discuss this further later in this chapter.

The RBM-to-MPS mapping used so far can be extended to the general Boltzmann machine without imposing any restriction on the connectivity.

5.2 Representation of TNS as RBM: sufficient and necessary conditions

Now we turn to the reverse process, i.e. the transformation of a TNS into a given RBM architecture. We consider only a prescribed architecture because, apart from that, any function can be reproduced by an RBM with exponentially large resources [33]. At the end of this section we will use a practical approach for this mapping and demonstrate an example.

Let us take the MPS given in Fig. 5.1(b) and find its parametric representation as an RBM with the architecture given in Fig. 5.5(a). The MPS has 6 sites and there are nh = 4 hidden units in the RBM architecture. It should be emphasized that the RBM architecture we would like to obtain after the conversion is fixed in advance, and we want to know whether the conversion is possible for this architecture. The hidden layer factorizes into a product of 4 tensors. One tensor, defined on h1, has dimensions 2×2×2×2 and all


the remaining tensors have dimensions 2×2×2. We require the product of these four tensors to equal the MPS:

Tr ∏_i A^{(i)}[v_i] = T^{(1)}_{v_1 v_2 v_3 v_4} T^{(2)}_{v_2 v_3 v_4} T^{(3)}_{v_3 v_4 v_5} T^{(4)}_{v_4 v_5 v_6}.    (5.12)

Taking the logarithm of this equation gives 2^{nv} linear equations for the 40 parameters, so this linear system is overdetermined. For the translation of the MPS into an RBM we need this system of linear equations to have a solution. If no solution exists, the architecture of the RBM has to be changed. We can balance the number of equations against the number of parameters by varying the number of connections and hidden units.

Assuming one has a solution of Eq. 5.12, the next step is to decompose each tensor into an RBM form (recall that each tensor carries one hidden unit). For instance, T^{(2)}_{v_2 v_3 v_4} is decomposed as

T^{(2)}_{v_2 v_3 v_4} = Σ_{h_2 ∈ {0,1}} exp( h_2 b_2 + Σ_{k ∈ {2,3,4}} v_k (W_{k2} h_2 + a_{k2}) ),    (5.13)

where a_{kj} denotes the component of the bias of the k-th visible variable shared with the j-th hidden variable; the bias of the k-th visible variable is a_k = Σ_j a_{kj}. As shown in Fig. 5.5(b), this three-index tensor has 7 parameters, which have to be determined by solving 2^3 equations. The number of variational parameters grows linearly with the order of the tensor T, but the number of equations grows exponentially with it. Typically, Eq. 5.13 has no solution because the system is overdetermined, though solutions exist in exceptional cases.

In practice, Eq. 5.13 can be obtained from a minimum-rank tensor rank decomposition [34] of T^{(2)}:

T^{(2)}_{v_2, v_3, v_4} = Σ_{h_2 ∈ {0,1}} X_{v_2 h_2} Y_{v_3 h_2} Z_{v_4 h_2}.    (5.14)

Here X, Y, Z are all 2×2 matrices, since the hidden and visible units are binary. Because the hidden unit is binary, the rank of T^{(2)} must be 2. The tensor rank also depends on the underlying field: for a 2×2×2 tensor a rank-2 decomposition is generic over the complex field, whereas over the real field the rank can be as large as 3, and determining the rank of an arbitrary tensor is in general a hard task [34]. The higher-order SVD [35] provides a lower bound on the rank of the tensor via the decomposition of the core tensor. If the rank is greater than 2, the condition that the hidden unit be binary cannot be satisfied.


Figure 5.5: TNS to RBM transformation: graphical representation of (a) Eq. 5.12 and (b) Eqs. 5.13 and 5.14.

After decomposing the tensor, each matrix can be factorized into a product of three matrices. For instance, for the matrix X, using Eqs. 5.2-5.4,

X = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{pmatrix}
  = x_{11} \begin{pmatrix} 1 & 0 \\ 0 & \frac{x_{21}}{x_{11}} \end{pmatrix}
           \begin{pmatrix} 1 & 1 \\ 1 & \frac{x_{11} x_{22}}{x_{12} x_{21}} \end{pmatrix}
           \begin{pmatrix} 1 & 0 \\ 0 & \frac{x_{12}}{x_{11}} \end{pmatrix},    (5.15)

and

W_{22} = \ln \frac{x_{11} x_{22}}{x_{12} x_{21}},    (5.16)
a_{22} = \ln \frac{x_{21}}{x_{11}},    (5.17)
b_{22} = \ln \frac{x_{12}}{x_{11}}.    (5.18)

Here b_{kj} is a partial bias, analogous to a_{kj}, interpreted as the component of the bias of the k-th hidden variable contributed by the j-th visible variable; the bias of the k-th hidden variable is b_k = Σ_j b_{kj}. Using this method, all the tensors in Eq. 5.12 can be written as an RBM.
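Equations 5.15-5.18 are easy to verify numerically. The sketch below (with an assumed positive 2×2 test matrix X) extracts W22, a22 and b22 and checks that the RBM form reproduces every entry of X:

```python
import numpy as np

# Minimal sketch (assumed positive test matrix): extract the RBM parameters
# W22, a22, b22 via Eqs. 5.16-5.18 and check that
# X[v, h] = x11 * exp(a22*v + W22*v*h + b22*h) for binary v, h (Eq. 5.15).
X = np.array([[1.0, 0.4],
              [2.5, 3.0]])                   # entries must be positive for the logs
x11, x12, x21, x22 = X[0, 0], X[0, 1], X[1, 0], X[1, 1]

W22 = np.log(x11 * x22 / (x12 * x21))        # Eq. 5.16
a22 = np.log(x21 / x11)                      # Eq. 5.17
b22 = np.log(x12 / x11)                      # Eq. 5.18

for v in (0, 1):
    for h in (0, 1):
        assert np.isclose(X[v, h], x11 * np.exp(a22 * v + W22 * v * h + b22 * h))
```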

Hence, the necessary and sufficient condition for an MPS to have an RBM representation is that both Eqs. 5.12 and 5.13 are solvable. For an RBM with just a single


hidden variable, Eq. 5.12 simply reinterprets the MPS as the wave function itself. The decomposition of the tensor in Eq. 5.13 as a rank-2 tensor is the more difficult condition to meet. By varying the parameters of the RBM (the connections or the number of hidden units), one can find an architecture for which both equations have a solution. This is in line with the mathematical statement that an RBM can describe any function when exponential resources (parameters) are provided.

In practice, a convenient way to determine directly whether a TNS has a specific RBM description is to test the factorization property defined in Eq. 5.10.

5.2.1 Examples

Having discussed a sufficient condition for one- or two-dimensional tensor networks (MPS or PEPS) to be described as an RBM, we note that various interesting physical quantum or thermal states can be transformed under this mapping, for instance the toric code model and the statistical Ising model in an external field.

A sufficient mathematical condition for an MPS to be representable as an RBM is the following:

A_{ij}[v] = L_{iv} R_{vj}.    (5.19)

This constraint must hold for every tensor in the MPS, where L and R are left and right matrices of dimension 2×2. The product of the two matrices R and L (the dashed box in Fig. 5.6) can then be replaced by a hidden unit connected to two visible units in the RBM. The bias of each visible variable

is halved because it is distributed over two consecutive boxes. From Eqs. 5.2-5.4, the RL matrix can be written in terms of the RBM parameters as

RL = \begin{pmatrix} 1 & 0 \\ 0 & e^{a/2} \end{pmatrix}
     \begin{pmatrix} 1 & 1 \\ 1 & e^{w} \end{pmatrix}
     \begin{pmatrix} 1 & 0 \\ 0 & e^{b} \end{pmatrix}
     \begin{pmatrix} 1 & 1 \\ 1 & e^{w} \end{pmatrix}
     \begin{pmatrix} 1 & 0 \\ 0 & e^{a/2} \end{pmatrix}.    (5.20)

Here the weights and biases are chosen to be equal for simplicity, and this decomposition is not unique.

One example is the Ising model in one dimension. The partition function with coupling constant K and external field H is

Z = Σ_{{s_i}} e^{K Σ_{⟨i,j⟩} s_i s_j + H Σ_i s_i}.


Figure 5.6: The matrix A defined in Eq. 5.19 has the special form represented by the dashed box. The blue dots denote identity matrices and the square boxes represent the left and right matrices. To obtain the RBM parameters we apply the MPS-to-RBM transformation according to Eq. 5.20.

Here the s_i are Ising spins, each taking the two values ±1, which can be transformed into binary variables via v_i = (s_i + 1)/2. The one-dimensional Ising model is easy to represent as an MPS, as shown in Fig. 5.6. The RL matrix on an individual bond can be defined as

RL = \begin{pmatrix} e^{K+H} & e^{-K} \\ e^{-K} & e^{K-H} \end{pmatrix}.    (5.21)
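As a sanity check of Eq. 5.21, note that RL is simply the transfer matrix of the 1D Ising chain, so Tr(RL^N) reproduces the partition function. The short Python sketch below (with assumed values of K, H and N) verifies this against a brute-force sum:

```python
import numpy as np
from itertools import product

# Minimal sketch: the RL matrix of Eq. 5.21 is the 1D Ising transfer matrix,
# so Tr(RL^N) equals the partition function Z of a periodic chain of N spins.
K, H, N = 0.7, 0.3, 6

RL = np.array([[np.exp(K + H), np.exp(-K)],
               [np.exp(-K),    np.exp(K - H)]])
Z_transfer = np.trace(np.linalg.matrix_power(RL, N))

# Brute-force sum over all 2^N spin configurations s_i = +/-1.
Z_brute = 0.0
for s in product([1, -1], repeat=N):
    E = sum(K * s[i] * s[(i + 1) % N] + H * s[i] for i in range(N))
    Z_brute += np.exp(E)

assert np.isclose(Z_transfer, Z_brute)
```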

Equating the two expressions for RL, Eqs. 5.20 and 5.21, one obtains the parameters of the RBM. This procedure can be extended to two-dimensional tensor networks, for example PEPS, where each hidden unit is again connected to two visible units. In contrast to the one-dimensional case, the visible biases become a/4 and H becomes H/2.

These explorations show that a very modest sparse RBM, with equal numbers of hidden and visible units in the 1D case (nh = 2nv in 2D), is enough to replicate the thermal probability distribution of the Ising model. Notice that this construction does not depend on the value of the couplings and is valid even at criticality.

5.3 Implications of the RBM-TNS correspondence

5.3.1 RBM optimization by using tensor network methods

Analogous to TNS, neural networks, including RBM functions, commonly contain redundancy in their degrees of freedom: RBMs with entirely distinct architectures can represent the same function. Well-known tensor network techniques can therefore be used to remove this redundancy from an RBM.


In one dimension, for instance, the canonicalization of an MPS can be used to find an optimal RBM. The first step is to map the RBM to an MPS using the method discussed in Sec. 5.1. The MPS is then canonicalized so that each tensor has the minimum bond dimension, by truncating (approximately) zero singular vectors; this also fixes the gauge of the MPS to some extent. Lastly, the simplified TNS is translated back into an RBM using the method discussed in Sec. 5.2. The resulting RBM represents the same function as the original one, but is optimal.

To illustrate this optimization process, let's take the RBM representation of the cluster state discussed in [2]. The architecture is shown in Fig. 5.7. This RBM architecture can be mapped to an MPS with D = 4 (each partition cuts 2 links). The

Figure 5.7: The RBM architecture used to represent the cluster state. It has local connections; each hidden unit is connected to three visible units.

canonical transformation of this MPS results in an MPS with D = 2. Translating the D = 2 MPS back, the derived RBM has a hidden layer in which every hidden variable is linked to only two visible variables. A higher-dimensional RBM can also be optimized by translating it into a PEPS, whose bond dimension can then be reduced using the SVD.
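A minimal Python sketch of the canonicalization step is given below. It builds an MPS whose D = 4 bonds are redundant by construction (the rank-2 "compressor" R is an assumed toy device, not related to the cluster-state RBM) and shows that a left-canonical sweep with truncation of (near-)zero singular values reduces the bond dimensions:

```python
import numpy as np

# Minimal sketch: left-canonicalize an MPS by sweeping SVDs and truncating
# (near-)zero singular values, removing redundant bond degrees of freedom.
def canonicalize(tensors, tol=1e-12):
    """Each tensor has shape (D_left, d, D_right); returns a truncated MPS."""
    out = []
    carry = np.eye(tensors[0].shape[0])          # factor carried to the next site
    for A in tensors:
        A = np.tensordot(carry, A, axes=(1, 0))  # absorb the leftover factor
        Dl, d, Dr = A.shape
        U, s, Vt = np.linalg.svd(A.reshape(Dl * d, Dr), full_matrices=False)
        keep = s > tol * s[0]                    # drop (approximately) zero vectors
        out.append(U[:, keep].reshape(Dl, d, int(keep.sum())))
        carry = np.diag(s[keep]) @ Vt[keep]      # push the remainder to the right
    out[-1] = np.tensordot(out[-1], carry, axes=(2, 0))
    return out

# A D = 4 MPS that secretly has rank 2 on every internal bond.
rng = np.random.default_rng(1)
R = rng.random((4, 2)) @ rng.random((2, 4))      # rank-2 "compressor"
mps = [np.einsum('ab,bsc,cd->asd', R, rng.random((4, 2, 4)), R) for _ in range(5)]
mps[0], mps[-1] = mps[0][:1], mps[-1][..., :1]   # open boundaries: D = 1 at the ends
print([T.shape[2] for T in canonicalize(mps)])   # internal bonds drop from 4 to 2
```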

5.3.2 Unsupervised learning from an entanglement perspective

A natural outcome of the relationship between RBM and TNS is a quantum-entanglement point of view on unsupervised probabilistic models. According to a theorem in machine learning [33], adding hidden units to an RBM improves the model, so one can approximate any function to any accuracy, but unlimited resources may be needed. One can estimate the required resources, i.e. the number of hidden units/parameters, by introducing the concept of EE for a real data set or, equivalently, the effective bond degrees of freedom of a TNS.

What do we mean by the EE of a real-valued dataset? A probability amplitude Ψ(v) = √P(v) (as used in quantum mechanics) can be defined from P(v), where P(v) is the probability distribution over the dataset. This definition of EE for


real-valued data is meaningful: similar to classical information measures, it accounts for the complexity of the data. Bringing the concept of EE into machine learning provides a practical and convenient way to quantify the difficulty of unsupervised learning tasks, and it further assists in modeling as well as understanding quantum many-body systems. These implications apply in particular to machine-learning generative models motivated by quantum mechanics, where the authors showed that the square of a wave function can be used to model the probability [36].

Consider a dataset of natural images: typically the couplings between pixels are strong only for neighbouring pixels, which implies that the EE introduced above is comparatively small. As a result, there is no need for an RBM with dense connections to model such a dataset. In fact, Mocanu and co-workers have shown [37] that an RBM with sparse connections performs better than a fully connected one. Furthermore, the entanglement distribution varies from point to point in space. The quantification of entanglement can thus be used as a guiding principle for designing a neural network architecture.

One more advantage of the proposed quantum entanglement for real-valued datasets is that the connection between RBM and TNS may allow one to carry strategies that emerged in quantum mechanics straight over to machine learning. For instance, it is easy to determine an upper bound on the EE of an RBM by transforming it into a TNS and counting the bond dimension. When a TNS is used directly to learn a dataset, the EE can be used to quantify the difficulty of the learning task [38].

5.3.3 Entanglement: a measure of the effectiveness of deep learning compared to shallow learning

The mapping discussed in Sec. 5.1 is more general and can be performed on a general Boltzmann machine without restrictions on the graph. In particular, applying this method to the deep Boltzmann machine (DBM) clarifies the difference between deep and shallow architectures.

Both architectures, RBM and DBM, are shown in Fig. 5.8 with the same number of visible units nv, hidden units nh = 3nv and connections 9nv. In contrast to the RBM, the DBM architecture consists of multiple layers, with the connections distributed among consecutive layers, as depicted in Fig. 5.8(b). Following Sec. 5.1.3, the entire system can be divided into two parts by fixing the units enclosed by the dashed rectangles in Fig. 5.8. The resulting bond dimension is D = 16 for the DBM and D = 4 for the RBM. Hence the DBM is capable of representing more complex functions than the RBM when the same number of parameters is given.


Figure 5.8: Effectiveness of a deep network compared to a shallow network: (a) an RBM and (b) a DBM, both with the same nv (blue circles), nh = 3nv and number of connections 9nv. The approach discussed in Sec. 5.1 can be used to represent both architectures as an MPS, with bond dimension D = 2^4 for the DBM and D = 2^2 for the RBM. The dashed rectangle depicts the minimum number of visible units that have to be fixed in order to cut the system into two subsystems. The bond dimension shows that the DBM can encode more entanglement than the RBM when an equal number of hidden units and parameters is provided.

In general, the deep hidden units induce long-range effective links between the visible units of a DBM, and therefore more entanglement capacity than an RBM with an equal number of weights and hidden variables. By employing the mapping to TNS, one can analyze and compare the expressive power of various neural architectures.


APPENDIX A

MPS EXAMPLES IN MATHEMATICA

A.1 W State
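For reference, a standard bond-dimension-2 MPS form of the (unnormalized) N-qubit W state |W⟩ = |10···0⟩ + |01···0⟩ + ··· + |00···1⟩, written here in one common convention (a reminder of the construction, not a transcript of the Mathematica worksheet), is

|W⟩ = Σ_{s_1,...,s_N} l^T A^{[s_1]} A^{[s_2]} ··· A^{[s_N]} r |s_1 s_2 ··· s_N⟩,

with

A^{[0]} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},  A^{[1]} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},  l = \begin{pmatrix} 1 \\ 0 \end{pmatrix},  r = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.

Since A^{[1]} A^{[1]} = 0, only configurations with exactly one s_i = 1 survive the boundary contraction.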


A.2 GHZ State
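For reference, a standard bond-dimension-2 MPS form of the (unnormalized) N-qubit GHZ state |GHZ⟩ = |00···0⟩ + |11···1⟩, written here with periodic boundary conditions (again a reminder of the standard construction), is

|GHZ⟩ = Σ_{s_1,...,s_N} Tr( A^{[s_1]} A^{[s_2]} ··· A^{[s_N]} ) |s_1 s_2 ··· s_N⟩,  A^{[0]} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},  A^{[1]} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.

The trace is nonzero only when all s_i are equal, which reproduces exactly the two GHZ components.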


APPENDIX B

RENORMALIZATION GROUP EXAMPLE AND CODE DESCRIPTION

B.1 1D Ising model

The procedure we are going to discuss is quite general and applies to systems with nearest-neighbour interactions. Several of the ideas of RG theory can be demonstrated with the 1D Ising model, so let's begin with that physical system, despite the fact that it shows no criticality or phase transition. When no external magnetic field is present, the partition function Z is given by

Z(K, N) = Σ_{v_1, v_2, ..., v_N = ±1} e^{K(··· + v_1 v_2 + v_2 v_3 + v_3 v_4 + ···)},    (B.1)

where K = βJ. The RG is a procedure in which we get rid of some degrees of freedom by tracing them out. This is entirely different from the MFT approach, in which very few degrees of freedom are removed explicitly. In particular, summing over the even-numbered spins in Eq. B.1 gives

Z(K, N) = Σ_{v_1, v_3, v_5, ··· = ±1} ∏_{i=1}^{N/2} ( e^{K(v_i + v_{i+2})} + e^{-K(v_i + v_{i+2})} ).    (B.2)

By doing this we have eliminated half of the degrees of freedom, as shown in Fig. B.1.

The next step in RG theory is to express the partially summed Z as the (original) partition function of an Ising model with N/2 degrees of freedom and possibly different


Figure B.1: Tracing out even degrees of freedom.

coupling constants. If this rescaling/coarse-graining is possible, then we have a recursion relation for Z. The recursion relation allows us to compute Z for the system after yet another rescaling: we only have to perform the procedure once, and the resulting set of recursive equations lets us move everywhere in coupling space. In particular, we require a function g(K) and a new coupling constant K' such that

e^{K(v+v')} + e^{-K(v+v')} = g(K) e^{K' v v'},  for all v, v' = ±1.

If we succeed in finding g(K), then

Z(K, N) = [g(K)]^{N/2} Z(K', N/2),    (B.3)

which is the recursion relation we aim for; it is called the Kadanoff transformation. To find g(K) and K', insert all four combinations of the two spins into the Kadanoff transformation relation; this leads to two equations in two unknowns, with solutions

K' = (1/2) ln( cosh(2K) ),
g(K) = 2 √(cosh(2K)).    (B.4)

Define ln Z = N f(K); up to a factor of −k_B T, N f(K) is the free energy. With the help of the above equations we get

f(K') = 2 f(K) − ln( 2 √(cosh(2K)) ).    (B.5)

Eqs. B.4 and B.5 are called the RG equations: they possess the group property and carry out the renormalization.

We now discuss the results and the conceptual ideas of renormalization flows: consecutive RG transformations generate a flow in coupling space. First note that the new coarse-grained coupling K' generated by these RG equations is always smaller than K. To see the fixed points, take the limit K → 0 in Eq. B.4, which gives K' ≈ K², and at the other end, K → ∞, we get K' ≈ K − (1/2) ln 2. The two points K = 0, ∞ are fixed points, but the first is stable and the second is unstable, as shown in Fig. B.2.
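The flow towards the stable fixed point can be seen by simply iterating Eq. B.4. A minimal Python sketch (with assumed starting couplings) is:

```python
import numpy as np

# Minimal sketch: iterate the 1D Ising RG recursion of Eq. B.4,
# K' = (1/2) ln cosh(2K), and watch the flow run from any finite K towards
# the stable fixed point K = 0 (K = infinity is the unstable fixed point).
def rg_step(K):
    return 0.5 * np.log(np.cosh(2.0 * K))

for K0 in (0.5, 2.0, 5.0):                   # assumed starting couplings
    K, flow = K0, [K0]
    for _ in range(8):
        K = rg_step(K)
        flow.append(K)
    print(f"K0 = {K0}: " + " -> ".join(f"{k:.4f}" for k in flow))
```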


Figure B.2: RG flow in coupling space, depicting the stable and unstable fixed points.

Since K varies between 0 and ∞, we can change variables using the identity tanh(K') = (e^{2K'} − 1)/(e^{2K'} + 1), which turns Eq. B.4 into

tanh(K') = tanh²(K).    (B.6)

This change of variables also changes the domain of the RG flow diagram, as shown in Fig. B.3.

Figure B.3: RG flow: (a) the coupling space in the rescaled domain from 0 to 1, (b) in the presence of an external magnetic field h. The arrows show the flow direction; the blue lines between the two limits (K = 0, ∞) depict the flow, and they end up on the vertical axis h where K = 0. The "×" signs on the vertical axis at K = 0 mark the fixed points.

If we introduce an external magnetic field h, then we have one more RG transformation equation corresponding to it. The flow diagram in the presence of h is shown in Fig. B.3. From the flow diagram one can see that, starting with any pair (h, K), as we rescale the system we essentially flow to independent spins in an effective magnetic field.

The RG is still useful and illustrative even though no critical behavior exists in this system. The RG transformation does not change the long-distance physics, so ξ must be the same, although the lattice spacing is increased by a factor of 2 (in this example). At larger values of K the correlation length is also large. Under one step of the RG transformation the correlation length, measured in units of the new lattice spacing, becomes ξ' = ξ/2, and after l steps ξ(h, e^{−K}) = 2^l ξ(2^l h, 2^{l/2} e^{−K}).


B.2 Training DBN for the 2D Ising model

The MATLAB code used for the training is given in the supplementary material of [16]. First, Ising-model samples are generated, and only the unsupervised learning phase is considered. The RBM is trained layer by layer with 200 epochs, momentum 0.5 and mini-batches of 100 examples. Regularization is performed by keeping the weight cost at 0.0002.

The effective receptive field (ERF) is a way to estimate the effect of a spin in the visible layer on a spin in a hidden layer. We denote the ERF matrix of layer l by r^{(l)}, with l = 0 for the visible layer; it is computed as r^{(1)} = W^{(1)} and iterated as r^{(l)} = r^{(l−1)} W^{(l)}.
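The training itself was done with the MATLAB code of [16]; purely as an illustration of the ERF iteration, here is a minimal Python sketch with assumed layer sizes and random weights:

```python
import numpy as np

# Minimal sketch (illustrative shapes only): compute the effective receptive
# field r^(l) of each layer by iterating r^(1) = W^(1), r^(l) = r^(l-1) W^(l).
layer_sizes = [64, 32, 16, 8]                # assumed visible layer + 3 hidden layers
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(layer_sizes[i], layer_sizes[i + 1]))
           for i in range(len(layer_sizes) - 1)]

erf = [weights[0]]                            # r^(1) = W^(1)
for W in weights[1:]:
    erf.append(erf[-1] @ W)                   # r^(l) = r^(l-1) W^(l)

# erf[l-1][i, j] estimates how visible spin i influences unit j in hidden layer l.
for l, r in enumerate(erf, start=1):
    print(f"r^({l}) shape:", r.shape)
```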


BIBLIOGRAPHY

[1] Carleo, G., Troyer, M., 2017. Solving the quantum many-body problem with artificial

neural networks. Science 355, 602–606. https://doi.org/10.1126/science.aag2302

[2] Deng, D.-L., Li, X., Das Sarma, S., 2017. Machine learning topological states. Physi-

cal Review B 96. https://doi.org/10.1103/PhysRevB.96.195145

[3] Deng, D.-L., Li, X., Das Sarma, S., 2017. Quantum Entanglement in Neural Network

States. Phys. Rev. X 7, 021021. https://doi.org/10.1103/PhysRevX.7.021021

[4] Torlai, G., Melko, R.G., 2016. Learning thermodynamics with Boltzmann machines.

Phys. Rev. B 94, 165134. https://doi.org/10.1103/PhysRevB.94.165134

[5] Mehta, P., Schwab, D.J., 2014. An exact mapping between the Variational Renor-

malization Group and Deep Learning. arXiv:1410.3831 [cond-mat, stat].

[6] Chen, J., Cheng, S., Xie, H., Wang, L., Xiang, T., 2018. Equivalence of restricted

Boltzmann machines and tensor network states. Phys. Rev. B 97, 085104.

https://doi.org/10.1103/PhysRevB.97.085104

[7] Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. The MIT Press,

Cambridge, Massachusetts.

[8] Bengio, Y., 2009. Learning Deep Architectures for AI. MAL 2, 1–127.

https://doi.org/10.1561/2200000006 .

[9] Orus, R., 2014. A Practical Introduction to Tensor Networks: Matrix Product

States and Projected Entangled Pair States. Annals of Physics 349, 117–158.

https://doi.org/10.1016/j.aop.2014.06.013 .

[10] Bridgeman, J.C., Chubb, C.T., 2017. Hand-waving and Interpretive Dance: An

Introductory Course on Tensor Networks. Journal of Physics A: Mathematical

and Theoretical 50, 223001. https://doi.org/10.1088/1751-8121/aa6dc3 .


[11] UCI Machine Learning Repository [WWW Document], n.d. URL

https://archive.ics.uci.edu/ml/index.php (accessed 4.25.18).

[12] MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris

Burges [WWW Document], n.d. URL http://yann.lecun.com/exdb/mnist/ (ac-

cessed 4.25.18).

[13] Vapnik, V., 2000. The Nature of Statistical Learning Theory, 2nd ed, Information

Science and Statistics. Springer-Verlag, New York.

[14] Larochelle, H., Bengio, Y., 2008. Classification Using Discriminative Restricted

Boltzmann Machines, in: Proceedings of the 25th International Conference

on Machine Learning, ICML ’08. ACM, New York, NY, USA, pp. 536–543.

https://doi.org/10.1145/1390156.1390224.

[15] Ackley, D.H., Hinton, G.E., Sejnowski, T.J., 1985. A learning algorithm for boltz-

mann machines. Cognitive Science 9, 147–169. https://doi.org/10.1016/S0364-

0213(85)80012-4.

[16] Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the Dimension-

ality of Data with Neural Networks. Science 313, 504–507.

https://doi.org/10.1126/science.1127647.

[17] Hinton, G.E., Osindero, S., Teh, Y.-W., 2006. A Fast Learning Algo-

rithm for Deep Belief Nets. Neural Computation 18, 1527–1554.

https://doi.org/10.1162/neco.2006.18.7.1527.

[18] Google Code Archive - Long-term storage for Google Code Project Hosting. [WWW

Document], n.d. URL https://code.google.com/archive/p/matrbm/.

[19] You, Y.-Z., Yang, Z., Qi, X.-L., 2018. Machine learning spatial ge-

ometry from entanglement features. Phys. Rev. B 97, 045153.

https://doi.org/10.1103/PhysRevB.97.045153.

[20] Vidal, G., Latorre, J.I., Rico, E., Kitaev, A., 2003. Entanglement

in Quantum Critical Phenomena. Phys. Rev. Lett. 90, 227902.

https://doi.org/10.1103/PhysRevLett.90.227902

[21] Poulin, D., Qarry, A., Somma, R.D., Verstraete, F., 2011. Quantum simulation

of time-dependent Hamiltonians and the convenient illusion of Hilbert space.

Physical Review Letters 106. https://doi.org/10.1103/PhysRevLett.106.170501.


[22] Vidal, G., Latorre, J.I., Rico, E., Kitaev, A., 2003. Entanglement

in Quantum Critical Phenomena. Phys. Rev. Lett. 90, 227902.

https://doi.org/10.1103/PhysRevLett.90.227902.

[23] White, S.R., 1992. Density matrix formulation for quantum renormalization groups.

Phys. Rev. Lett. 69, 2863–2866. https://doi.org/10.1103/PhysRevLett.69.2863.

[24] Cardy, J., 1996. Scaling and Renormalization in Statistical Physics. Cambridge

University Press.

[25] Chandler, D., 1987. Introduction to Modern Statistical Mechanics.

[26] Kardar, M., 2007. Statistical Physics of Fields [WWW Document]. Cambridge Core.

https://doi.org/10.1017/CBO9780511815881 .

[27] Chen, J., 2018. rbm2mps: MPS representation of RBM.

https://github.com/yzcj105/rbm2mps.

[28] Orus, R., 2014. A practical introduction to tensor networks: Matrix product

states and projected entangled pair states. Annals of Physics 349, 117–158.

https://doi.org/10.1016/j.aop.2014.06.013 .

[29] Schollwock, U., 2011. The density-matrix renormalization group in the age of matrix

product states. Annals of Physics, January 2011 Special Issue 326, 96–192.

https://doi.org/10.1016/j.aop.2010.09.012 .

[30] Barthel, T., Kliesch, M., Eisert, J., 2010. Real-Space Renormaliza-

tion Yields Finite Correlations. Phys. Rev. Lett. 105, 010502.

https://doi.org/10.1103/PhysRevLett.105.010502.

[31] Evenbly, G., Vidal, G., 2014. Class of Highly Entangled Many-Body

States that can be Efficiently Simulated. Phys. Rev. Lett. 112, 240502.

https://doi.org/10.1103/PhysRevLett.112.240502.

[32] Vidal, G., 2007. Entanglement Renormalization. Phys. Rev. Lett. 99, 220405.

https://doi.org/10.1103/PhysRevLett.99.220405.

[33] Le Roux, N., Bengio, Y., 2008. Representational Power of Restricted Boltzmann

Machines and Deep Belief Networks. Neural Computation 20, 1631–1649.

https://doi.org/10.1162/neco.2008.04-07-510.


[34] Tensor rank decomposition, 2018. Wikipedia.

[35] Kolda, T., Bader, B., 2009. Tensor Decompositions and Applications. SIAM Rev. 51,

455–500. https://doi.org/10.1137/07070111X.

[36] Han, Z.-Y., Wang, J., Fan, H., Wang, L., Zhang, P., 2017. Unsupervised Gener-

ative Modeling Using Matrix Product States. arXiv:1709.01662 [cond-mat,

physics:quant-ph, stat].

[37] Mocanu, D.C., Mocanu, E., Nguyen, P.H., Gibescu, M., Liotta, A., 2016. A topolog-

ical insight into restricted Boltzmann machines. Mach Learn 104, 243–270.

https://doi.org/10.1007/s10994-016-5570-z.

[38] Stoudenmire, E., Schwab, D.J., 2016. Supervised Learning with Tensor Networks,

in: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (Eds.), Ad-

vances in Neural Information Processing Systems 29. Curran Associates, Inc.,

pp. 4799–4807.
