Group Meeting - September 04, 2020 - University of Chicago



Group Meeting - September 04, 2020: Paper review & Research progress

Truong Son Hy

Department of Computer Science, The University of Chicago

Ryerson Physical Lab


Marcus Aurelius

The happiness of your life depends upon the quality of your thoughts.

February 2020, The Art Institute of Chicago


Buddha

1 Meditation brings wisdom; lack of meditation leaves ignorance. Know well what leads you forward and what holds you back, and choose the path that leads to wisdom.

2 A monk is simply a traveler, except the journey is inwards.


Paper

Graph Representation Learning, William L. Hamilton (McGill University, 2020). https://www.cs.mcgill.ca/~wlh/grl_book/

Chapter 8. Traditional graph generation approaches.

Chapter 9. Deep Generative Models.


Traditional graph generation approaches (1)

Erdos-Renyi (ER) Model:

$$p(A_{u,v} = 1) = r, \quad \forall u, v \in \mathcal{V},\ u \neq v$$

where r ∈ [0, 1] is a parameter controlling the density of the graph. Edges are generated independently.
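As a quick illustration, here is a minimal sketch of sampling an ER graph, assuming NumPy; the function name and the choice of returning a dense symmetric adjacency matrix are illustrative assumptions, not part of the original model description.

```python
import numpy as np

def sample_er_graph(num_nodes, r, rng=np.random.default_rng(0)):
    """Each possible edge (u, v), u != v, appears independently with probability r."""
    upper = rng.random((num_nodes, num_nodes)) < r   # independent Bernoulli(r) draws
    A = np.triu(upper, k=1)                          # keep the strict upper triangle (no self-loops)
    return (A | A.T).astype(int)                     # symmetrize for an undirected graph

A = sample_er_graph(num_nodes=10, r=0.3)
```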

Son (UChicago) Group Meeting September 04, 2020 5 / 25

Traditional graph generation approaches (2)

Stochastic Block Models (SBM):

We specify a number γ as the number of different blocks {C_1, .., C_γ}. Every node u ∈ V has a probability p_i of belonging to block i:

$$p_i = p(u \in C_i), \qquad \sum_{i=1}^{\gamma} p_i = 1$$

Block-to-block edge probabilities are specified by a matrix C ∈ [0, 1]^{γ×γ}, where C_{i,j} is the probability of an edge occurring between a node in block C_i and a node in block C_j. The SBM is able to generate graphs that exhibit community structure.
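A minimal sketch of SBM sampling, assuming NumPy; the block probabilities p and the block-to-block matrix C passed in the example are toy values.

```python
import numpy as np

def sample_sbm(num_nodes, p, C, rng=np.random.default_rng(0)):
    """p: length-gamma block membership probabilities; C: gamma x gamma edge probabilities."""
    blocks = rng.choice(len(p), size=num_nodes, p=p)        # assign each node to a block
    edge_prob = C[blocks[:, None], blocks[None, :]]         # pairwise probability C[b_u, b_v]
    upper = rng.random((num_nodes, num_nodes)) < edge_prob  # independent Bernoulli draw per pair
    A = np.triu(upper, k=1)
    return (A | A.T).astype(int), blocks

A, blocks = sample_sbm(20, p=np.array([0.5, 0.5]),
                       C=np.array([[0.8, 0.05], [0.05, 0.8]]))
```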


Traditional graph generation approaches (3)

Preferential Attachment (PA) [Albert and Barabasi, 2002]:

Assumption: Real-world graphs exhibit a power-law degree distribution. The probability of a node u having degree d_u is:

$$p(d_u = k) \propto k^{-\alpha}, \quad \alpha > 1$$

Intuitively, the heavy-tailed power-law distribution leads to a large number of nodes with small degrees, but a small number of nodes with extremely large degrees.


Traditional graph generation approaches (4)

Preferential Attachment (PA) [Albert and Barabasi, 2002]:

Generative process:

1 Initialize a fully connected graph with m_0 nodes.

2 Iteratively add n − m_0 nodes to this graph. At iteration t, we add a new node u to the graph, and we connect u to m < m_0 existing nodes according to the following distribution:

$$p(A_{u,v} = 1) = \frac{d_v^{(t)}}{\sum_{v' \in \mathcal{V}^{(t)}} d_{v'}^{(t)}}$$

where d_v^{(t)} denotes the degree of node v at iteration t and V^{(t)} denotes the set of nodes that have been added to the graph up to iteration t.

The generation process is autoregressive (an iterative approach). I think it is similar to a Dirichlet process, in which high-degree nodes tend to have more connections.
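A minimal sketch of the preferential-attachment process above, assuming NumPy; sampling m distinct target nodes per step is one common variant and an assumption on my part.

```python
import numpy as np

def sample_pa_graph(n, m0, m, rng=np.random.default_rng(0)):
    """Fully connected seed on m0 nodes, then attach each new node to m existing nodes
    with probability proportional to their current degree."""
    A = np.zeros((n, n), dtype=int)
    A[:m0, :m0] = 1 - np.eye(m0, dtype=int)             # fully connected seed graph
    for u in range(m0, n):                               # add the remaining n - m0 nodes
        degrees = A[:u, :u].sum(axis=1).astype(float)
        probs = degrees / degrees.sum()                   # attachment proportional to d_v^(t)
        targets = rng.choice(u, size=m, replace=False, p=probs)
        A[u, targets] = A[targets, u] = 1
    return A

A = sample_pa_graph(n=50, m0=5, m=3)
```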


Deep Generative Models

Traditional methods

Limitation:

1 Rely on a fixed, hand-crafted generation process.

2 Lack the ability to learn a generative model from data.

Generative models

Goal: Design models that can observe a set of graphs {G_1, .., G_n} and learn to generate graphs with characteristics similar to this training set.

All-at-once: VAEs, GANs.

Autoregressive (generate a graph incrementally): LSTM-based language models, GraphRNN, GRAN, etc.


Variational Autoencoder Approaches (1)

Main components:

1 Probabilistic encoder q_θ(Z|G) defines a distribution over latent representations. We specify the latent conditional distribution as Z ∼ N(µ_θ(G), σ_θ(G)), where µ_θ and σ_θ are neural networks that generate the mean and variance parameters of a normal distribution (from which we sample the latent embeddings Z).

2 Probabilistic decoder p_θ(A|Z), from which we can sample realistic graphs (i.e. adjacency matrices) A ∼ p_θ(A|Z) by conditioning on a latent variable Z.

3 Prior distribution p(Z) over the latent space. Assume Z ∼ N(0, 1).


Variational Autoencoder Approaches (2)

Motivation

Motivated by the theory of variational inference [Wainwright and Jordan, 2008] (which I presented 3 weeks ago).

Given a set of training graphs {G_1, .., G_n}, we can train a VAE model by maximizing the evidence likelihood lower bound (ELBO):

$$\mathcal{L} = \sum_{G_i \in \{G_1, .., G_n\}} \mathbb{E}_{q_\theta(Z|G_i)}\big[p_\theta(G_i|Z)\big] - \mathrm{KL}\big(q_\theta(Z|G_i)\,\|\,p(Z)\big)$$

Basic idea:

Maximize the reconstruction ability of our decoder: E_{q_θ(Z|G_i)}[p_θ(G_i|Z)].

Minimize the KL-divergence between our posterior latent distribution q_θ(Z|G_i) and the prior p(Z).
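A minimal sketch of the corresponding training loss (the negative ELBO) for a Gaussian posterior and Bernoulli edge probabilities, assuming PyTorch; the argument names are placeholders, and using a log-likelihood reconstruction term is a standard practical choice rather than something stated on the slide.

```python
import torch
import torch.nn.functional as F

def negative_elbo(edge_logits, target_adj, mu, log_sigma):
    """Reconstruction term plus KL(q_theta(Z|G) || N(0, I)); minimizing this maximizes the ELBO."""
    recon = F.binary_cross_entropy_with_logits(edge_logits, target_adj, reduction="sum")
    kl = -0.5 * torch.sum(1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp())
    return recon + kl
```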


Variational Autoencoder Approaches (3)


Node-level Latents – VGAE (1)

Key idea:

First proposed by [Kipf and Welling, 2016]: Variational Graph Autoencoder (VGAE).

However, the authors proposed the VGAE model as an approach to generate node embeddings; they did not intend it as a generative model for sampling new graphs.

The encoder generates a latent representation for each node in the graph.

The decoder takes pairs of node embeddings as input and uses them to predict the likelihood of an edge between the two nodes.


Node-level Latents – VGAE (2)

The encoder has two separate GNNs to generate the mean and variance parameters, conditioned on the input:

$$\mu_Z = \mathrm{GNN}_\mu(A, X) \in \mathbb{R}^{|\mathcal{V}| \times d}$$

$$\log \sigma_Z = \mathrm{GNN}_\sigma(A, X) \in \mathbb{R}^{|\mathcal{V}| \times d}$$

Reparameterization trick to sample a set of latent node embeddings:

$$Z = \epsilon \odot \exp(\log \sigma_Z) + \mu_Z \in \mathbb{R}^{|\mathcal{V}| \times d}, \quad \epsilon \sim \mathcal{N}(0, 1)$$

[Kipf and Welling, 2016] employs a simple dot-product decoder:

$$p_\theta(A_{u,v} = 1 \mid z_u, z_v) = \gamma(z_u^\top z_v)$$

$$p_\theta(G \mid Z) = \prod_{(u,v) \in \mathcal{V}^2} p_\theta(A_{u,v} = 1 \mid z_u, z_v)$$
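A minimal sketch of a VGAE forward pass, assuming PyTorch; gnn_mu and gnn_sigma stand in for any GNN encoder, and γ is taken to be the logistic sigmoid, both of which are assumptions on my part.

```python
import torch

def vgae_forward(A, X, gnn_mu, gnn_sigma):
    mu = gnn_mu(A, X)                        # |V| x d matrix of posterior means
    log_sigma = gnn_sigma(A, X)              # |V| x d matrix of posterior log standard deviations
    eps = torch.randn_like(mu)
    Z = eps * log_sigma.exp() + mu           # reparameterization trick
    edge_probs = torch.sigmoid(Z @ Z.T)      # dot-product decoder: p(A_uv = 1 | z_u, z_v)
    return edge_probs, mu, log_sigma
```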


Graph-level Latents – GraphVAE (1)

Proposed by Simonovsky and Komodakis, 2018:

GraphVAE has only a single graph-level embedding:

$$z_G \sim \mathcal{N}(\mu_{z_G}, \sigma_{z_G})$$

In contrast, VGAE defines a posterior distribution for each node.

The encoder:

$$\mu_{z_G} = \mathrm{POOL}_\mu(\mathrm{GNN}_\mu(A, X)) \in \mathbb{R}^{d}$$

$$\sigma_{z_G} = \mathrm{POOL}_\sigma(\mathrm{GNN}_\sigma(A, X)) \in \mathbb{R}^{d}$$

where POOL : R^{|V|×d} → R^d denotes a pooling function that maps a matrix of node embeddings Z ∈ R^{|V|×d} to a graph-level embedding vector z_G ∈ R^d.


Graph-level Latents – GraphVAE (2)

With a Bernoulli distributional assumption, the decoder uses an MLP to map the latent vector z_G to a matrix Ã ∈ [0, 1]^{|V|×|V|} of edge probabilities:

$$\tilde{A} = \gamma(\mathrm{MLP}(z_G))$$

The reconstruction likelihood:

$$p_\theta(G \mid z_G) = \prod_{(u,v) \in \mathcal{V} \times \mathcal{V}} \tilde{A}_{u,v} A_{u,v} + (1 - \tilde{A}_{u,v})(1 - A_{u,v})$$

where A denotes the true adjacency matrix and Ã denotes the predicted edge probabilities.
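A minimal sketch of the GraphVAE decoder and the edge-wise likelihood above, assuming PyTorch; the MLP module and the flattened-logits layout are placeholders of my own.

```python
import torch

def graphvae_decode(z_G, mlp, num_nodes):
    logits = mlp(z_G).view(num_nodes, num_nodes)     # graph-level latent -> |V| x |V| edge logits
    return torch.sigmoid(logits)                      # predicted edge probabilities A_tilde

def reconstruction_likelihood(A_true, A_tilde):
    """Product over node pairs of A_tilde where an edge exists and (1 - A_tilde) where it does not."""
    per_pair = A_true * A_tilde + (1 - A_true) * (1 - A_tilde)
    return per_pair.prod()
```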


Graph-level Latents – GraphVAE (3)

Key challenges:

1 Assumption of independent Bernoulli distributions for each edge → Graphical models!

2 Assumption of a fixed number of nodes → Specify the empirical distribution of graph sizes from the training data.

3 We do not know the correct ordering (graph matching):

$$p_\theta(G \mid z_G) = \max_{\pi \in \Pi} \prod_{(u,v) \in \mathcal{V} \times \mathcal{V}} \tilde{A}^{\pi}_{u,v} A_{u,v} + (1 - \tilde{A}^{\pi}_{u,v})(1 - A_{u,v})$$

Specify a particular ordering function π:

$$p_\theta(G \mid z_G) \approx \prod_{(u,v) \in \mathcal{V} \times \mathcal{V}} \tilde{A}^{\pi}_{u,v} A_{u,v} + (1 - \tilde{A}^{\pi}_{u,v})(1 - A_{u,v})$$

or, a bit smarter, sum over a set of orderings (sketched below):

$$p_\theta(G \mid z_G) \approx \sum_{\pi_i \in \{\pi_1, .., \pi_n\}} \prod_{(u,v) \in \mathcal{V} \times \mathcal{V}} \tilde{A}^{\pi_i}_{u,v} A_{u,v} + (1 - \tilde{A}^{\pi_i}_{u,v})(1 - A_{u,v})$$
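A minimal sketch of that last approximation, evaluating the ordering-dependent likelihood over a few node permutations, assuming NumPy; in practice the orderings {π_1, .., π_n} would come from a heuristic rather than the uniform sampling used here.

```python
import numpy as np

def ordered_likelihood(A_true, A_pred, perm):
    A_perm = A_pred[np.ix_(perm, perm)]                  # predicted adjacency under ordering pi
    per_pair = A_perm * A_true + (1 - A_perm) * (1 - A_true)
    return per_pair.prod()

def approx_likelihood(A_true, A_pred, num_orderings=5, rng=np.random.default_rng(0)):
    n = A_true.shape[0]
    perms = [rng.permutation(n) for _ in range(num_orderings)]
    return sum(ordered_likelihood(A_true, A_pred, p) for p in perms)
```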


Adversarial Approaches (1)

Basic idea of GAN-based generative models:

Define a trainable generator network g_θ : R^d → X that takes a random seed z ∈ R^d as input and generates realistic (but fake) data samples x̃ ∈ X.

Define a discriminator network d_φ : X → [0, 1] that outputs the probability that a given input is fake (distinguishing between real data samples x ∈ X and samples x̃ ∈ X produced by the generator).

Both the generator and the discriminator are optimized jointly in an adversarial game:

$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log(1 - d_\phi(x))\big] + \mathbb{E}_{z \sim p_{\mathrm{seed}}(z)}\big[\log(d_\phi(g_\theta(z)))\big]$$

where p_data(x) denotes the empirical distribution of real data samples and p_seed(z) denotes the random seed distribution (e.g. a standard multivariate normal).
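A minimal sketch of evaluating this objective for one batch, assuming PyTorch; the generator and discriminator are placeholder modules, and the discriminator is taken to output the probability that its input is fake, matching the formula above.

```python
import torch

def adversarial_objective(generator, discriminator, real_batch, seed_dim):
    z = torch.randn(real_batch.size(0), seed_dim)        # z ~ p_seed
    fake_batch = generator(z)
    d_real = discriminator(real_batch)                    # probability that a real sample is fake
    d_fake = discriminator(fake_batch)                    # probability that a generated sample is fake
    objective = torch.log(1 - d_real).mean() + torch.log(d_fake).mean()
    return objective                                      # maximized over phi, minimized over theta
```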


Adversarial Approaches (2)

[De Cao and Kipf, 2018] (less successful than VAE-based methods): the generator produces a matrix of edge probabilities given a seed vector z:

$$\tilde{A} = \gamma(\mathrm{MLP}(z))$$

Given this matrix of edge probabilities, we generate a discrete adjacency matrix A ∈ Z^{|V|×|V|} by sampling an independent Bernoulli variable for each edge, A_{u,v} ∼ Bernoulli(Ã_{u,v}). The discriminator is any GNN-based graph classification model.
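A minimal sketch of this discretization step, assuming PyTorch; edge_probs stands in for the generator output and is just random noise here, and the symmetrization is an assumption for the undirected case.

```python
import torch

edge_probs = torch.sigmoid(torch.randn(8, 8))             # stand-in for gamma(MLP(z))
A_discrete = torch.bernoulli(edge_probs).int()             # A_{u,v} ~ Bernoulli(edge_probs_{u,v})
A_discrete = torch.triu(A_discrete, diagonal=1)            # keep one direction, drop self-loops
A_discrete = A_discrete + A_discrete.T                     # symmetrize for an undirected graph
```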

Advantages and disadvantages:

1 GAN-based models do not require any node ordering (unlike VAE-based models).

2 The minimax optimization is difficult to train.


Autoregressive Models (1)

Previously, we defined the likelihood of a graph given a latent representation z by decomposing the overall likelihood into a set of independent edge likelihoods:

$$p(G \mid z) = \prod_{(u,v) \in \mathcal{V} \times \mathcal{V}} p(A_{u,v} \mid z)$$

To model edge dependencies, the autoregressive approach assumes that edges are generated sequentially and that the likelihood of each edge can be conditioned on the edges that have been previously generated:

$$p(G \mid z) = \prod_{i=1}^{|\mathcal{V}|} p(L_{v_i} \mid L_{v_1}, .., L_{v_{i-1}}, z)$$

where L denotes the lower-triangular portion of the adjacency matrix A.


Autoregressive Models (2)

[You et al., 2018] GraphRNN:

1 The graph-level RNN maintains a hidden state h_i, which is updated after generating each row L_{v_i} of the adjacency matrix:

$$h_{i+1} = \mathrm{RNN}_{\mathrm{graph}}(h_i, L_{v_i})$$

2 The node-level RNN generates the entries of L_{v_i} in an autoregressive manner. RNN_node takes the graph-level hidden state h_i as its initial input and then sequentially generates the binary values of L_{v_i}, assuming a conditional Bernoulli distribution for each entry (see the sketch below).
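A minimal sketch of this two-level recurrence, assuming PyTorch; the rows are padded to a fixed length M (in the spirit of GraphRNN's BFS-window trick), and the GRU cells, linear output head, and sizes are all toy placeholders rather than the paper's exact architecture.

```python
import torch

M, hidden = 8, 16                                   # max row length and hidden size (toy values)
graph_rnn = torch.nn.GRUCell(M, hidden)             # RNN_graph: updates h_i from the generated row L_{v_i}
node_rnn = torch.nn.GRUCell(1, hidden)              # RNN_node: generates the entries of L_{v_i} one at a time
edge_head = torch.nn.Linear(hidden, 1)

h_graph = torch.zeros(1, hidden)                    # graph-level hidden state h_i
rows = []
for i in range(M):
    h_node = h_graph.clone()                        # node-level RNN starts from the graph-level state
    row, prev_bit = [], torch.zeros(1, 1)
    for _ in range(M):                              # padded row length
        h_node = node_rnn(prev_bit, h_node)
        prob = torch.sigmoid(edge_head(h_node))     # conditional Bernoulli parameter for this entry
        prev_bit = torch.bernoulli(prob)
        row.append(prev_bit)
    row_vec = torch.cat(row, dim=1)                 # the generated (padded) row L_{v_i}
    h_graph = graph_rnn(row_vec, h_graph)           # h_{i+1} = RNN_graph(h_i, L_{v_i})
    rows.append(row_vec)
```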


Autoregressive Models (3)

[Liao et al., 2019] GRAN (Graph Recurrent Attention Networks) models the conditional distribution of each row of the adjacency matrix by running a GNN on the graph that has been generated so far.


Evaluating graph generation (1)

[Liao et al., 2019] The current practice is to analyze different statistics of the generated graphs, and to compare the distribution of statistics for the generated graphs to a test set.

Assume we have a set of graph statistics S = (s_1, s_2, .., s_n), where each statistic s_{i,G} : R → [0, 1] is assumed to define a univariate distribution over R for a given graph G. For example:

Degree distribution

Distribution of clustering coefficients

Distribution of different motifs or graphlets


Evaluating graph generation (2)

Given a particular statistic s_i, the total variation distance between the statistics of a test graph, s_{i,G_test}, and a generated graph, s_{i,G_gen}, is:

$$d(s_{i,G_{\mathrm{test}}}, s_{i,G_{\mathrm{gen}}}) = \sup_{x \in \mathbb{R}} \big| s_{i,G_{\mathrm{test}}}(x) - s_{i,G_{\mathrm{gen}}}(x) \big|$$

To measure performance, we compute the average pairwise distributional distance between a set of generated graphs and the graphs in a test set (e.g. using the Wasserstein distance).
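A minimal sketch of comparing the degree distributions of two graphs, assuming NumPy; it uses the discrete form of the total variation distance, 0.5 · Σ|p − q|, rather than the sup formulation above, and the max_degree cap is an assumption to keep the two histograms on a shared support.

```python
import numpy as np

def degree_distribution(A, max_degree):
    degrees = np.minimum(A.sum(axis=1).astype(int), max_degree)   # clip so histograms share a support
    hist = np.bincount(degrees, minlength=max_degree + 1)
    return hist / hist.sum()

def tv_distance(p, q):
    """Total variation distance between two discrete distributions on the same support."""
    return 0.5 * np.abs(p - q).sum()

# Example (A_test and A_gen are adjacency matrices assumed to be given):
# d_degree = tv_distance(degree_distribution(A_test, 20), degree_distribution(A_gen, 20))
```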

Molecule generation (special case):

Graph structures need to be both valid (e.g. chemically stable) and ideally have some desirable properties (e.g. medicinal properties or solubility).

Unlike general graph generation, molecule generation can benefit substantially from domain-specific knowledge for both model design and evaluation strategies.


Evaluating graph generation (3)
