
Network embedding Cheng Zheng


Page 1:

Network embedding
Cheng Zheng

Page 2:

Outline

● Problem definition
● Factorization based algorithms --- Laplacian Eigenmaps (NIPS, 2001)
● Random walk based algorithms --- DeepWalk (KDD, 2014), node2vec (KDD, 2016)
● Deep learning based algorithms --- SDNE (KDD, 2016), etc.
● Summary

Page 3:

Network embedding definition

Given a graph G = (V, E) with node set V = {v1, ..., vn}, edge set E = {eij}, and a weight sij associated with each edge (W = {sij}),

a network embedding is a mapping f: vi → yi ∈ R^d such that d << |V|.

DeepWalk, KDD 2014
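To make the definition concrete, a minimal sketch (illustrative only; the toy nodes and random vectors are made up): an embedding is just a |V| x d table of vectors plus the mapping from nodes to rows.

```python
# A minimal illustration of the definition above (not any particular algorithm):
# an embedding stores one d-dimensional vector y_i per node v_i, with d << |V|.
import numpy as np

nodes = ["v1", "v2", "v3", "v4"]                 # V
d = 2                                            # embedding dimension (d << |V| in practice)
Y = np.random.default_rng(0).normal(size=(len(nodes), d))

f = {v: Y[i] for i, v in enumerate(nodes)}       # the mapping f: v_i -> y_i in R^d
similarity = float(f["v1"] @ f["v2"])            # arithmetic on embeddings, e.g. dot-product similarity
```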

Page 4:

Motivation of network embedding

● As with word embeddings, a distributed representation is better than a one-hot representation: arithmetic operations, such as similarity measures, become possible.

● For graphs, the adjacency matrix tends to be sparse, and high-dimensional adjacency vectors can be noisy.

● Dimensionality reduction benefits downstream supervised tasks.

Thanks to Yunsheng and Qin for discussion!

Page 5:

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Page 6:

Challenges

Structure to preserve: what is a good vector representation? 1st-order proximity, 2nd-order proximity, or global structure?

Sparsity of real-world graph data.

Scalability: real-world graphs are extremely large.

Page 7:

Applications

Unsupervised representation learning:

● Graph visualization
● Node classification
● Link prediction
● Graph reconstruction

These tasks also serve as evaluation methods for embeddings.

DeepWalk, KDD 2014; Wenchao Yu et al., IJCAI 2017

Page 8:

Outline

● Problem definition
● Factorization based algorithms --- Laplacian Eigenmaps (NIPS, 2001)
● Random walk based algorithms --- DeepWalk (KDD, 2014), node2vec (KDD, 2016)
● Deep learning based algorithms --- SDNE (KDD, 2016), etc.
● Summary

Laplacian Eigenmaps, NIPS 2002; Neural Computation, 2003

Page 9:

Matrix Factorization

Graph can be naturally represented as matrices.

● Adjacency matrix: A
● Laplacian matrix: L = D - W, where W is the weight matrix and D is the degree matrix
● Node transition probability matrix

Laplacian Eigenmaps (LE):

Originally introduced for dimensionality reduction of high-dimensional data.
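For concreteness, toy versions of these matrices (a sketch with a made-up 4-node graph):

```python
# Toy versions of the matrices listed above, assuming a small undirected weighted
# graph given by its symmetric weight matrix W (all values here are illustrative).
import numpy as np

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])               # adjacency / weight matrix

D = np.diag(W.sum(axis=1))                     # degree matrix, D_ii = sum_j W_ij
L = D - W                                      # unnormalized graph Laplacian
P = W / W.sum(axis=1, keepdims=True)           # node transition probability matrix (row-stochastic)
```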

Page 10:

LE Algorithm

Input: undirected graph G(V, E, W), where W is the weight matrix and Wij is the weight between node i and node j.
Compute: degree matrix Dii = Σj Wji and graph Laplacian L = D - W.
Compute eigenvalues and eigenvectors of the generalized eigenvector problem:

L f = λ D f

Let f0, ..., fn-1 be the eigenvectors, sorted by their eigenvalues 0 = λ0 ≤ λ1 ≤ ... ≤ λn-1.

Use the next m eigenvectors (f1, f2, ..., fm) for the embedding in m-dimensional space (m << n).
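A compact sketch of these steps (illustrative only; the toy matrix is made up, and a large graph would need a sparse eigensolver instead of a dense one):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(W, m):
    """W: symmetric weight matrix of an undirected graph; m: embedding dimension."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Solve the generalized eigenproblem L f = lambda D f; eigh returns
    # eigenvalues in ascending order 0 = lambda_0 <= lambda_1 <= ...
    eigenvalues, eigenvectors = eigh(L, D)
    # Skip the trivial constant eigenvector f_0 and take the next m eigenvectors.
    return eigenvectors[:, 1:m + 1]

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
Y = laplacian_eigenmaps(W, m=2)   # one 2-d embedding per node
```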

Page 11:

Why is LE a good embedding?

The objective function of LE (assuming a 1-d embedding y):

min_y Σi,j Wij (yi - yj)² = min_y 2 y^T L y,   subject to y^T D y = 1

There is a heavy penalty if neighboring points i and j have embeddings yi and yj that are far apart.

L is a positive semidefinite matrix, so the minimizer is the smallest nontrivial solution of L f = λ D f: take f1 as the embedding.

For an m-dimensional embedding, take (f1, f2, ..., fm).

Page 12:

Problem with LE

Works only for undirected graphs.

For a directed graph, the graph Laplacian is not symmetric positive semidefinite.

Scalability: eigendecomposition costs more than O(n²).

Page 13:

Other matrix factorization methods

● Locally Linear Embedding (LLE, Science 2000)

● Graph Factorization (GF, WWW 2013)

The first graph embedding method that runs in O(|E|) time.

Page 14:

Outline

● Problem definition
● Factorization based algorithms --- Laplacian Eigenmaps (NIPS, 2001)
● Random walk based algorithms --- DeepWalk (KDD, 2014), node2vec (KDD, 2016)
● Deep learning based algorithms --- SDNE (KDD, 2016), etc.
● Summary

DeepWalk, KDD 2014; node2vec, KDD 2016

Page 15:

Similarity between graph and language

In a scale-free graph, random walks play the role that sentences play in a language.

Page 16:

Power laws

A scale-free network is a network whose degree distribution follows a power law (Zipf's Law).

http://konect.uni-koblenz.de/

[Figure: degree distributions of a web tracker graph (68M nodes) and Twitter (41M nodes)]

Vertex frequency in random walks on scale free graphs also follows a power law.

Short random walks = sentences
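A quick way to see this empirically (a sketch assuming networkx; the graph size and walk length are arbitrary):

```python
# In a scale-free graph, vertex visit counts from short random walks are
# heavy-tailed, much like word frequencies in a corpus.
import random
from collections import Counter
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=0)   # scale-free graph
rng = random.Random(0)

counts = Counter()
for start in G.nodes():
    v = start
    for _ in range(10):                                # short truncated walk
        v = rng.choice(list(G.neighbors(v)))
        counts[v] += 1

print(counts.most_common(5))   # a few hub vertices dominate, as a power law predicts
```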

Page 17:

Language modeling

Learning a representation means learning a mapping function from word co-occurrence

[Rumelhart+, 2003]

We hope that the learned representations capture inherent structure

The goal is to estimate the likelihood of observing vertex vi given all the previous vertices visited so far in the random walk.

[Figure: vertices mapped into the representation space]

SkipGram language model, proposed in word2vec (NIPS, 2013).
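Written out in the DeepWalk/word2vec notation (Φ is the mapping to representations, w the window size), the walk likelihood and its SkipGram relaxation are roughly:

$$\Pr\bigl(v_i \mid \Phi(v_1), \ldots, \Phi(v_{i-1})\bigr)
\quad\longrightarrow\quad
\min_{\Phi}\; -\log \Pr\bigl(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\bigr)$$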

Page 18:

Hierarchical Softmax

Calculating Pr(vj | Φ(vi)) directly involves O(|V|) operations for each update. Instead:

[Figure: balanced binary tree over the vertices, with internal nodes C1, C2, C3]

Each of {C1, C2, C3} is a logistic binary classifier.

● Consider the graph vertices as leaves of a balanced binary tree.

● Maximizing Pr(vj | Φ(vi)) is equivalent to maximizing the probability of the path from the root to the leaf vj; specifically, maximizing the product of the binary decisions made along that path (see the sketch below).
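A tiny numpy sketch of that path probability (illustrative; the tree layout, argument names and weights are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(phi_v, path_nodes, path_directions, node_weights):
    """Probability assigned to one leaf (vertex) under hierarchical softmax.

    phi_v           -- embedding of the input vertex, shape (d,)
    path_nodes      -- indices of the internal tree nodes from root to leaf
    path_directions -- +1 / -1 for taking the right / left child at each node
    node_weights    -- weight vectors of the internal nodes, shape (n_internal, d)
    """
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        # each internal node is a logistic binary classifier (C1, C2, C3 in the slide)
        prob *= sigmoid(direction * node_weights[node] @ phi_v)
    return prob

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 16))       # classifiers C1, C2, C3
phi = rng.normal(size=16)
print(path_probability(phi, path_nodes=[0, 1, 2], path_directions=[+1, -1, +1], node_weights=weights))
```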

Page 19:

DeepWalk algorithm

● Learned parameters:
  ○ Vertex representations
  ○ Tree binary classifier weights
● Generate truncated random walks.
● Randomly initialize the representations.
● For each classifier in {C1, C2, C3}, calculate the loss function.
● Use stochastic gradient descent to update both the classifier weights and the vertex representations simultaneously (a minimal end-to-end sketch follows below).
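A minimal DeepWalk-style sketch (not the authors' reference implementation; it assumes networkx and gensim >= 4.0, and all parameter values are placeholders):

```python
# Generate truncated random walks, then train SkipGram with hierarchical softmax.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_length=40, seed=0):
    rng = random.Random(seed)
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(v) for v in walk])     # gensim expects token strings
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=0,
                 sg=1, hs=1, workers=4, epochs=5)    # sg=1: SkipGram, hs=1: hierarchical softmax
embedding = {v: model.wv[str(v)] for v in G.nodes()}
```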

Page 20:

Multi-Label Classification

Node classification on network

[Figure: example graph with two labelled communities, UCLA students and Googlers]

INPUT

A partially labelled graph with node attributes.

OUTPUT

Attributes for nodes which do not have them.

Parallelizability

Page 21:

DeepWalk’s message

Language Modeling techniques can be used for online learning of network representations.

Page 22:

node2vec: how to define neighborhood?

Goal: learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes.

● Homophily: nodes that belong to similar communities should be embedded closely together

● Structural Equivalence: nodes that have similar structural roles in networks should be embedded closely together

Visualization of the Les Misérables network (Victor Hugo); each node is a character in the novel.

Page 23:

BFS or DFS

BFS → Structural Equivalence

DFS → Homophily

Structural equivalence is based on immediate neighborhoods, providing a microscopic view.

Homophily reflects a macroscopic view and infers communities based on distances.

Page 24:

Biased random walk

Random walk distribution (ci is the i-th node in the walk, Z a normalizing constant):

P(ci = x | ci-1 = v) = π_vx / Z if (v, x) ∈ E, and 0 otherwise

Search bias (t is the node visited before v, d_tx the shortest-path distance from t to x):

α_pq(t, x) = 1/p if d_tx = 0,  1 if d_tx = 1,  1/q if d_tx = 2,   with π_vx = α_pq(t, x) · w_vx

Return parameter p: controls the likelihood of immediately revisiting a node in the walk.
In-out parameter q: balances BFS-like and DFS-like exploration.
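A small illustrative sketch of one biased step under these definitions (a sketch only; it assumes a networkx-style graph with neighbors() and has_edge(), unweighted edges, and that the walk has already taken at least one step):

```python
import random
import networkx as nx

def next_node(G, prev, curr, p=1.0, q=1.0, rng=random):
    """One node2vec-style biased step: prev is the node visited before curr (t in the paper)."""
    neighbors = list(G.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:                      # d_tx = 0: return to the previous node
            weights.append(1.0 / p)
        elif G.has_edge(prev, x):          # d_tx = 1: stay close to the previous node (BFS-like)
            weights.append(1.0)
        else:                              # d_tx = 2: move away from the previous node (DFS-like)
            weights.append(1.0 / q)
    return rng.choices(neighbors, weights=weights, k=1)[0]

G = nx.karate_club_graph()
print(next_node(G, prev=0, curr=1, p=0.5, q=2.0))
```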

Page 25:

Link prediction result

Not very significant improvement over DeepWalk and LINE

Page 26:

My thinking of node2vec

Consumes too much memory: a GitHub issue reports that the transition-probability computation fails on a 65k-node graph. Even small graphs take much longer than DeepWalk and LINE (based on my experiments).

The improvement over DeepWalk is not that significant.

Randomness introduces noise

Page 27:

Other random walk methods

LINE: Large-scale Information Network Embedding (WWW, 2015)

It uses a probabilistic decoder and loss, but it explicitly factorizes first- and second-order proximities, instead of combining them in fixed-length random walks.

Walklets (ASONAM, 2017) extends the DeepWalk algorithm to learn embeddings from random walks that "skip" or "hop" over multiple nodes at each step.

Page 28:

Outline

● Problem definition
● Factorization based algorithms --- Laplacian Eigenmaps (NIPS, 2001)
● Random walk based algorithms --- DeepWalk (KDD, 2014), node2vec (KDD, 2016)
● Deep learning based algorithms --- SDNE (KDD, 2016), etc.
● Summary

SDNE KDD 2016

Page 29:

Autoencoder

An autoencoder is an artificial neural network composed of an encoder network and a decoder network.

The aim is to learn an encoding of the input data; the typical objective is to minimize the difference between the output and the input.

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. Science (2006).

Page 30:

Embedding with autoencoder

xi is the input

yi^(K) (the deepest-layer code) can be used as the embedding.

Reverse the process (the decoder) to get the reconstruction x̂i.

Minimize the reconstruction error between the output and the input.

Page 31:

1st and 2nd proximity

2nd-order proximity loss (reconstruction): L_2nd = Σi ||(x̂i - xi) ⊙ bi||² = ||(X̂ - X) ⊙ B||_F². ⊙ is the Hadamard product; if si,j = 0 then bi,j = 1, else bi,j = β > 1, putting more penalty on the reconstruction error of the non-zero elements than on that of the zero elements.

1st-order proximity loss (supervised, similar to Laplacian Eigenmaps): L_1st = Σi,j si,j ||yi - yj||².
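A condensed PyTorch sketch of this loss structure (an illustration under the notation above, not the SDNE reference code; layer sizes, α, β, and the batch construction are placeholder values):

```python
import torch
import torch.nn as nn

class SDNELikeAE(nn.Module):
    def __init__(self, n_nodes, d):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_nodes, 256), nn.ReLU(), nn.Linear(256, d))
        self.decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n_nodes))

    def forward(self, x):
        y = self.encoder(x)          # y_i^(K): the embedding
        x_hat = self.decoder(y)      # reconstruction of the adjacency row x_i
        return y, x_hat

def sdne_like_loss(x, x_hat, y, s_batch, beta=5.0, alpha=1e-2):
    # 2nd-order proximity: weighted reconstruction error; B up-weights non-zero entries
    b = torch.where(x > 0, torch.full_like(x, beta), torch.ones_like(x))
    loss_2nd = (((x_hat - x) * b) ** 2).sum()
    # 1st-order proximity: Laplacian-Eigenmaps-style term pulling connected embeddings together
    dist = torch.cdist(y, y) ** 2                 # pairwise squared distances within the batch
    loss_1st = (s_batch * dist).sum()
    return loss_2nd + alpha * loss_1st

# usage sketch: x is a batch of adjacency rows, s_batch placeholder pairwise weights
n = 50
x = (torch.rand(8, n) > 0.9).float()
s_batch = x @ x.t()
model = SDNELikeAE(n_nodes=n, d=16)
y, x_hat = model(x)
print(sdne_like_loss(x, x_hat, y, s_batch).item())
```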

Page 32:

Model training

Training objective: the combined loss L_mix = L_2nd + α·L_1st + ν·L_reg.

Optimization: back-propagation with stochastic gradient descent.

Does not work for directed graphs.

Page 33:

Network reconstruction

● Use the node embeddings to reconstruct the network (edges); a small metric example follows below.

● Evaluation metrics

○ precision@k

○ Mean average precision (MAP)
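For concreteness, a tiny sketch of precision@k (hypothetical helper; ranked_pairs is the list of candidate node pairs sorted by predicted score):

```python
def precision_at_k(ranked_pairs, true_edges, k):
    """Fraction of the top-k predicted pairs that are real edges."""
    hits = sum(1 for pair in ranked_pairs[:k] if pair in true_edges)
    return hits / k

# Example: of the 4 highest-scoring pairs, 3 are real edges -> precision@4 = 0.75
ranked = [(0, 1), (0, 2), (1, 3), (2, 3)]
truth = {(0, 1), (0, 2), (2, 3)}
print(precision_at_k(ranked, truth, k=4))
```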

Page 34:

Other DL methods

Deep Neural Networks for Learning Graph Representations (DNGR, AAAI 2016): uses a random surfing model to capture graph structural information, and an autoencoder to extract features from the resulting graph matrix.

Variational Graph Auto-Encoders (NIPS 2016): uses a graph convolutional network as the encoder with an inner-product decoder; the reconstruction objective is essentially that of an autoencoder (plus a KL regularizer in the variational version).

Page 35:

Thinking of DL methods

Advantage: better models for learning the nonlinearity in network data.

Deep learning is powerful for representation learning.

Disadvantage: tends to have too many parameters and to be time consuming.

Not scalable to real-world network datasets.

Some DL models are still unclear in the graph domain, e.g., generative models.

Page 36:

Summary

Category | Year, Venue | Method | Time complexity | Properties preserved
Factorization | 2001, NIPS | Laplacian Eigenmaps | O(|V|²) | 1st-order proximity
Random walk | 2014, KDD | DeepWalk | O(|V|) | window-size proximity
Random walk | 2016, KDD | node2vec | O(|V|) | window-size proximity
Deep learning | 2016, KDD | SDNE | O(|V||E|) | 1st & 2nd order proximity

Page 37:

References

1. Belkin, Mikhail, and Partha Niyogi. "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering." Advances in Neural Information Processing Systems (2002): 585-591.
2. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online Learning of Social Representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
3. Grover, Aditya, and Jure Leskovec. "node2vec: Scalable Feature Learning for Networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
4. Tang, Jian, et al. "LINE: Large-scale Information Network Embedding." Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015.
5. Wang, Daixin, Peng Cui, and Wenwu Zhu. "Structural Deep Network Embedding." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
6. Hamilton, William L., Rex Ying, and Jure Leskovec. "Representation Learning on Graphs: Methods and Applications." arXiv preprint arXiv:1709.05584 (2017). (Review)
7. Goyal, Palash, and Emilio Ferrara. "Graph Embedding Techniques, Applications, and Performance: A Survey." arXiv preprint arXiv:1705.02801 (2017). (Review)

Page 38:

Thanks!