Learning Representations of Large-Scale Networksir.sdu.edu.cn/~zhuminchen/RL/tangjian2017.pdf ·...

Learning Representations of

Large-Scale Networks Jian Tang

HEC Montréal Montréal Institute of Learning Algorithms (MILA) [email protected]

1

mailto:[email protected]

2

Networks

Social Network

World Wide Web Internet-of-Things

Road Network

• Ubiquitous in real world

• A flexible and general data structure • Many types of data can be formulated as networks

3

Continental Airline: Source: http://www.airlineroutemaps.com/

4 Graph from Albert-László Barabási’ s SIGIR09 keynote

5

Gene-Regulatory Network - Abdollahi et al. Transcriptional network governing the angiogenic switch in human pancreatic cancer. PNAS vol.

104 no. 31,2007

Network Mining: Link Prediction

Does she know Richard Gere?

6

Network Mining: Ranking

Co-citation network of Sloan Digital Sky Survey - http://nevac.ischool.drexel.edu/~james/infovis09/FP-tree-visual.html

• Importance of vertices • Which is the most influential paper?

7

Network Mining: Community Detection

Coauthor

Network Information retrieval

Machine learning Data mining

Who tend to work together? - Q.Mei, D.Cai, D.Zhang, and C.Zhai, Topic Modeling with Network Regularization, WWW 2008

8

Network Mining: Classification

• d1 is democratic • d2 is republican • What can we say about d3 and d4?

- Graph from Jerry Zhu’s tutorial in ICML 07

9

Network Mining: Many Other Tasks

• Sampling

• Recommendation

• Structure analysis (e.g., structural holes)

• Evolution

• Matching

• Visualization

• „

10

11

Traditional Representations of Networks

• Suffer from data sparsity • Suffer from high dimensionality • Does not facilitate computation • Does not represent “semantics” • „

Research Question and Challenges (1)

• How to effectively and efficiently represent a Large-scale network?

• Challenges:

• Large-scale: millions of nodes and billions of edges

• Heterogeneous: directed/undirected, and binary/weighted

12

13

Learning Node Representations for Networks

• Node Classification • Node Clustering • Link Prediction • Recommendation • „

• Word representation

Unstructured text

Text representation, e.g., word and document representation, „

„

Deep learning has been attracting increasing attention „

A future direction of deep learning is to integrate unlabeled data „ The Skip-gram model is quite effective and efficient „ „

degree

network

edge

node word

document

classification

text

embedding

Word co-occurrence network

• E.g., Facebook social network -> user representations (features)-> friend recommendation

Network Node representations

Extremely Low-dimensional Representations: 2D/3D for Visualizing Networks

14

„. „.

„. „. „.

„. Networks 2D/3D Layout

Heatmaps

Network Diagrams Scatter Plots

„.

High-dimensional Data

Visualizing Scientific Papers (Tang et al. 2016)

15

10M Scientific Papers on One Slide

In some cases, we are given a collection of small networks„

16

Sentence dependency graph

Molecule networks

• A collection of small networks with different structures

„ Ego-networks

Research Question and Challenges (2)

• How to learn an effective representation for an entire network?

• Challenges:

• The structures of different networks are different

17

Outline

• Part I: Learning Node Representations of Networks

• Related Work: Laplacian Eigenmap, Word2Vec

• LINE, DeepWalk, and Node2Vec

• Extensions

• Part II: Visualizing Networks and High-Dimensional Data

• t-SNE

• LargeVis

• Pat III: Learning Representations of Entire Networks

• CNN

• Neural Message Passing

• Part IV: Summary, Challenges & Future Work

18

Outline




• Extensions


• t-SNE

• LargeVis


• CNN



19

Problem Definition: Node Embedding

20

Networks Node representations

• Given a network/graph G=(V, E, W), where V is the set of nodes, E is the set of edges between the nodes, and W is the set of weights of the edges, the goal of node embedding is to represent each node i with a vector , which preserves the structure of networks.

Related Work

• Classical graph embedding algorithms • MDS, IsoMap, LLE, Laplacian Eigenmap, „ • Hard to scale up

• Graph factorization (Ahmed et al. 2013) • Not specifically designed for network representation • Undirected graphs only

• Neural word embeddings (Bengio et al. 2003) • Neural language model • word2vec (skipgram), paragraph vectors, etc.

21

Laplacian Eigenmap (Belkin and Niyogi, 2003)

22

• Intuition: the embeddings of similar nodes should be close to each other

• Objective:

• Where , L is the Laplacian matrix , and

• Optimization by finding the eigenvectors of smallest eigenvalues of the Laplacian matrix L:

• Computationally expensive for finding eigenvectors when networks are very big

L =D-W Dii = wijj

å

Lu = lDu

Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 2003.

Skipgram (Mikolov et al. 2014)

23

• Goal: represent each word i with a vector by training from a sequence

• Distributional hypothesis (John Rupert Firth): You know a word by the company it keeps

• Skip-gram: learning word representations by predicting the nearby words within a window

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. NIPS 2014

Skipgram

24

• Objective:

• Where c is the window size • Direct optimization is computationally expensive due to the softmax function

• Negative sampling:

• Where is a noisy distribution

LINE: Large-scale Information Network Embedding (Tang et al., Most Cited Paper of WWW 2015)

• Arbitrary types of networks • Directed, undirected, and/or weighted

• Clear objective function • Preserve the first-order and second-order proximity

• Scalable • Asynchronous stochastic gradient descent • Millions of nodes and billions of edges: a coupe of hours on a single machine

25

Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15

First-order Proximity

26

• The local pairwise proximity between the nodes • However, many links between the nodes are not observed

• Not sufficient for preserving the entire network structure

Second-order Proximity

27

• Proximity between the neighborhood structures of the nodes

“The degree of overlap of two people’s friendship networks correlates with the strength of ties between them” --Mark Granovetter

“You shall know a word by the company it keeps” --John Rupert Firth

Preserving the First-order Proximity (LINE 1st)

28

• Distributions: : (defined on the undirected edge i - j)

• Objective:

p̂1(vi,v j ) =wij

wmn(m,n)ÎE

å

O1 =KL( p̂1, p1) = - wij log p1(vi,v j )(i, j )ÎE

å

: Embedding of i

Empirical distribution of first-order proximity:

Model distribution of first-order proximity:

Preserving the Second-order Proximity (LINE 2nd)

29

• Distributions: (defined on the directed edge i -> j)

• Objective:

p̂2(v j | vi ) =wij

wikkÎV

å

O2 = KL( p̂2(× | vi ), p2(× | vi ))i

å = - wij log p2(v j | vi )(i, j )ÎE

å

Empirical distribution of neighborhood structure:

Model distribution of neighborhood structure:

Optimization Tricks

30

• Stochastic gradient descent + Negative Sampling • Randomly sample an edge and multiple negative edges

• The gradient w.r.t the embedding with edge (i, j)

• Problematic when the variances of weights of the edges are large • The variance of the gradients are large

• Solution: edge sampling • Sample the edges according to their weights and treat the edges as binary

• Complexity: O(d*K*|E|) • Linear to the dimensionality d, the number of negative samples K, and the number of edges

Discussion

31

• Embed nodes with few neighbors • Expand the neighbors by adding higher-order neighbors • Breadth-first search (BFS) • Adding only second-order neighbors works well in most cases

• Embed new nodes • Fix the embeddings of existing nodes • Optimize the objective w.r.t. the embeddings of new nodes

DeepWalk (Perozzi et al. 2014)

32

• Learning node representations with the technique for learning word representations, i.e., Skipgram

• Treat random walks on networks as sentences

Random walk generation (generate node contexts through random search)

Predict the nearby nodes in the random walks

Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14

DeepWalk (Perozzi et al. 2014)

33

• Optimization: hierarchical softmax (Morin, Bengio, 2005)

• Assign the nodes to the leaves of a binary tree • Predict the node => predict a path in the tree

• Make binary decisions along the path

• Complexity from |V| to log(|V|)

Predict the nearby nodes in the random walks (v1->v3, v1->v5)

v1

Hierarchical softmax

Node2Vec (Grover and Leskovec, 2016)

34

• Find the node context by a hybrid strategy of • Breadth-first Sampling (BFS): homophily • Depth-first Sampling (DFS): structural equivalence

Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. KDD’16

Expand Node Contexts with Biased Random Walk

35

• Biased random walk with two parameters p and q • p: controls the probability of revisiting a node in the walk • q: controls the probability of exploring “outward” nodes • Find optimal p and q through cross-validation on labeled data

• Optimized through similar objective as LINE with first-order proximity

Comparison between LINE, DeepWalk, and Node2Vec

36

Algorithm Neighbor Expansion

Proximity Optimization Labeled Data

LINE BFS 1st or 2nd Negative Sampling No

DeepWalk Random 2nd Hierarchical Softmax

No

Node2Vec BFS + DFS 1st Negative Sampling Yes

Applications

37

• Node classification (Perozzi et al. 2014, Tang et al. 2015a, Grover et al. 2015 )

• Node visualization (Tang et al. 2015a) • Link prediction (Grover et al. 2015) • Recommendation (Zhao et al. 2016) • Text representation (Tang et al. 2015a, Tang et al. 2015b)

• „

Node Classification

38

Table: Results on Flickr Network (Perozzi et al. 2014)

• social network => user representations (features) => classifier

• Community identities as classification labels

DeepWalk > Laplacian Eigenmap

Node Classification

39

Table: Results on Youtube Network(Tang et al. 2015a)

• social network => user representations (features) => classifier

• Community identities as classification labels

LINE(1st + 2nd ) >LINE(2nd )> DeepWalk > LINE(1st )

Node Visualization (Tang et al. 2015a)

40

• Coauthor network: authors from three different research fields

“Data mining” “Machine learning” “Computer vision”

(a) Graph factorization (b) DeepWalk (c) LINE(2nd )

Link Prediction (Grover and Leskovec, 2016)

41

Node Embeddings (LINE, DeepWalk, node2vec) > Jaccard’s Coefficient > Adamic-Adar

Table: Results of Link Prediction

Unsupervised Text Representation (Tang et al. 2015a)

degree

network

edge

node word

document

classification

text

embedding


Unstructured text


„

Deep learning has been attracting increasing attention „

A future direction of deep learning is to integrate unlabeled data „

The Skip-gram model is quite effective and efficient „ Information networks encode the relationships between the data objects „ text

information

network

word …

classification

doc_1

doc_2

doc_3

doc_4 …

…

Word-document network

42

• Construct text networks from unstructured text

Word Analogy (Tang et al. 2015a)

• Entire Wikipedia articles => word co-occurrence network (~2M words, 1B edges)

LINE(2nd) > LINE(1st)

LINE(2nd) > SkipGram

• Size of word co-occurrence networks does not grow linearly with data size • Only the weights of edges change

43

Algorithm Semantic(%) Syntactic(%) Overall

GF 61.38 44.08 51.93

SkipGram 69.14 57.94 63.02

LINE(1st) 58.08 49.42 53.35

LINE(2nd ) 73.79 59.72 66.10

Text Classification on Long Documents (Tang et al. 2015a)

• Word co-occurrence network (w-w) , word-document network (w-d) to learn the word embedding

• Document embedding as average of word embeddings in the document

LINE(w-w) > SkipGram LINE(w-d) > LINE(w-w)

44

Accuracy

Text Classification on Short Documents (Tang et al. 2015a)

• Word co-occurrence network (w-w) , word-document network (w-d) to learn the word embedding

• Document embedding as average of word embeddings in the document

LINE(w-w) > SkipGram LINE(w-w) > LINE(w-d)

45

Accuracy

Extensions

• Other variants

• Multi-view networks

• Networks with node attributes

• Heterogeneous networks

• Task-specific network embedding

46

Extensions

• Other variants





47

Other Variants

• Leverage global structural information (Cao et al. 2015)

• Non-linear methods based on autoencoders (Wang et al. 2016)

• Directed network embedding (Ou et al. 2016)

• Signed network embedding (Wang et al. 2017)

• Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. CIKM’ 2015.

• Mingdong Ou, Peng Cui, Jian Pei, Wenwu Zhu. Asymmetric transitivity preserving graph embedding. KDD, 2016.

• Daixing Wang, Peng Cui, Wenwu Zhu. Structural deep network embedding. KDD, 2016. • Suhang Wang, Jiliang Tang, Charu Aggarwal, Yi Chang, Huan Liu. Signed network embedding in social media.

SDM 2017.

48

Extensions

• Other variants





49

Multi-view Network Embedding ( Qu and Tang et al. 2017)

• Multiple types of relationships between nodes exist in real-world

networks

• E.g., following, retweeting relationships between users in Twitter

• Each type of relationship => a view of the network

• Multiple types of relationships => multi-view networks

• Infer robust node representations with multiple views

• Complementary information in different views

Figure: Networks with multiple views Meng Qu, Jian Tang, Jingbo Shang, Xiang Ren, Ming Zhang, and Jiawei Han. Learning Distributed Node

Representations for Networks with Multiple Views. To appear in CIKM 2017.

50

A Co-Regularization Approach ( Qu and Tang et al. 2017)

• Each node has a robust representation and multiple view-specific

representations

• Preserve the structure of different views through view-specific representations

• Promote the collaboration of different views to vote for robust representations

• Regularize the view-specific representations

51

A Co-Regularization Approach ( Qu and Tang et al. 2017)

• Objective

View-specific objective

Regularization objective

:robust embedding of node i

:view-specific node embedding of node i

:weights of views of node i

52

Learning the Weights of the Views via Neural Attention ( Qu and Tang et al. 2017)

• Define the attention weight of views for each node:

• According to the regularization term:

: concatenation of view-specific embeddings of node i

• Learning the weights with supervised data, e.g., node classification

: embedding of view k

53

Results of Multi-label Node Classification

54

MVE > MVE-NoAttn > LINE/node2vec

Without learning the weights of views

Single best view

Extensions

• Other variants





55

Networks with Node Attributes (Yang et al. 2015, N.Kipf et al. 2016, Liao et al. 2017)

• Networks with text information (Yang et al. 2015)

• Networks with attributes (Liao et al. 2017)

• Gender, location, text, „

• Variational graph autoencoders (N.Kipf et al. 2016)

• Encode the node with neighborhood structures and attributes

• Decode the neighborhood structures

56

• Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, Edward Y. Chang. Network representation learning with rich text information. IJCAI 2015.

• Thomas N.Kipf and Max Welling. Variational Graph Auto-encoders. NIPS Workshop 2016. • Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. Attributed Social Network Embedding. arXiv, 2017.

Extensions

• Other variants





57

Heterogeneous Network Embedding via Deep Architectures (Chang et al. 2015)

• Heterogeneous networks of images and text

• Make the embeddings of linked objects close to each other

• image-image, image-text, text-text

58

Siyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, Thomas S. Huang. Heterogeneous network embedding via Deep Architectures. KDD’15

Heterogeneous Star Network Embedding (Chen et al. 2017)

• Heterogeneous Star Networks

• Paper, keywords, authors, venues

• Aims to embed the center objects

• paper

59

Ting Chen and Yizhou Sun, "Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. WSDM’17.

Extensions

• Other variants





60

Semi-supervised Text Representation (Tang et al. 2015b) • Heterogeneous text network

• Word-word, word-document, and word-label networks

• Different levels of word co-occurrences: local context-level, document-level, label-level

• Learning word embeddings through jointly training the heterogeneous networks

• Document embeddings as the average of word embeddings

Unstructured text


„

Deep learning has been attracting increasing „

A future direction of deep learning is to integrate „

The Skip-gram model is quite effective and efficient „ Information networks encode the relationships

label document

label

label

null

null

null

degree

network

edge

node word

document

classification

text

embedding


text

information

network

word …

classification

doc_1

doc_2

doc_3

doc_4 …

…

Word-document network

text

information

network

word …

classification

label_2

label_1

label_3 … …

Word-label network

Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. KDD’15.

61

Results on Text Classification of Long Documents

20newsgroup Wikipedia IMDB

Type Algorithm Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1

Unsupervised LINE(G_wd) 79.73 78.40 80.14 80.13 89.14 89.14

Predictive embedding

CNN 80.15 79.43 79.25 79.32 89.00 89.00

PTE(G_wl) 82.70 81.97 79.00 79.02 85.98 85.98

PTE(G_ww+G_wl) 83.90 83.11 81.65 81.62 89.14 89.14

PTE(G_wd+G_wl) 84.39 83.64 82.29 82.27 89.76 89.76

PTE(joint) 84.20 83.39 82.51 82.49 89.80 89.80

PTE > CNN

62

Results on Text Classification of Short Documents

20newsgroup Wikipedia IMDB

Type Algorithm Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1

Unsupervised LINE(G_ww) 74.22 70.12 71.13 71.12 73.84 73.84

Predictive embedding

CNN 76.16 73.08 72.71 72.69 75.97 75.96

PTE(G_wl) 76.45 72.74 73.44 73.42 73.92 73.91

PTE(G_ww+G_wl) 76.80 73.28 72.93 72.92 74.93 74.92

PTE(G_wd+G_wl) 77.46 74.03 73.13 73.11 75.61 75.61

PTE(joint) 77.15 73.61 73.58 73.57 75.21 75.21

PTE CNN »

63

Semi-supervised Classification with Graph Convolutional Networks (Kipf et al. 2017)

• Task: Given a graph G = (V, E), and the features of nodes

, and the labels of a subset of nodes are given.

• Learning the node representations through multi-layer graph convolutional networks

• Combining node representations (self-link) and representations of neighbors

X Î RN´D

Thomas N.Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR’17.

64

Multi-layer Graph Convolution Neural Networks

• Final objective:

65

• Starting from the node features • Define the propagation rule

Add the self-links

= Normalize the matrix

Nonlinear propagation

H (0) = X

Experimental Results (Kipf et al. 2017)

66

GCN > Label Propagation

Outline




• Extensions


• t-SNE

• LargeVis


• CNN



67

Extremely Low-dimensional Representations: 2D/3D for Visualizing Networks

„.

„. „. „.

„. Networks 2D/3D Layout

„.

High-dimensional Data

K-Nearest Neighbor Graph (KNN-G) Construction

Graph Layout

68 Heatmaps

Network Diagrams Scatter Plots

t-SNE (Maarten and Hinton, 2008, 2014 )

• State-of-the-art algorithms for high-dimensional data visualization

• Deployed in Tensorflow for visualizing the representations learned by deep neural networks.

Visualizations of MNIST Data TensorBoard Visualizations by t-SNE

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR, 2008. L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. JMLR, 2014.

69

Constructing the K-nearest Neighbor Graph

• Finding the nearest neighbors for all the data points

• Vantage-point tree

• Calculating the similarities between the data points

• Complexity: O(NlogN) w.r.t. the number of data points N

: nearest neighbors of node i

70

K-nearest Neighbor Graph Layout

• Similarity between two data points i and j in low-dimensional space is defined as:

• Objective: minimize the distance between the similarities defined in the high-dimensional spaces and low-dimensional spaces

• The complexity: O(NLogN) (Maarten, 2014).

: low-dimensional representations (coordinates) of node i

71

Limitations of t-SNE

• K-NNG construction: complexity grows O(NlogN) to the number of data points N

• Graph layout: complexity is O(NlogN) • Very sensitive parameters

72

LargeVis (Tang et al., Best Paper Nomination at WWW 2016) • Efficient approximation of K-NNG construction

• 30 times faster than t-SNE (3 million data points) • Better time-accuracy tradeoff

• Efficient probabilistic model for graph layout • O(NlogN) -> O(N) • 7 times faster than t-SNE (3 million data points) • Better visualization layouts • Stable parameters across different data sets

Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing Large-scale and High-dimensional Data. WWW’16

73

Random Projection Trees

• Partition the whole space into different regions with multiple hyperplanes

74


75


76


77


78

K-NNG Construction

• Search nearest neighbors through traversing trees • Only data points in the leaf are considered as nearest neighbors

• Multiple trees are usually used to improve the accuracy • e.g., hundreds

79

Reduce the Number of Trees

• Construct a less accurate K-NNG with a few trees • Iteratively refine the K-NNG through “neighbor exploring”

• “A neighbor of my neighbor is also likely to be my neighbor” • Second-order neighbors are also treated as candidates of first-order neighbors

80

It Works!

• X axis: accuracy of K-NNG • Y axis: running time (minutes) • tSNE: 16 hours (95% accuracy) • LargeVis: 25 minutes

• >30 times faster than t-SNE

LargeVis

t-SNE

Random projection trees

81

Learning the Layout of KNN Graph

• Preserve the similarities of the nodes in 2D/3D space • Represent each node with a 2D/3D vector • Keep similar data close while dissimilar data far apart

• Probability of observing a binary edge between nodes (i,j):

• Likelihood of observing a weighted edge between nodes (i,j):

p(eij =wij ) = p(eij =1)wij

82

A Probabilistic Model for Graph Layout

• Objective:

• Randomly sample some negative edges • Optimized through asynchronous stochastic gradient descent • Time complexity: linear to the number of data points

γ : an unified weight assigned to negative edge

O = p(eij =wij )(i, j )ÎE

Õ (1- p(eij =wij )(i, j )ÎE

Õ )g

83

It Works Too!

• Time complexity • t-SNE: O(NlogN) • LargeVis: O(N)

• On 3 million data points • t-SNE: 45 hours • LargeVis: 5.6 hours • Seven times faster LargeVis

t-SNE

84

Visualization Quality

• Metric: classification accuracy with KNN

on 2D space

• Configuration: • LargeVis with default parameters

• t-SNE with default and optimal parameters

(tuned per data set)

• LargeVis t-SNE with optimal parameters

• LargeVis >> t-SNE with default parameters

• Parameters of LargeVis are very stable

LargeVis t-SNE (optimal parameters)

t-SNE (default parameters)

»

85


86


87

Computer Science Mathematics

88

Physics Biology

89

Computer Science vs. Mathematics

90

Computer Science vs. Physics

91

Wikipedia Articles (color: semantic cluster)

92

LiveJournal Network (color: community)

93

Computer Science Authors (color: community)

94

Summary • LargeVis: a new technique for visualizing networks

and high-dimensional data

• A better tool than t-SNE.

• >7 times faster than t-SNE on three million data points

95

Software

https://github.com/tangjianpku/LINE

https://github.com/lferry007/LargeVis

LINE: (C++)

LargeVis : (C++&Python)

https://github.com/elbamos/largeVis

(307 stars, released since 2015.3)

(307 stars, released since 2016.7)

LargeVis Tutorial: https://jlorince.github.io/viz-tutorial/

Interactive Visualization:

https://github.com/NLeSC/DiVE

R version in CRAN:

Our release:

Other tools based on our implementation:

„

96

https://github.com/tangjianpku/LINE







https://jlorince.github.io/viz-tutorial/






Outline


• Laplacian Eigenmap

• Word2Vec



• t-SNE

• LargeVis


• CNN



97

Learning Representations of Entire (Small) Networks

98

• A collection of small networks with different structures • How to represent an entire (small) network

Sentence dependency graph

Molecules

„ Ego-networks

Convolutional Neural Networks

99

PATCHY-SAN [Niepert+ '16]

Node selection (w=4 nodes)

c f g

c

d

b

a

f

e

g

h

i

d

c

d

b

f

e

c

f

e

g

d f

g i

Neighborhood assembly (at least k=4 nodes)

Neighborhood normalization (exactly k=4 nodes)

1

4

3

2

3

1

2

4

3 2

1 4

j

c

d

g i

j

1

2 3

4

100

PATCHY-SAN [Niepert+ '16]

1 2 3 4 a1 a2 „ an

Attributes

Nodes

Apply CNNs

101

Normalized neighborhood

1

4

3

2 1

2 3

4

1 2 3 4 a1 a2 „ an

„

„

Neural Message Passing

102

Structure2vec (Dai et al. 2016)

103

• ũi = σ (W1xi+W2Σ j∈N(i) ũj+W3Σ j∈N(i)xj)

• {W1, W2, W3} are parameters

• N(i) are neighbors of i

• σ is an activation function

Message passing

Structure2vec (Dai et al. 2016)

104

Message passing

• We have embedding vectors {ui}

• ũi = σ (W1xi+W2Σ j∈N(i) ũj+W3Σ j∈N(i)xj)

• Represent a graph by Σ i ũi

• Minimize the empirical square loss

• (y - θ Tσ (Σ i ũi))2

• y is the graph label

• θ is a parameter

Neural Message Passing Framework (Gilmer et al. 2017)

105

• Message passing phrase

• Readout phase

v

w1 w2

w3 w4

: message function

: vertex update function

: readout function

Gated Graph Neural Networks (Li et al. 2016)

106

• Message function:

• Vertex updating function

• Readout function

a learned matrix for each type of edge (discrete type)

Neural networks

Neural Message Passing (Gilmer et al. 2017)

107

• Message function:

• Vertex updating function

• Readout function

or Set2Set (Vinyals et al. 2015)

A neural network maps a edge vector (continuous) to a matrix

Permutation invariant

Neural Message Passing (Gilmer et al. 2017)

108

• Add virtual edges

• Add a specific type of edge between nodes without edges

• Add a master node

• Connect to every node

Table :Results on the task of predicting chemical properties (QM9 Dataset)

Summary

• Convolutional Neural Network

• Define “receptive field” on graphs

• Neural message passing

• Define message passing, node updating, and graph pooling functions

109

Outline


• Laplacian Eigenmap

• Word2Vec



• t-SNE

• LargeVis


• CNN

• Neural message passing


110

Summary

• Network representation: a new methodology for analyzing and mining networks

• State-of-the-art approaches for node representation learning


• Moving towards to task-specific node representations (e.g., PTE and GraphConv)

• Visualizing large-scale networks and high-dimensional data

• LargeVis

• Sales up to tens of millions of nodes or data points

• Learning representations of network substructures

• CNN, Neural message passing 111

Challenges & Future Work

• Scalability

• How to scale up to networks with billions of nodes

• Hierarchical representations

• How to learn hierarchical representations of networks

• Dynamic


• Multiple types of nodes, multiple types of edges

• Learning isomorphism-invariance representations of entire networks

• „

112

References ### Node Embeddings ###

[Belkin et al. 2003] Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 2003.

[Mikolov et al. 2014] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. NIPS 2014.

[Tang et al. 2015a] Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15

[Perozzi et al. 2014] Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14

[Grover et al. 2016] Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. KDD’16

[Cao et al. 2015] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: learning graph representations with global structural information. CIKM’15.

[Qu et al. 2017] Meng Qu, Jian Tang, Jingbo Shang, Xiang Ren, Ming Zhang, and Jiawei Han. Learning Distributed Node Representations for Networks with Multiple Views.

[Yang et al. 2015] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, Edward Y. Chang. Network representation learning with rich text information. IJCAI 2015.

[Kipf et al. 2016] Thomas N.Kipf and Max Welling. Variational Graph Auto-encoders. NIPS Workshop 2016.

[Liao et al. 2017]Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. Attributed Social Network Embedding. arXiv, 2017.

[Tang et al. 2015b] Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. KDD’15.

[Kipf et al. 2017]Thomas N.Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR’17.

[Chang et al. 2017] Siyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, Thomas S. Huang. Heterogeneous network embedding via Deep Architectures. KDD’15

[Chen et al. 2017] Ting Chen and Yizhou Sun, "Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. WSDM’17.

113

References [Wang et al. 2017] Daixin Wang, Peng Cui, Wenwu Zhu. Structural deep network embedding. KDD, 2016.

### Node Visualizations ###

[Maaten et al. 2008] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR, 2008.

[Maaten et al. 2014] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. JMLR, 2014.

[Tang et al. 2016] Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing Large-scale and High-dimensional Data. WWW’16

### Graph Embeddings ###

[Li et al. 2016] Cheng Li, Xiaoxiao Guo, and Qiaozhu Mei. 2016. DeepGraph: Graph Structure Predicts Network Growth. arXiv preprint arXiv:1610.06251 (2016).

[Niepert et al. 2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In Proceedings of the 33rd annual international conference on machine learning. ACM.

[Li et al. 2017] Cheng Li, Jiaqi Ma, Xiaoxiao Guo, and Qiaozhu Mei. 2017. DeepCas: an End-to-end Predictor of Information Cascades. In Proceedings of the 26th international conference on World wide web.

[Dai et al. 2016] Dai, Hanjun, Bo Dai, and Le Song. "Discriminative embeddings of latent variable models for structured data." International Conference on Machine Learning. 2016.

[Gilmer et al. 2017] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl. Neural message passing for quantum chemistry. arXiv, 2017.

[Li et al. 2016] Y. Li, D. Tarlow, M. Brockschmidt, R. Zemel. Gated graph sequence neural networks. ICLR, 2016.

114

[email protected]

115




Optimization

• The gradient w.r.t. the embedding

• Z is the partition function:

• The complexity w.r.t. the number of data points N is O(N^2)

• Too expensive!

116

Barnes-Hut Approximation

• Rewriting the gradient as:

Attractive forces Complexity: linear to the number of edges

Repulsive forces Complexity: O(N^2)

• Constructing a quadtree of the nodes according to the current low-dimensional representations

Sum of node i and nodes in a cell:

From O(N^2) to O(NLogN)!

117

Learning Representations of Large-Scale Networksir.sdu.edu.cn/~zhuminchen/RL/tangjian2017.pdf ·...

Documents

Transcript of Learning Representations of Large-Scale Networksir.sdu.edu.cn/~zhuminchen/RL/tangjian2017.pdf ·...