Heterogeneous Information Network Clustering Methods

63
Heterogeneous Information Network Clustering Methods Data Mining Lab,Big Data Research Center Reportor:Wenbao Li Wenbao Li

Transcript of Heterogeneous Information Network Clustering Methods

Page 1: Heterogeneous Information Network Clustering Methods

Heterogeneous Information Network Clustering Methods

Data Mining Lab,Big Data Research Center

Reportor:Wenbao Li

Wenbao Li

Page 2: Heterogeneous Information Network Clustering Methods

Background

Graph(network) clustering has attracted increasing research interest.

For homogeneous network:Spectral clustering,symmetric Non-negative Matrix Factorization,Markov clustering,Ncut,Mcut,...

However,heterogeneous information network clustering are concentrated until recently.

Wenbao Li

Page 3: Heterogeneous Information Network Clustering Methods

Background Heterogeneous Information network: Is an information network composed of multiple types of objects. Consists of some partial attributes within types of objects and links between different types of objects. Examples: DBLP(author,paper,conference,term) Social Network(people,groups,books,blogs,posts,etc) Movies(movie,actor,director,) Newsgroup(news,writer,group)

Wenbao Li

Page 4: Heterogeneous Information Network Clustering Methods

Part One

Wenbao Li

Page 5: Heterogeneous Information Network Clustering Methods

Related Work RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

ENetClus(Manish Gupta,Charu C. Aggarwal,Jiawei Han,Yizhou Sun,ASONAM'11)

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

CGC(Wei Cheng,Xiang Zhang,Zhishan Guo,Yubao Wu,KDD'13)

SI-Cluster(Yang Zhou,Ling Liu,KDD'13)

ComClus(Ran Wang,Chuan Shi,Philip S. Yu,Bin Wu,PAKDD'13)

Wenbao Li

Page 6: Heterogeneous Information Network Clustering Methods

RankClus

Wenbao Li

Page 7: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Idea:Iteratively clustering and ranking which map the target type into a new K-dimensional feature space according to which the clustering is performing.

Advantage: improve the performance of clustering and ranking simultaneously. avoiding to calculate the pairwise similarity of target objects.

Wenbao Li

Page 8: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Some Definitions

Bi-type Information Network

Wenbao Li

Page 9: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Some Definitions

Ranking Function

Wenbao Li

Page 10: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Some Definitions

Conditional Rank and Within-Cluster rank

Wenbao Li

Page 11: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Some Definitions

Target type:the type we are going to cluster. Attribute type:the other types.

Assumptions: 0XXW

Wenbao Li

Page 12: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Flow:① Give an initial partition of target object X② Compute the conditional ranking ③ Estimate the parameter④ Form a new feature space⑤ Calculate the center of each cluster according to the new feature space(mean).⑥ According the new feature space ,assign each target object into the nearest cluster

,, || kk XYXX rr

KkmikiKm ,...,2,1;,...,2,1, KkmikiKm ,...,2,1;,...,2,1,

Wenbao Li

Page 13: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Ranking Score——Ranking function Simple Rank

Authority Rank Give ranking scores according some authority rules.

Wenbao Li

Page 14: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Wenbao Li

Page 15: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Estimate the assignment parameterSet

EM to estimate KkmikiKm ,...,2,1;,...,2,1,

Wenbao Li

Page 16: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Cluster Centers and Distance MeasureK-dimensional vector:Center of cluster k:

Distance measure:

Wenbao Li

Page 17: Heterogeneous Information Network Clustering Methods

RankClus(Yizhou Sun, Jiawei Han , Peixiang Zhao, Zhijun Yin , Hong Cheng, Tianyi

Wu,EDBT'09)

Extensions to arbitrary multi typed information network

One-type:set Y = X. Bi-type with :Add type Z,Z = X. map to 2K-dimensional feature space. Multi-typed: N types. map to NK-dimensional feature space.

0XXW

Wenbao Li

Page 18: Heterogeneous Information Network Clustering Methods

NetClus

Wenbao Li

Page 19: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

NetClustering(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09) Idea:Find a new K-dimensional feature space by ranking which are determined by a probability generative model. Advantage: suit for multi types objects. cluster attribute type object.

Wenbao Li

Page 20: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Definitions

Information Network

Wenbao Li

Page 21: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Definitions

Star Network Schema

Wenbao Li

Page 22: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Definitions

Net-Cluster

Wenbao Li

Page 23: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Flow:① Give an initial partition of G,which is K clusters.And induce net-clusters from the partition.② Build ranking-based probabilities generative model for each net-cluster,i.e.③ Calculate the posterior probabilities for each target object and then adjust their cluster assignment according to the new measure defined by the posterior probabilities to each cluster④ Repeat Step 2 and 3 until the cluster does not change significantly,i.e.⑤ Calculate the posterior probabilities for each attribute object

in each net-cluster

KkkC 10

KktkCxP 1)|(

))|(( xCp tk

Kktk

Kk

tk

Kkk CCC 1

111

*

))|(( * xCp k

Wenbao Li

Page 24: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Induce net-clusters①Initial:random②Other:according to the definition of net-clusters.

Wenbao Li

Page 25: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Probabilistic Generative Model for target objects①Given an attribute object x and its type Tx ,the probability to visit x in G is

②Assumption:③Generate a paper di in the network:

Wenbao Li

Page 26: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Posterior Probability for target Objects and Attribute Objects①Generative probability of a target object:

②Smoothing handling:

③Posterior probability:EM

Wenbao Li

Page 27: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

Posterior probability for attribute objects

Wenbao Li

Page 28: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

E.:Ranking distribution for Attribute Objects①Simple Ranking

②Authority RankingI. II. As the following PROPERTY2

Wenbao Li

Page 29: Heterogeneous Information Network Clustering Methods

NetClus(Yizhou Sun,Yintao Yu,Jiawei Han,KDD'09)

III. According to the DBLP rules

Wenbao Li

Page 30: Heterogeneous Information Network Clustering Methods

GenClus

Wenbao Li

Page 31: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Idea:cluster with incomplete attributes across objects and consider different types of links which may have variable importance.

Advantage: Based strength-aware

of different links Probabilistic clustering

model

Wenbao Li

Page 32: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Some Definitions

Heterogeneous IN:G = (V,E,W)Mapping function from object to object:

A is object type set.Mapping function from link to link type:

R is link type set.Relation from type A to type B:Attributes:

AV :

RE:

ABBA 1 RR

Wenbao Li

Page 33: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Some Definitions

Example1:DBLP

Wenbao Li

Page 34: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Some Definitions

Example2:Weather sensor network

Wenbao Li

Page 35: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Some Definitions

Formation

Wenbao Li

Page 36: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Clustering Model:① Two properties:① attribute generated with high probability② links beteen objects which have similar clustering probability.② likelihood function of attribute:

two tasks

Wenbao Li

Page 37: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Task One_Modeling Attribute Generation

assume the attribute values have two type:text,numerical① Text attribute with categorical distribution

② Numerical attribute with Gaussian distribution

Wenbao Li

Page 38: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Task One_Modeling Attribute Generation Multiple Attributes

assume the independence among these attribute,then

Wenbao Li

Page 39: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Task Two_Modeling Structural Consistency The more similar the two objects are in terms of cluster membership,the more likely they are connected by a link.

① Consistency function

② Probability of

partition function(配分函数)

Wenbao Li

Page 40: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Unified Model(overall goal)

Wenbao Li

Page 41: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Algorithm Flow:① Initial:Set initial strength of different types of links with equally importance.② Clustering optimization step:Fix the link type weights to the best value ,determined in the last iteration.Then optimize the objective function with regard to and ,that is

EM

Wenbao Li

Page 42: Heterogeneous Information Network Clustering Methods

GenClus(Yizhou Sun,Charu C.Aggarwal,Jiawei Han,VLDB'12)

Algorithm Flow:③ Link type strength learning step:Fix the clustering configuration parameters

corresponding to the values determined in the last step, and use it to determine the best value

of ,which is consistent with current clustering results.

③ Iteratively repeat step 2 and 3 until convergence is achieved.

Newton-Raphson

Wenbao Li

Page 43: Heterogeneous Information Network Clustering Methods

PathSelClus

Wenbao Li

Page 44: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Idea:integrating meta-path selection and user-guided clustering to improve both the performance of clustering and learn the weights of different meta-paths.

Wenbao Li

Page 45: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Example

Wenbao Li

Page 46: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Meta-path selection:M-PS problem is then to determine which meta-paths or their weighted combination to use for a specific clustering task.

User-Guided Clustering:UGU is clustering under the condition of limited object seeds in each cluster given by users.

Wenbao Li

Page 47: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Input: The target type for clustering,type T. The number of cluster K. The object seeds for each cluster, A set of M meta-paths starting from type T, Output: The weight of each meta-path The clustering results

Wenbao Li

Page 48: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Modeling the Relation Generation

Modeling the Users Guidance Dirichlet Distribution

Uniform Distribution

Wenbao Li

Page 49: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Modeling the weights for meta-path selection

by evaluating the consistency between its relation matrix and the user-guided clustering result .

mW m

Wenbao Li

Page 50: Heterogeneous Information Network Clustering Methods

PathSelClus(Yizhou Sun,Brandon Norick,Jiawei Han,Xifeng Yan,Philip S. Yu,KDD'12)

Unified Model

EM optimization

Wenbao Li

Page 51: Heterogeneous Information Network Clustering Methods

CGC

Wenbao Li

Page 52: Heterogeneous Information Network Clustering Methods

CGC(Wei Cheng,Xiang Zhang,Zhishan Guo,Yubao Wu,KDD'13)

Co-Regularized Multi-Domain Graph Clustering

Idea:Based on NMF,deal with cross-domain with many to many weighted relations. Use loss function regularization

Wenbao Li

Page 53: Heterogeneous Information Network Clustering Methods

CGC(Wei Cheng,Xiang Zhang,Zhishan Guo,Yubao Wu,KDD'13)

Co-Regularized Multi-Domain Clustering

Residual of sum of squares loss function

B).

kkkkA

hHHALDomainleS

d

ajjF

T

...)..2

maxarg)(min:ing.1

21

)(2)()()()( ,

Wenbao Li

Page 54: Heterogeneous Information Network Clustering Methods

CGC(Wei Cheng,Xiang Zhang,Zhishan Guo,Yubao Wu,KDD'13)

Co-Regularized Multi-Domain Clustering Joint Matrix optimization

Wenbao Li

Page 55: Heterogeneous Information Network Clustering Methods

SI-Cluster

Wenbao Li

Page 56: Heterogeneous Information Network Clustering Methods

SI-Cluster(Yang Zhou,Ling Liu,KDD'13)

Idea:define new vertex similarity metric in terms of self-influence similarity and co-influence similarity,and then according the similarity calculated from the social graph and associated activity graph,combine into total social influence and do clustering.

Wenbao Li

Page 57: Heterogeneous Information Network Clustering Methods

ComClus

Wenbao Li

Page 58: Heterogeneous Information Network Clustering Methods

ComClus(Ran Wang,Chuan Shi,Philip S. Yu,Bin Wu,PAKDD'13)

deals with the hybrid network with heterogeneous and homogeneous network simultaneously.

applies star schema with self loop to organize the hybrid network and uses a probability model to represent the generative probability of objects.

Wenbao Li

Page 59: Heterogeneous Information Network Clustering Methods

Part Two

Wenbao Li

Page 60: Heterogeneous Information Network Clustering Methods

Our Ideas and Involved Difficulties

From the point of multi-view(multi-kernel) clustering

do clustering for all types of objects with constraints which are determined by the relations between different types of objects.

But how to model the clustering of single type object and the constraints?

),,.()()()(),,( 321 VPACONSVfPfAfVPAfobj

Wenbao Li

Page 61: Heterogeneous Information Network Clustering Methods

Our Ideas and Involved Difficulties

From point of regularization(such as CGC) Do clustering for a target type(such as A,author)

How to model the clustering of single type object and the regularization of its related type.

),(),()( 21 VALPALAfObj

Wenbao Li

Page 62: Heterogeneous Information Network Clustering Methods

Our Ideas and Involved Difficulties

From the point of meta-path(PathSelClus) A-P-A A-V-A A-T-A

Wenbao Li

Page 63: Heterogeneous Information Network Clustering Methods

Over

Wenbao Li