Local and global solutions to community detection - when ...

1
Local and global solutions to community detection - when resolution matters - J.-G. Young, L. H´ ebert Dufresne, A. Allard and L.J. Dub´ e epartement de physique, de g´ enie physique, et d’optique, Universit´ e Laval, Qu´ ebec, Canada Motivations Community structure occurs at all scales, but an effective limit of resolution arises in commonly used community detection algorithms. This problem manifests itself in two distinct ways: 1. relevant structures are merged with neighboring communities. 2. small communities may be entirely overlooked. New improved detections methods must address these limits for a fuller understanding of the mesoscopic structure of real systems. Our goals are to solve these two problems by 1. combining global methods to a local measure of quality. 2. using a global solution, based on the sequential detection of com- munities. Resolution problems Resolution-limit-free methods yield partitions that do not change when applied to any sub-partitions of the network [1]. We illustrate two types of widespread problems that leads to a limit of resolution by studying two popular detection methods. Modularity based algorithms: Merging These algorithms seek to optimize the modularity quality function Q = X N i,j A ij - k i k j /m δ (γ i j ) X N i,j B ij δ (s i ,s j ) over the set of all possible nodes partitions. This objective function identifies groups of nodes that share more links than expected from the Configuration Model. N number of nodes in the network A ij adjacency matrix element m number of links in the network k i degree of node i δ (s i ,s j ) 1 if both nodes are in the same group and 0 otherwise Resolution problem: Modularity has an intrinsic scale related to the size of the network. There exists many cases where subgraphs will be further divided if fed back in the algorithm, i.e. algorithms based on modularity or similar measures merge small communities. Link clustering algorithm: Shadowing Ahn et al. link clustering algorithm (LCA) [2] aggregates links into communities based on the similarity S ( e ik ,e jk ) = |n + (i) T n + (j )| |n + (i) S n + (j )| of their respective neighborhoods, where e ij denotes the link between nodes i and j , and where n + (i), n + (j ) are their neighborhoods (central nodes i and j included). Communities are built by iteratively grouping adjacent links whose similarity exceeds a given threshold S c , chosen to maximize the density D = 2 M X c m c - (n c - 1) (n c - 1)(n c - 2) (*) Resolution problem: Links in the vicinity of dense communities exhibit vanishing similarities, and only S c 0 will allow their detection. Algorithms that rely on a global parameter and that allow overlapping communities are all vulnerable to this shadowing effect. Partitioning as a detection tool The resolution limit of quality functions that feature an implicit scale can be circumvented by forcing a partition that contains more communities than the natural optimum. We take advantage of the fact that partitioning methods require the number of modules as an input. We illustrate our method using modularity optimization. Spectral partitioning based on modularity It is possible to rewrite the modularity of a partition in q modules as a matrix equation Q Tr(S T BS ), where S is a N × (q - 1) matrix of simplex vector s T chosen from a set of q vectors. Each node is associated to a row of the matrix, and a choice of vectors corresponds to a choice of partition. An approximate optimum of this expression can be found using a eigenvector based partition algorithm [3]. Meta-algorithm I: For a large range of number groups q : 1. obtain the optimal modularity partition. 2. evaluate a local measure of quality for this partition. A peak in the local quality measure corresponds to the resolution- limit free partition. For sparse networks (most real networks qualify), the worst-case scenario has a O(q 2 max N ) running time. Selection of the optimal partition 0.00 0.25 0.50 0.75 1.00 0 5 10 15 20 25 Density Modularity =24, D0.909 =12 Q0.871 q q q A resolution limit free quality mea- sure is necessary to select the opti- mal number of modules q . We pro- pose the average link density of the partition (see eq. *), since it is local to each community and very sensitive to small changes in the par- tition. The 2 measures are pictured here for the standard ring configu- ration (24 fully connected cliques of 5 nodes connect by a minimal num- ber of links). As expected, the maximum in density leads to the true partition (q = 24, D 0.909, Q 0.867), while modularity is maximal for pair-wise grouping of the cliques (q = 12, D 0.318, Q 0.871). Benchmark results Density Modularity 0.95 1.00 0.95 1.00 0.90 0.95 1.00 0.00 0.20 0.40 0.60 µ k =25 k =20 k =15 < < < < < < Quality (avg. norm, mut. info.) We have applied our algorithm to non-overlapping Lancichi- netti-Fortunato-Radicci bench- mark networks of N = 1000 nodes with a heterogeneous community sizes distribution (power-law). We find that our algorithm outperforms the standard modularity based approaches. In other words, we identify the best partition more consistently and are able to do so over a wider range of mixing parameter μ (the ratio of external edges to the total number of edges per community). Cascading detection We have observed that the inability to detect small/sparse commu- nities in the vicinity of larger/denser ones could be circumvented by removing the troublesome structures from the networks. We propose a cascading approach to community detection that address this shadowing effect. Meta-algorithm II: Using any given community detection algorithm: 1. identify large or dense communities. 2. remove the internal links of the communities identified in step 1. 3. repeat until no new significant communities are found. The first iteration targets communities that are normally detected by the algorithm of choice. This ensures that the cascading approach retains the main features of the community structure. After removing links involved in the detected communities, a new iteration of the detection algorithm is then performed on a sparser network in which previously hidden communities are now apparent. This repeated analysis does not increase the computational cost sig- nificantly, since very few iterations are usually necessary, and since the network becomes sparser at each step. 0 10 20 30 40 50 60 70 80 90 arXiv Email Gnutella Internet PGP Power Protein Words Unassigned edges (%) Final state No cascading 1 iteration 2 iterations ... Results To investigate the efficiency and the behavior of cascading detec- tion, we have applied our ap- proach to 8 real networks us- ing the LCA. Our results show that cascading detection always improves the thoroughness of the community structure detection (the average percentage of unas- signed links drops from 41.0% to 5.3%). 10 0 10 1 10 2 10 3 2 4 6 8 10 12 14 16 Count Community size (nodes) 4 th iteration No cascading 3 rd iteration 1 st iteration 2 nd iteration Word Association We find meaningful communi- ties at multiple layers of or- ganizations. Pictured here is the distribution of the community sizes for each iteration. It suggests that communities unveiled through cas- cading detection are similar to the ones detected at the first iteration (i.e. traditional use of the detection algorithm). Bar Smoky Club Visual inspection of the detected communities confirms the quality of the hidden communities, as well as our intuition of the shadow- ing effect. For example, this triangle was detected at the third iteration of the algorithm on the Words network. This structure was overlooked during the initial detection due to the high degree of its three nodes. Conclusion We have introduced two meta-algorithms for community de- tection that address two types of resolution problems. Partitioning as a detection tool By analyzing multiple partitions of the same network using a local quality function, we find resolution limit free community structures where modularity would have merged groups. Since optimized partitioning algorithms scale nicely with the size of the system, our approach is applicable to large networks. Cascading detection We recover multiple levels of meaningful organization, discovering hidden communities along the way. Our meta-algorithm is not significantly slower than traditional ones, permitting cascading detection on large networks. Future directions Extension of the partitioning method to all Potts models based algorithm, and to line-graph partitioning. Improved cascading detection: Sequential detection (i.e. detect communities one by one) A less destructive approach to link removal (e.g. avoid the re- moval of important links that are shared by communities at lower layers of organization) Acknowledgements [1] V. A. Traag et al., “Narrow scope for resolution-limit-free community detection,” Phys. Rev. E, 84 84 84, 016114, (2011). [2] Y.-Y. Ahn et al., “Link communities reveal multiscale Complexity in Networks,” Nature, 466 466 466, 761, (2010). [3] M. A. Riolo, & M. E. J. Newman, “First-principles mul- tiway spectral partitioning of graphs,” arXiv: 1209.5969 1209.5969 1209.5969, (2012). [4] J.-G. Young et al., “Unveiling hidden communities through cascading detection on network structures,” arXiv: 1211.1364 1211.1364 1211.1364, (2012). To appear in Springer’s Lecture Notes in Com- puter Science. Internal links? Internal links are defined as links that join two nodes belonging to the same community. ?

Transcript of Local and global solutions to community detection - when ...

Page 1: Local and global solutions to community detection - when ...

Local and global solutions to community detection- when resolution matters -

J.-G. Young, L. Hebert Dufresne, A. Allard and L.J. Dube

Departement de physique, de genie physique, et d’optique, Universite Laval, Quebec, Canada

MotivationsCommunity structure occurs at all scales, but an effective limit ofresolution arises in commonly used community detection algorithms.This problem manifests itself in two distinct ways:

1. relevant structures are merged with neighboring communities.2. small communities may be entirely overlooked.

New improved detections methods must address these limits for a fullerunderstanding of the mesoscopic structure of real systems.

Our goals are to solve these two problems by

1. combining global methods to a local measure of quality.2. using a global solution, based on the sequential detection of com-

munities.

Resolution problemsResolution-limit-free methods yield partitions that do not changewhen applied to any sub-partitions of the network [1]. Weillustrate two types of widespread problems that leads to a limit ofresolution by studying two popular detection methods.

Modularity based algorithms: MergingThese algorithms seek to optimize the modularity quality function

Q =∑N

i,j

[Aij − kikj/m

]δ(γi, γj) ≡

∑N

i,jBij δ(si, sj)

over the set of all possible nodes partitions. This objective functionidentifies groups of nodes that share more links than expected from theConfiguration Model.

N ≡ number of nodes in the network

Aij ≡ adjacency matrix element

m ≡ number of links in the network

ki ≡ degree of node i

δ(si, sj) ≡ 1 if both nodes are in the same group and 0 otherwise

Resolution problem: Modularity has an intrinsic scale relatedto the size of the network. There exists many cases where subgraphswill be further divided if fed back in the algorithm, i.e. algorithms basedon modularity or similar measures merge small communities.

Link clustering algorithm: ShadowingAhn et al. link clustering algorithm (LCA) [2] aggregates links intocommunities based on the similarity

S(eik, ejk

)=|n+(i)

⋂n+(j)|

|n+(i)⋃n+(j)|

of their respective neighborhoods, where eij denotes the link betweennodes i and j, and where n+(i), n+(j) are their neighborhoods (centralnodes i and j included). Communities are built by iteratively groupingadjacent links whose similarity exceeds a given threshold Sc, chosen tomaximize the density

D =2

M

∑c

mc − (nc − 1)

(nc − 1)(nc − 2)(∗)

Resolution problem: Links in the vicinity of dense communitiesexhibit vanishing similarities, and only Sc → 0 will allow theirdetection. Algorithms that rely on a global parameter and that allowoverlapping communities are all vulnerable to this shadowing effect.

Partitioning as a detection toolThe resolution limit of quality functions that feature an implicit scalecan be circumvented by forcing a partition that contains morecommunities than the natural optimum. We take advantage of thefact that partitioning methods require the number of modules as aninput. We illustrate our method using modularity optimization.

Spectral partitioning based on modularity

It is possible to rewrite the modularity of a partition in q modulesas a matrix equation

Q ∝ Tr(STBS),

where S is a N × (q− 1) matrix of simplex vector sT chosen froma set of q vectors. Each node is associated to a row of the matrix,and a choice of vectors corresponds to a choice of partition.

An approximate optimum of this expression can be found using aeigenvector based partition algorithm [3].

Meta-algorithm I:

For a large range of number groups q:

1. obtain the optimal modularity partition.

2. evaluate a local measure of quality for this partition.

A peak in the local quality measure corresponds to the resolution-limit free partition. For sparse networks (most real networksqualify), the worst-case scenario has a O(q2

maxN) running time.

Selection of the optimal partition

0.00

0.25

0.50

0.75

1.00

0 5 10 15 20 25

DensityModularity

=24, D≈0.909

=12Q≈0.871

q

q

q

A resolution limit free quality mea-sure is necessary to select the opti-mal number of modules q. We pro-pose the average link density ofthe partition (see eq. ∗), since itis local to each community and verysensitive to small changes in the par-tition. The 2 measures are picturedhere for the standard ring configu-ration (24 fully connected cliques of5 nodes connect by a minimal num-ber of links). As expected, the maximum in density leads to thetrue partition (q = 24, D ≈ 0.909, Q ≈ 0.867), while modularity ismaximal for pair-wise grouping of the cliques (q = 12, D ≈ 0.318,Q ≈ 0.871).

Benchmark resultsDensity Modularity

0.95

1.00

0.00 0.20 0.40 0.60µ

0.95

1.00

0.00 0.20 0.40 0.60µ

0.900.951.00

0.00 0.20 0.40 0.60µ

k =25

k =20

k =15

<<

<<

<<

Qua

lity

(avg

. nor

m, m

ut. i

nfo.

)We have applied our algorithmto non-overlapping Lancichi-netti-Fortunato-Radicci bench-mark networks of N = 1000nodes with a heterogeneouscommunity sizes distribution(power-law). We find thatour algorithm outperformsthe standard modularity basedapproaches. In other words,we identify the best partitionmore consistently and areable to do so over a widerrange of mixing parameter µ (the ratio of external edges tothe total number of edges per community).

Cascading detectionWe have observed that the inability to detect small/sparse commu-nities in the vicinity of larger/denser ones could be circumvented byremoving the troublesome structures from the networks. Wepropose a cascading approach to community detection that addressthis shadowing effect.

Meta-algorithm II:Using any given community detection algorithm:

1. identify large or dense communities.

2. remove the internal links of the communities identifiedin step 1.

3. repeat until no new significant communities are found.

The first iteration targets communities that are normally detectedby the algorithm of choice. This ensures that the cascading approachretains the main features of the community structure.After removing links involved in the detected communities, a newiteration of the detection algorithm is then performed on a sparsernetwork in which previously hidden communities are nowapparent.

This repeated analysis does not increase the computational cost sig-nificantly, since very few iterations are usually necessary, and sincethe network becomes sparser at each step.

0102030405060708090

arXivEmailGnutellaInternetPGPPowerProteinW

ords

Una

ssig

ned

edge

s(%

)

Final state

No cascading1 iteration2 iterations...

ResultsTo investigate the efficiency andthe behavior of cascading detec-tion, we have applied our ap-proach to 8 real networks us-ing the LCA. Our results showthat cascading detection alwaysimproves the thoroughness of thecommunity structure detection(the average percentage of unas-signed links drops from 41.0% to5.3%).

100

101

102

103

2 4 6 8 10 12 14 16

Coun

t

Community size (nodes)

4th iteration

No cascading

3rd iteration

1st iteration2nd iteration

WordAssociation

We find meaningful communi-ties at multiple layers of or-ganizations. Pictured here is thedistribution of the community sizesfor each iteration. It suggests thatcommunities unveiled through cas-cading detection are similar to theones detected at the first iteration(i.e. traditional use of the detectionalgorithm).

Bar Smoky

Club

Visual inspection of the detectedcommunities confirms the qualityof the hidden communities, aswell as our intuition of the shadow-ing effect. For example, this trianglewas detected at the third iteration ofthe algorithm on the Words network.This structure was overlooked duringthe initial detection due to the highdegree of its three nodes.

ConclusionWe have introduced two meta-algorithms for community de-tection that address two types of resolution problems.

Partitioning as a detection tool

� By analyzing multiple partitions of the same network usinga local quality function, we find resolution limit freecommunity structures where modularity wouldhave merged groups.

� Since optimized partitioning algorithms scale nicely withthe size of the system, our approach is applicable to largenetworks.

Cascading detection

� We recover multiple levels of meaningful organization, discoveringhidden communities along the way.

� Our meta-algorithm is not significantly slower than traditional ones,permitting cascading detection on large networks.

Future directions� Extension of the partitioning method to all Potts models based

algorithm, and to line-graph partitioning.

� Improved cascading detection:

� Sequential detection (i.e. detect communities one by one)

�A less destructive approach to link removal (e.g. avoid the re-moval of important links that are shared by communities atlower layers of organization)

Acknowledgements

[1] V. A. Traag et al., “Narrow scope for resolution-limit-freecommunity detection,” Phys. Rev. E, 848484, 016114, (2011).

[2] Y.-Y. Ahn et al., “Link communities reveal multiscaleComplexity in Networks,” Nature, 466466466, 761, (2010).

[3] M. A. Riolo, & M. E. J. Newman, “First-principles mul-tiway spectral partitioning of graphs,” arXiv:1209.59691209.59691209.5969,(2012).

[4] J.-G. Young et al., “Unveiling hidden communities throughcascading detection on network structures,” arXiv:1211.13641211.13641211.1364,(2012). To appear in Springer’s Lecture Notes in Com-puter Science.

Internal links?

Internal links are

defined as links that

join two nodes

belonging to the same

community.

?