A Clustering Coeficient for Weighted Complex Networks

9
1 A Clustering Coecient for Weighted Networks, with Application to Gene Expression Data Gabriela Kalna a,and Desmond J. Higham a a Department of Mathematics, University of Strathclyde, Glasgow G1 1XH, UK. E-mail:  {ra.gkal,djh }@maths.strath.ac.uk The cluster ing coecient has been used successf ully to summarise important features of unweighted, undi- rec ted net wor ks acro ss a wide range of appli cati ons in complexity science. Recently, a number of authors hav e exte nded this concept to the case of net wo rks with non-negatively weighted edges. After reviewing various alternatives, we focus on a denition due to Zhang and Horvath that can be traced back to ear- lier work of Grindrod. We give a natural and trans- parent derivation of this clustering coecient and then analyse its properties. One attraction of this version is that it deals directly with weighted edges and avoids the need to discretise, that is, to round weights up to 1 or down to 0. This has the advantages of (a) retain- ing all edge weight information, and (b) eliminati ng the requirement for an arbitrary cutolevel. Further, the extende d denition is muc h less likely to break down due to a ‘divide-by-zero’. Using our new deriva- tion and focusing on some special cases allows us to gain insights into the typical behaviour of this mea- sure . We then illustrate the idea by computin g the generalised clustering coecients, along with the cor- responding weighted degrees, for pairwise correlation gene expression data arising from microarray experi- ments. We nd that the weighted clustering and de- gree distributions reveal global topological dierences between normal and tumour networks. Keywo rds: bioinformatics, computational graph the- ory, microarray data, network topology, range depen- * Corresponding author: Gabriela Kalna, Department of Mathematics, University of Strathclyde, Glasgow G1 1XH, UK. dent random graph, small world network, Grindrod- Zhang-Horvath clustering coecient. 1. Int roduct ion Many complex data sets have natural represen- tations as net wo rks. It is acc ept ed tha t typical real- life networks are neither random graphs in the classical Erd¨os-R´enyi sens e n or regular lattices [17,28]. Hence, scientists across a wide range of dis- ciplines face the tasks of summarising, comparing, categorising and modelling these data sets in or- der to extract meaning and order. Various com- putational quanti ties have been used to char ac- terise networks; most prominently the concepts of pathlength ,  degree  and  clustering coecient  have proved extremely useful. Watts and Strogatz [28] coined the phrase  small world network  to describe the commonly occurring situation where a sparse network is highly clus- tered (like a regular lattice) yet has small path- lengths (like a random graph). Since tha t land- mark paper, many complex networks have been analysed and labelled as small worlds. Similarly, the so-called  scale–free  propert y of the degree dis- tribution [2, 17] , has become acc ept ed as a hal l- mark of many real data sets, although there is now some doubt as to its true prev alenc e [14,1 9]. Both the small world and scale–free properties have been widely studied for unweighted, or bi- nary, undirected networks. In the case of more gen- eral weighted edges it is of course possible to cre- ate a binary network by normalising, imposing a cutoand rounding to 0 and 1 [21]. However, it is our tenet that the original weights should be re- spected where possible. While the concept of de- gree extends readily from unweighted to weighted netwo rks , thi s is not tru e for the clustering co- AI Communications ISSN 0921-7126, IOS Press. All rights reserved

Transcript of A Clustering Coeficient for Weighted Complex Networks

Page 1: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 1/9

1

A Clustering Coefficient for Weighted

Networks, with Application to GeneExpression Data

Gabriela Kalna a , and Desmond J. Higham a

a Department of Mathematics,University of Strathclyde,Glasgow G1 1XH, UK.E-mail: {ra.gkal,djh }@maths.strath.ac.uk

The clustering coefficient has been used successfullyto summarise important features of unweighted, undi-rected networks across a wide range of applicationsin complexity science. Recently, a number of authorshave extended this concept to the case of networkswith non-negatively weighted edges. After reviewingvarious alternatives, we focus on a denition due toZhang and Horvath that can be traced back to ear-lier work of Grindrod. We give a natural and trans-parent derivation of this clustering coefficient and thenanalyse its properties. One attraction of this version isthat it deals directly with weighted edges and avoidsthe need to discretise, that is, to round weights up to1 or down to 0. This has the advantages of (a) retain-ing all edge weight information, and (b) eliminatingthe requirement for an arbitrary cutoff level. Further,the extended denition is much less likely to breakdown due to a ‘divide-by-zero’. Using our new deriva-tion and focusing on some special cases allows us togain insights into the typical behaviour of this mea-sure. We then illustrate the idea by computing thegeneralised clustering coefficients, along with the cor-responding weighted degrees, for pairwise correlationgene expression data arising from microarray experi-ments. We nd that the weighted clustering and de-gree distributions reveal global topological differences

between normal and tumour networks.Keywords: bioinformatics, computational graph the-ory, microarray data, network topology, range depen-

*

Corresponding author: Gabriela Kalna, Department of Mathematics, University of Strathclyde, Glasgow G1 1XH,UK.

dent random graph, small world network, Grindrod-Zhang-Horvath clustering coefficient.

1. Introduction

Many complex data sets have natural represen-tations as networks. It is accepted that typicalreal-life networks are neither random graphs inthe classical Erd¨os-Renyi sense nor regular lattices[17,28]. Hence, scientists across a wide range of dis-ciplines face the tasks of summarising, comparing,categorising and modelling these data sets in or-der to extract meaning and order. Various com-putational quantities have been used to charac-terise networks; most prominently the concepts of pathlength , degree and clustering coefficient haveproved extremely useful.

Watts and Strogatz [28] coined the phrase small

world network to describe the commonly occurringsituation where a sparse network is highly clus-tered (like a regular lattice) yet has small path-lengths (like a random graph). Since that land-mark paper, many complex networks have beenanalysed and labelled as small worlds. Similarly,the so-called scale–free property of the degree dis-tribution [2,17], has become accepted as a hall-mark of many real data sets, although there is nowsome doubt as to its true prevalence [14,19].

Both the small world and scale–free propertieshave been widely studied for unweighted, or bi-nary, undirected networks. In the case of more gen-

eral weighted edges it is of course possible to cre-ate a binary network by normalising, imposing acutoff and rounding to 0 and 1 [21]. However, it isour tenet that the original weights should be re-spected where possible. While the concept of de-gree extends readily from unweighted to weightednetworks, this is not true for the clustering co-

AI CommunicationsISSN 0921-7126, IOS Press. All rights reserved

Page 2: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 2/9

2 G. Kalna et al. / A Clustering Coefficient for Weighted Networks

efficient. A number of authors have recently at-tempted to generalise the clustering coefficient tothe case of weighted edges [3,16,18,29], producinga range of possible denitions.

We present here a natural and transparentderivation of a clustering coefficient for weightedgraphs. The resulting denition coincides withthose in [8,29] and hence we argue for the useof this Grindrod-Zhang-Horvath clustering coeffi-cient as a generalised measure of clustering. We be-lieve that this measure, along with the correspond-ing weighted degree distribution, gives an informa-tive high-level picture that can be used for classify-ing, comparing and modelling weighted networks, just as in the unweighted case. We give some ana-lytical insight into the usefulness of this clusteringcoefficient by studying the case of range-dependentweights, which is relevant in the biological context.Then we test the denition on an important typeof data arising in computational cell biology: geneexpression microarray data. Many recent methodsfor microarray data analysis monitor differences inthe expression of genes under various experimen-tal conditions: normal/tumour [5], multiclass can-cers [7,20], treatment/survival [23]. Pair-wise geneexpression correlation has long been used to pre-dict relationships between genes. Recently, geneco-expression networks have emerged [27,29] con-necting genes with high correlation. However, de-spite the fact that genome-wide gene expressiondata sets are available, their full potential is of-

ten not used and information from only a sub-set of genes, usually with highest variation, is ex-tracted. Hence, we view these weighted networksas ideal candidates on which to apply the new clus-tering coefficient framework. Using available mi-croarray data we construct two distinct gene co-expression networks that represent normal and tu-mour states. We examine weighted clustering coef-cients and weighted degree distributions of thesenetworks with the aim of nding tumour-relateddifferences. We emphasize that our aim is to char-acterize overall network topology rather than tocategorize individual genes or samples.

The rest of this article is organised as follows.In section 2 we start with the binary denition of clustering coefficient and list some generalisationsthat have been proposed for weighted networks. Insection 3 we give a natural derivation that leads tothe Grindrod-Zhang-Horvath denition, and showhow this can be easily computed via matrix prod-

ucts. We then use some simple examples to ex-plore the properties of this coefficient. In section4 we give some realistic computations on pairwisecorrelation networks arising from microarray data.

2. Clustering Coefficient and its Generalisations

Consider an undirected graph with normalisedweights 0 ≤ wij ≤ 1 between nodes i and j . In thebinary case wij { 0, 1} the clustering coefficient,or curvature, for node k is dened as

clust( k) := t

v(v − 1)/ 2, (1)

where v is the number of immediate neighbours of node k, and t is the number of triangles incident tonode k [21,28]. In words, clust(k) answers the ques-tion “given two nodes that are both connected to

node k, what is the likelihood that these two nodesare connected to each other?” It is straightforwardto see that the denition breaks down when v < 2,that is, node k has less than two immediate neigh-bours, and otherwise 0 ≤ clust( k) ≤ 1.

Recently, a few different extensions of the clus-tering coefficient to the general weighted case haveemerged. In [16] the weighted clustering coefficientfor node k is dened as

wclust LF (k) := i= j N (k ) wij

v(v − 1) ,

where the term i= j N (k ) wij can be seen as the

total weight of relationship in the neighbourhoodN (k) of node k.

Barrat et al.[3] introduced a measure of cluster-ing that combines topological information with theweight distribution of the network

wclust B ( k) :=

1s(v − 1)

i,j

(wki + wkj )2

aik akj a ij .

Here s = j wkj denotes the weighted degree of node k and a ij is an element of the underlying bi-nary adjacency matrix. The normalisation factor

s(v − 1) ensures that 0 ≤ wclust B (k) ≤ 1. Thisdenition of weighted clustering coefficient consid-ers only weights of edges adjacent to node k butnot the weights of edges between neighbours of thenode k.

Onnela et al.[18] took into account weights of alledges: adjacent to node k and between-neighbours.

Page 3: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 3/9

G. Kalna et al. / A Clustering Coefficient for Weighted Networks 3

They considered weights 0 ≤ wij ≤ 1 and replacedthe number of triangles t in (1) with the sum of triangle intensities

wclust O (k) :=2 i,j (wik wkj wij )1/ 3

v(v − 1) .

We remark that the three clustering coefficientdenitions above suffer from the drawback thatthey require an underlying binary network; if thisis not available as a separate set of data, then pre-sumably it must be obtained by discretizing theweighted edges. Hence, as in the case where theoriginal binary denition is used for weighted net-works [21], they are dependent upon some thresh-olding parameter. Further, they break down in thecase where the number of binary neighbours, ν , isless than 2.

A denition that uses only the network weights

was proposed by Zhang and Horvath [29]wclust HZ ( k) :=

i= k j = i,j = k wki wij wjk

( i= k wki )2 − i= k w2ki

. (2)

The numerator in (2) was obtained by nding alower bound for the denominator, this ensuringthat wclust HZ is in the range [0, 1].

We also mention that rather than one clusteringcoefficient per node, Schank and Wagner [22] pre-sented a single weighted clustering coefficient forthe whole network as

wclust S := 1v w(v) v

w(v)c(v).

Here c(v) is a clustering coefficient for node v andw(v) a weight function. One possible choice of weight function is the weighted degree.

3. Weighted Clustering

3.1. Denition and Properties

Some simple algebra allows the binary clusteringcoefficient (1) to be rewritten as

clust( k) :=v− 1i=1

vj = i+1 1 × 1 × a ij

v− 1i =1

vj = i+1 1 × 1

, (3)

where a ij = 1 if the pairs of neighbours i and j of the node k are connected and a ij = 0 otherwise.

Consider now an undirected weighted networkof M nodes that is fully connected with weights0 ≤ wij = wji ≤ 1 between nodes i and j andwii = 0. Then formula (3) directly extends to thereal value case

wclust( k) :=M i =1

M j =1 wki wkj wij

M i=1

M j =1 ,j = i wki wkj

(4)

giving a natural denition for weighted networks.We also mention that the same formula was usedin [8] in the context where wij represents the prob-ability of an edge between nodes i and j in a ran-dom network model. Closer inspection shows thatthe formula (4) has a simple interpretation that isanalogous to that of the binary case:

12

M

i =1

M

j =1

wki wkj wij

is a reasonable measure of the “strength” of trian-gles that involve node k and

12

M

i =1

M

j =1 j = i

wki wkj

represents the “strength” of the pairs of neigh-bours that involve node k. Here, “strength” isbased on geometric (rather than arithmetic) aver-aging of the relevant weights. It is easy to verifythat (4) retains the property 0 ≤ wclust( k) ≤ 1.It is also trivial to show that (4) approaches the

binary value (1) if wij { 0, 1} are considered.Computationally, note that the numerator of (4)

isM

i=1

wki

M

j =1

wkj wij =M

i =1

wki (W 2)ki

= ( W 3 )kk

and the denominator isM

i=1

M

j =1

wki wkj −M

i=1

w2ki =

(eT wk )2 − || wk || 22 .

Here, (W p)ij denotes the ( i, j ) element of the pthpower of W R M × M , wk R M denotes the kthrow (or column) of W and e R M denotes thevector with all elements equal to one. Hence, aneater representation of (4) is

Page 4: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 4/9

4 G. Kalna et al. / A Clustering Coefficient for Weighted Networks

wclust( k) = (W 3)kk

(eT wk )2 − || wk || 22

, (5)

which shows that the weighted clustering coeffi-cient can be computed across all nodes in O(M 3)operations. The formula (5) also makes it clear

that (4) is entirely equivalent to the Zhang-Horvath denition (2).Having derived this denition from what we be-

lieve to be a natural and informative viewpoint, wenow attempt to gain further insights by focussingon particular types of weighted network.

3.2. Limit Forms of Clustering Coefficient

We now zoom to a particular node K of a graphand explore its weighted clustering coefficient (4)in specic cases. Starting with a binary networkwij { 0, 1} we replace zero weights with a smallweight 0 < 1 (weak connections) and re-place unit weights with 1 − (strong connections).Thus, we relaxed the binary network to a weightedfully connected graph. This is an opposite trans-formation to a binarisation of a weighted network,when real valued weights below a threshold aremapped to zero and weights above the thresholdare mapped to one.

(A) In the rst case, let node K have m > 1strong and n > 1 weak connections to othernodes in the graph. Then there are (a) m(m −1)/ 2 strong-strong, (b) mn strong-weak and (c)

n(n − 1)/ 2 weak-weak neighbour pairs. Let therebe r, s and u strong edges between neighboursin cases (a), (b) and (c) respectively. It is easyto show that equation (4), for → 0, results inwclust( K ) = 2 r/m (m − 1). In words, r strong tri-angles are built over m(m − 1)/ 2 strong neighbourpairs. Thus the weighted clustering coefficient (4)approaches the binary value (1).

(B) In the second case we consider the marginalsetting v = 1: node K has a strong connection,1− , only to one node P and n weak, , connectionsto all other nodes in a complete graph. Then n outof all possible neighbour pairs involve the strong

edge between nodes K and P and n(n − 1)/ 2 pairsare formed by n weak edges adjusted to node K .Between-neighbour edges will be again strong orweak. Let there be r strong edges with one endin node P and s strong edges between “weakly”connected neighbour nodes of node K . Then from(4) we get wclust( K ) = ( r (1 − )2 + ( n − r ) 2(1 −

) + s 2(1 − ) + ( n(n − 1)/ 2 − s) 3)/ (n (1 − ) +n(n − 1) 2 / 2) . This expression results in r /n for

→ 0. In words, we can get to r out of n “weakly”connected neighbours of the node K through thestrong edge KP and strong edges connecting nodeP with these r nodes. It is clear that wclust( k) = 1only if r = n, that means there is a strong edgebetween P and all nodes weakly connected to K .Because wclust( k) = 0 if r = 0, the strong edgebetween nodes K and P is the only edge involvingnode P . That means this edge would be separatedfrom the graph in the corresponding discretisednetwork.

Case B reveals an important advantage of thegeneralised denition (4). It continues to provideuseful information in the small regime where any discretization process based on thresholding to a bi-nary network would result in v = 1 and hence an undened clustering coefficient in (1) .

3.3. Uniform Weights

Another case where the clustering coefficientsimplies arises when node K has equal weightswith all other nodes: wKj = w = constant for all j = K . In this case we have

wclust( K ) =w2 M

i=1M j =1 wij

w2 M i=1 M j =1 ,j = i 1

=M − 2i=1

M − 1j =1 wij

(M − 2)(M − 1)

and we see that wclust( K ) then reects the averageconnectivity between the other nodes in the net-work. In particular, if all between-neighbour con-nections are equal, i.e. wij = w = constant for alli = j = K , (4) results in

wclust( K ) =w M

i =1M j =1 ,j = i wki wkj

M

i=1

M

j =1 ,j = i w

kiw

kj

= w.

Generally, the weighted clustering coefficient canbe bounded in terms of the extremal network co-efficients,

mini= j = K

(wij ) ≤ wclust( K ) ≤ maxi = j = K

(wij ).

Page 5: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 5/9

G. Kalna et al. / A Clustering Coefficient for Weighted Networks 5

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

λ

c e n

t r a

l c

l u s

t e r i n

g c

o e

f f i c i e

n t

M=50M=100M=200

Fig. 1. Clustering coefficient (4) of the central node in theweighted graph dened by (6).

3.4. Range Dependent Weights

The concept of a range-dependent weighted ran-dom graph , or RENGA, was introduced and an-alyzed by Grindrod [8]. These graphs were fur-ther studied by [10] and were used in [13] to pro-duce test matrices for linear algebra software. Theidea of using a distance metric to induce edgeshas proved successful in the modelling of real net-works, especially in biology [19]. We may adaptthe RENGA idea to the case of non-random range-dependent weights. Suppose that the nodes are or-dered − M , . . . , − 1, 0, 1, . . . , M and that the con-nectivity weight decays as a function of lattice dis-tance. To be specic, we let

wij = wji = λ | i− j | , (6)

for some λ [0, 1]. At one extreme, λ ≈ 0, thereare no edges after discretising to a binary net-work, and hence the traditional clustering coeffi-cient is undened. At the other extreme, λ = 1,all edges are present after discretising to a binarynetwork, and hence the traditional clustering co-efficient is 1 for each node. In Figure 1 we use net-works of size M = 50 , 100, 200 and compute the

generalised clustering coefficient (4) for the centralnode, k = 0, as λ ranges from 0 to 1. Note thatthe denition (4) makes sense for any λ > 0. Wesee that the clustering coefficient approaches thevalue zero as λ approaches zero from above; thisis perfectly reasonable behaviour. Further, as λ in-creases away from zero, the clustering coefficient

−10 0 10

0.02

0.04

0.06

0.08

0.1

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

λ = 0.1

−10 0 10

0.15

0.2

0.25

0.3

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

λ = 0.4

−10 0 10

0.3

0.35

0.4

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

λ = 0.7

−10 0 10

0.9926

0.9928

0.993

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

λ = 0.999

Fig. 2. Clustering coefficient of 21 nodes in the weightedgraph dened by (6) for different values of λ .

monotonically increases, and it matches the binaryvalue of 1 at λ = 1.

Figure 2 shows the clustering coefficient of allnodes in the case M = 10 for λ = 0 .1, 0.4, 0.7and 0 .999. For λ = 0 .1 and λ = 0 .4 the clusteringcoefficient is not monotonic in the nodal distancefrom the boundary, and there is a dramatic differ-ence between the neighbouring nodes at positionsM − 1 and M . The pictures suggest that there isno M → ∞ “continuum limit” in the sense thatthe clustering coefficients do not appear to lie ona smooth curve. Because of the special structureof the weights, we are able to investigate this issueanalytically.

Some straightforward analysis produces the for-mulas

wclust( M ) = λ k kλ2( k− 1)

(1 + λ) k kλ2( k− 1)

≈ λ

(1 + λ),

wclust ( M − 1) =

k=1 (k + 1) λ2k

1 + (1 + λ) k (k + 1) λ2k− 1 ,

wclust ( M − 2) =

k (k + 2) λ2k

1 + (1 + λ) 1 + k (k + 2) λ2k− 1 ,

Page 6: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 6/9

6 G. Kalna et al. / A Clustering Coefficient for Weighted Networks

0 0.2 0.4 0.6 0.8 10

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

λ

c l u

s t e

r i n

g c

o e

f f i c i e

n t

M

M − 1

M − 2

central

Fig. 3. Clustering coefficient of the three marginal nodesand the central node 0 in the innite ( M → ∞ ) weightedgraph dened by (6).

wclust ( 0) =

3λ2 k kλ2( k− 1)

1 + k=1 λ2k− 1 4k + (4 k + 1) λ.

By summing appropriate geometric series we get,for xed 0 ≤ λ < 1, the following formulas andcorresponding ranges of values for the weightedclustering coefficients in the limit M → ∞

wclust( M ) = λ1 + λ

,

wclust( M ) [0, 1/ 2],

wclust( M − 1) = λ2(2 − λ2)

(1 + λ)(1 + λ − λ2),

wclust( M − 1) [0, 1/ 2],

wclust ( M − 2) =

λ2(3 − 2λ2)(1 + λ)(1 + 3 λ − 2λ2 − 2λ3 + λ4)

,

wclust( M − 2) [0, 1/ 2],

wclust(0) = λ2 (2 + λ2 )

(1 + λ2)(1 + λ)2 ,

wclust(0) [0, 3/ 8].

0 10 200

0.1

0.2

0.3

0.4

k

λ k

−10 0 10

0.2

0.25

0.3

0.35

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

−10 0 100.4

0.6

0.8

1

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

−10 0 10

0.7

0.8

0.9

1

nodes

c l u

s t e

r i n

g c

o e

f f i c i e

n t

t = 0.02weighted

t = 0.1binary

t = 0.001binary

Fig. 4. Clustering coefficient: thresholding to weighted andbinary networks.

The limit for the endpoint node, wclust( M ), isexceptional in that, as a function of λ, the expres-sion has a tangent of slope = 1 at the origin. Theother cases have tangents of slope = 0. We alsonote that letting λ → 1 in these expressions doesnot reproduce the clustering coefficient value thatarises from inserting λ = 1 into (6). In Figure 3 weillustrate these effects.

Binarising these networks produces very differ-ent behaviour. Figure 4 looks at the the caseM = 10 and λ = 0 .4 (cf. the upper right picture inFigure 2). The range of weights appearing in thenetwork, {λk }, are shown for reference in the up-

per left picture. The lower left and right picturesshow the clustering coefficients when weights be-low t are mapped to zero and weights above t aremapped to one, for t = 0 .1 and t = 0 .001. Theupper right picture, which more closely resemblesthe fully weighted version, corresponds to zeroingweights below t and leaving weights above t un-changed.

Overall, in analysing the interesting, nontrivial,range-dependent case (6), we see again that re-specting the real valued weights can produce a sig-nicantly different and more meaningful picture,compared with discretising.

4. Microarray Illustration

We now examine the distribution of the cluster-ing coefficient (4) in practice, along with that of the corresponding weighted degree, using pairwise

Page 7: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 7/9

G. Kalna et al. / A Clustering Coefficient for Weighted Networks 7

1000 1500 20 00 25000

0.05

0.1

P ( w e

i g h t e d d e g r e e

)

0.14 0.16 0.180

0.1

0.2

P ( w

. c l u s

t . c o e

f f . )

1000 1200 1400 1600 18000

0.05

0.1

P ( w e

i g h t . d e

g r e e

)

0.26 0.28 0.3 0.320

0.05

0.1

P ( w

. c l u s

t . c o e

f f . )

600 800 1000 1200 14000

0.05

0.1

weighted degree

P ( w e

i g h t . d e g r e e

)

0 .18 0 .2 0 .22 0 .24 0 .260

0.05

0.1

0.15

weight.clust.coeff.

P ( w

. c l u s

t . c o e

f f . )

Fig. 5. Probability of weighted degree (left) and clusteringcoefficient (right). Liver cancer (top), breast cancer (mid-dle) and lymphoma (bottom): normal (circles) and tumour

(stars).

correlation networks arising from cDNA microar-ray data. Most importantly, we would like to ex-plore differences in character of weighted degreeand clustering coefficient distributions of two dif-ferent networks: normal and tumour.

cDNA microarrays are a genome-scale technol-ogy used for assessing differential gene expression[4,24]. Two different samples labeled with differentuorescent dyes hybridize on the same array. Onesample is prepared from a reference mRNA and theother from mRNA isolated from the experimentalcells. The common reference, or universal control,is collected from a pool of cell lines or a mix of all analysed samples [26,9]. The initial data arisingfrom cDNA microarray experiments are relativemRNA levels – experiment to control ratios. Thedata from different arrays require complex nor-malisation prior to comparison. Normalised valuesare usually organised in the form of a rectangu-lar M × N matrix of log-transformed ratios a ij of i = 1 , . . . , M genes in a set of j = 1 , . . . , N sam-ples.

To build weighted networks for further analy-sis of microarray data a suitable similarity mea-sure needs to be chosen which produces real valueresults and has a reasonable biological interpreta-tion. We consider the Pearson correlation

cor(i, j ) =N k=1 (a ik − µi )(a jk − µj )

σi σj,

where µi and σi are respectively the mean andthe standard deviation of gene i log-ratios, as a

20 40 60 800

0.1

0.2

P ( w e

i g h t e d d e g r e e

)

0.2 0.4 0.6 0.80

0.02

0.04

P ( w

. c l u s

t . c o e

f f . )

500 1000 15000

0.05

0.1

P ( w e

i g h t . d e

g r e e

)

0.2 0.3 0.4 0.50

0.05

0.1

P ( w

. c l u s

t . c o e

f f . )

600 800 1000 1200 14000

0.05

0.1

weighted degree

P ( w e

i g h t . d e g r e e

)

0 .18 0 .2 0 .22 0 .24 0 .260

0.05

0.1

0.15

weight.clust.coeff.

P ( w

. c l u s

t . c o e

f f . )

Fig. 6. Probability of weighted degree (left) and clusteringcoefficient (right). Lymphoma: normal (circles) and tumour(stars). Binary network from threshold = 0 .8 (top), thresh-

old P value 0.05 (middle) and weighted network (bottom).

measure of similarity between the gene expressionproles. This measure, or similar correlation mea-sures, proved useful for analysing microarray datain [12,15,21,23,29]. We dene pairwise gene simi-larity weights wij = |cor(i, j )|, for 1 ≤ i, j ≤ M ,with wij = wji [0, 1] and wii = 0. A large weightwij indicates that genes i and j are highly co-expressed (or anti-expressed). In this representa-tion the M × M matrix W denotes the symmetricweight matrix encoding the strength of connectionbetween pairs of genes.

Aware of the fact that different numbers of genesas well as samples in data sets can affect val-ues of correlation and consequently distort com-parisons of both weighted degrees and clusteringcoefficients, we looked for data consisting of thesame number of normal and tumour samples forthe same set of genes. In this experiment we usedcDNA microarray data for normal and tumour tis-sues from [6]. Data processing performed by thoseauthors included ltering of genes with more than70% missing values or less than 4 observations,UniGene mapping, and imputation of missing val-ues. The original data can be downloaded from the

Stanford Microarray Database.We selected data sets with more than ten sam-

ples in normal and tumour subsets. We present re-sults for liver cancer (12065 genes, 76 samples),breast cancer (5603 genes, 13 samples), and lym-phoma (4615 genes, 31 samples) [5,25,1]. Figure 5shows the distribution of the weighted cluster-

Page 8: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 8/9

8 G. Kalna et al. / A Clustering Coefficient for Weighted Networks

ing coefficient (right), and also the distribution of the weighted degree (left) arising from these data.Circle-line and star-line represent the distributionsof normal and tumour networks respectively.

We emphasize that our aim is to study the ‘big-picture’ issue of overall network topology, as op-posed to the ‘ne-detail’ issue of clustering in-dividual genes and/or samples [11,15]. Figure 5reveals global topological differences between thetwo network types. We have performed the Kol-mogorov Smirnov test (kstest2 in MATLAB) onthe weighted degree and weighted clustering co-efficient values of normal and tumour networks.The test conrmed that normal and tumour dis-tributions are highly signicantly different ( P <0.001). In general the tumour samples give riseto smaller and more peakily distributed cluster-ing and degree. Degree ranges of normals and tu-mours start from a similar value but the degreerange of tumours is narrower. Large numbers of genes in normal samples show a high degree of connection to other genes. Differences in cluster-ing coefficient distributions are more striking. Dis-tribution ranges of normal and tumour networksonly partly overlap: most genes in normal networkshave higher clustering coefficient than any gene intumour networks. One possible biological interpre-tation of these results is that the diseased state isconsistent with a breakdown of the control mecha-nisms that regulate the co-expression of function-ally related genes. Similar results, supported byKolmogorov Smirnov test ( P < 0.001), arose whenthe matrix |A| was used instead of A. In this casecorrelation between genes is based on overall activ-ity, with over and under expression both regardedas equally valid evidence of a gene leaving its typ-ical state.

Given that the weighted clustering coefficientproduces interesting results, it is pertinent to askwhether careful thresholding to a discretised bi-nary network [21] can also reproduce these nd-ings. Clearly there is a whole parameterized fam-ily of such binary networks. In particular, highthresholding may exclude interesting features of the networks. For example, when weights abovethe threshold of 0 .8 are re-set to 1 and the remain-ing weights are re-set to zero, the clustering coeffi-cient and weighted degree distributions could notreveal the differences observed from original net-works; see Figure 6 top.

For a more systematic approach, P values maybe used to decide on signicance of correlation.

Even in this case, however, somewhat arbitrarythresholds must be imposed. For the lymphomanetworks, suppose we take the view that corre-lations ≥ 0.355 are signicant (corresponding toP ≤ 0.05) and correlations ≥ 0.456 are highly sig-nicant (corresponding to P ≤ 0.01). This wouldmean that only 18% (12%) of all possible edgesare signicant and 8% ( < 5%) are highly signi-cant in the normal (tumour) lymphoma network,so that a large amount of data is being discarded.(Of course, there are computational benets fromintroducing sparsity, but for the network sizes inthese experiments this is not a signicant issue.)In the middle of Figure 6 we plot data for theP ≤ 0.05 binary networks. Comparing this re-sult with the resalt from the original weightednetwork (bottom part of Figure 6), we see thatvery similar topology is revealed. We conclude that

the weighted clustering coefficient approach, whichdoes not require the use of arbitrary parametersand does not discard data, automatically producesresults consistent with those arrived at throughexperimenting with the parameter-dependent Pvalue version.

5. Summary

Our aim here was to argue that out of thepossible ways that have been proposed to gen-eralise the clustering coefficient to the case of aweighted network, there is one very promising can-didate; namely the Grindrod-Zhang-Horvath ver-sion [8,29]. We gave a natural derivation and illus-trated its behaviour on specic classes of network.Particular advantages of the denition are:

– It is a true generalisation, collapsing smoothlyto the binary case when edge weights tend to{0, 1} values.

– It can provide meaningful results in caseswhere any type of binary thresholding pro-duces break-down.

– It reveals natural topological properties of real

networks, and can do this without the needto specify parameters or discard potentiallyuseful data.

– For microarray data, the weighted clusteringcoefficient coefficient lead to a clear hypothe-sis about the difference between normal andcancerous states.

Page 9: A Clustering Coeficient for Weighted Complex Networks

8/10/2019 A Clustering Coeficient for Weighted Complex Networks

http://slidepdf.com/reader/full/a-clustering-coeficient-for-weighted-complex-networks 9/9

G. Kalna et al. / A Clustering Coefficient for Weighted Networks 9

Acknowledgements

Both authors are supported by EPSRC grantsGR/S62383/01 and EP/EO49370/1.

References

[1] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma,I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet,T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti,T. Moore, J. H. Jr, L. Lu, D. B. Lewis, R. Tibshi-rani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D.Weisenburger, J. O. Armitage, R. Warnke, R. Levy,W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein,P. O. Brown, and L. M. Staudt, Distinct types of dif-fuse large b-cell lymphoma identied by gene expres-sion proling, Nature 403 (2000), 503–511.

[2] A. Barabasi and R. Albert, Emergence of scaling inrandom networks, Science 286 (1999), 509–512.

[3] A. Barrat, M. Barthelemy, R. Pastor-Satorras, andA. Vespignani, The architecture of complex weighted

networks, PNAS 101 (2004), 3747–3752.[4] P. Brown and D. Botstein, Exploring the new world

of the genome with dna microarrays, Nature Genet. 21

(1999), 33–37.[5] X. Chen, S. Cheung, S. So, S. Fan, C. Barry, J. Higgins,

K. Lai, J. Ji, S. Dudoit, I. Ng, M. Van De Rijn, andP. Botstein, D. Brown, Gene expression patterns inhuman liver cancers, Molecular Biology of the Cell 13

(2002), 1929–1939.[6] J. Choi, U. Yu, O. Yoo, and S. Kim, Differential co-

expression analysis using microarray data and its ap-plication to human cancer, Bioinformatics 21 (2005),4348–4355.

[7] T. Golub, D. Slonim, P. Tamayo, C. Huard,M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh,J. Downing, M. Caliguri, C. Bloomeld, and E. Lan-der, Molecular classication of cancer: class discoveryand class prediction by gene expression monitoring,Science 286 (1999), 531–537.

[8] P. Grindrod, Range-dependent random graphs andtheir application to modeling large small-world pro-teome datasets, Physical Review E 66 (2002), 066702.

[9] G. Hardiman, Microarray platforms - comparisons andcontrasts, Pharmacogenomics 5 (2004), 487–502.

[10] D. J. Higham, Spectral reordering of a range-dependent weighted random graph, IMA Journal of Numerical Analysis 25 (2005), 443–457.

[11] D. J. Higham, G. Kalna, and M. Kibble, Spectral clus-tering and its use in bioinformatics, J.Computat. Appl.

Math. 204 (2007), 25–37.[12] D. J. Higham, G. Kalna, and K. Vass, Spectral anal-

ysis of two-signed microarray expression data, Mathe-matical Medicine and Biology 24 (2007), 131–148.

[13] Y. Hu and J. A. Scott, Hsl mc73: A fast multileveledler and prole reduction code, Technical ReportRAL-TR-2003-036, Rutherford Appleton Laboratory,Oxfordshire, 2003.

[14] R. Khanin and E. Wit, How scale-free are gene net-works?, J. Comput. Biol 13 (2006), 810–818.

[15] Y. Kluger, R. Basri, J. Chang, and M. Gerstein, Spec-tral biclustering of microarray data: coclustering genesand conditions, Genome Research 13 (2003), 703–716.

[16] L. Lopez-Fernandez, G. Robles, and J. Gonzalez-

Barahona, Applying social network analysis to theinformation in cvs repositories, in: Proc. of the 1st Intl. Workshop on Mining Software Repositories(MSR2004) , (2004), pp. 101–105.

[17] M. Newman, The structure and function of complexnetworks, SIAM Review 45 (2003), 167–256.

[18] J.-P. Onnela, J. Sarami, J. Kertz, and K. Kaski, In-tensity and coherence of motifs in weighted complexnetworks, Phys. Rev. E 71 (2005), 065103.

[19] N. Przulj, D. Corneil, and I. Jurisica, Modeling in-teractome: Scale-free or geometric?, Bioinformatics 20

(2004), 3508–3515.[20] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee,

C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Lat-ulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda,

E. Lander, and T. Golub, Multiclass cancer diagno-sis using tumor gene expression signatures, PNAS 98

(2001), 15149–15154.[21] J. Rougemont and P. Hingamp, Dna microarray data

and contextual analysis of correlation graphs, BMC Bioinformatics 4 (2003), 4–15.

[22] T. Schank and D. Wagner, Approximating clusteringcoefficient and transitivity, J. of Graph Algorithms and Applications 9 (2005), 265–275.

[23] M. Segal, Microarray gene expression data with linkedsurvival phenotypes: Diffuse large-b-cell lymphoma re-visited, Biostatistics 7 (2006), 268–285.

[24] M. Shena, D. Shalon, R. David, and P. Brown, Quan-titative monitoring of. gene expression patterns witha complementary dna microarray, Science 270 (1995),467–470.

[25] T. Sorlie, C. Perou, R. Tibshirani, T. Aas, S. Geisler,H. Johnsen, T. Hastie, M. Eisen, M. van de Rijn,S. Jeffrey, T. Thorsen, H. Quist, J. Matese, P. Brown,D. Botstein, P. E. Lonning, and A. Borresen-Dale,Gene expression patterns of breast carcinomas dis-tinguish tumor subclasses with clinical implications,PNAS 98 (2001), 10869–10874.

[26] E. Sterrenburg, R. Turk, J. Boer, G. van Ommen, andJ. den Dunnen, A common reference for cdna microar-ray hybridizations, Nucleic Acids Res. 30 (2002), e116.

[27] J. Stuart, E. Segal, D. Koller, and S. Kim, A gene-coexpression network for global discovery of conservedgenetic modules, Science 302 (2003), 249–255.

[28] D. Watts and S. Strogatz, Collective dynamics of small-world networks, Nature 393 (1998), 440–442.

[29] B. Zhang and S. Horvath, A general framework forweighted gene co-expression network analysis, Statis-tical Applications in Genetics and Molecular Biology 4 , (2005).