Cluster stability

Nees Jan van Eck and Ludo Waltman

Centre for Science and Technology Studies (CWTS), Leiden University

Workshop “Comparison of Algorithms”, Amsterdam

April 20, 2015

Problem statement

• A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)
• A clustering technique can be used to force each publication to be assigned to a cluster
• However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?
• The assignment of a publication to a cluster may be based on very little information (e.g., a single citation relation)

Example: Waltman and Van Eck (2012)

Cluster stability

• To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters
• Essentially, a cluster is stable if it is insensitive to small changes in the underlying data
• Bootstrapping is used to make small changes in the data
• There is no formal statistical framework
• To some extent, this resembles the stability intervals in the CWTS Leiden Ranking

Identification of stable clusters: Step 1

• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap citation networks
• In each bootstrap citation network, perform clustering:
– Clustering technique of Waltman and Van Eck (2012)
– User-defined resolution parameter
– Smart local moving algorithm of Waltman and Van Eck (2013)
• For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
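The last bullet of Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `co_clustering_proportions`, the input layout (one dict per bootstrap run, mapping publication to cluster label) and the toy data are all assumptions made here. The clustering itself is not shown; the cluster labels are taken as given.

```python
from collections import Counter
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Return {(a, b): proportion of runs in which a and b share a cluster}.

    `clusterings` is a list with one dict per bootstrap run, mapping a
    publication id to its cluster label in that run (hypothetical layout).
    Pairs that never share a cluster are omitted (their proportion is 0).
    """
    counts = Counter()
    for assignment in clusterings:
        # Group the publications of this run by cluster label.
        members = {}
        for pub, label in assignment.items():
            members.setdefault(label, []).append(pub)
        # Count every pair of publications that shares a cluster in this run.
        for pubs in members.values():
            for a, b in combinations(sorted(pubs), 2):
                counts[(a, b)] += 1
    n_runs = len(clusterings)
    return {pair: count / n_runs for pair, count in counts.items()}


# Toy example: three publications, four bootstrap clustering results.
runs = [
    {"p1": 0, "p2": 0, "p3": 1},
    {"p1": 0, "p2": 0, "p3": 0},
    {"p1": 1, "p2": 1, "p3": 0},
    {"p1": 0, "p2": 1, "p3": 1},
]
proportions = co_clustering_proportions(runs)
```

Enumerating pairs is quadratic in cluster size, so for networks of the size discussed in these slides a sparse or thresholded variant would probably be needed.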

Identification of stable clusters: Step 2

• Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results
• Identify connected components in the newly created network
• Each connected component represents a stable cluster
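Step 2 amounts to thresholding the pairwise proportions and finding connected components. The sketch below uses a union-find structure for the components. The function name `stable_clusters`, the input layout and the toy proportions are assumptions for illustration only.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Return the connected components of the thresholded network as sets.

    `proportions` maps a pair (a, b) to the share of bootstrap clustering
    results in which a and b are in the same cluster (hypothetical layout).
    """
    parent = {p: p for p in publications}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # An edge exists only if the pair co-occurs often enough; merge its ends.
    for (a, b), proportion in proportions.items():
        if proportion >= threshold:
            parent[find(a)] = find(b)

    components = {}
    for p in publications:
        components.setdefault(find(p), set()).add(p)
    return list(components.values())


# Toy example: p1, p2 and p3 are linked by edges above the threshold.
pubs = ["p1", "p2", "p3", "p4"]
props = {("p1", "p2"): 0.95, ("p2", "p3"): 0.92, ("p3", "p4"): 0.40}
clusters = stable_clusters(pubs, props)
```

In this sketch a publication with no edge above the threshold ends up as a singleton component, which is one way to represent a publication without a stable assignment.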

Non-parametric bootstrapping

• Sample with replacement from the set of all citation relations between publications
• Make sure to obtain a sample that is of the same size as the original set of citation relations
• Some citation relations will occur multiple times in the sample, others won’t occur in it at all
• Based on the sampled citation relations, create a bootstrap citation network
• Edges have integer weights in this network
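A minimal sketch of the non-parametric procedure described above, assuming the citation relations are available as a list of (citing, cited) pairs. The function name `nonparametric_bootstrap` and the toy data are hypothetical.

```python
import random
from collections import Counter


def nonparametric_bootstrap(citations, rng):
    """Draw len(citations) citation relations with replacement.

    Returns a Counter mapping each sampled relation to its integer edge
    weight; relations that were never drawn are simply absent.
    """
    sample = rng.choices(citations, k=len(citations))
    return Counter(sample)


# Toy example with four citation relations and a fixed seed.
citations = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
rng = random.Random(42)
network = nonparametric_bootstrap(citations, rng)
```

Whatever the seed, the edge weights sum to the number of original citation relations, which is the "same size" requirement from the second bullet.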

Parametric bootstrapping

• A bootstrap citation network is a weighted variant of the original citation network, with each edge having an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
• Total edge weight in the bootstrap citation network will be approximately equal to the number of edges in the original network

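• For large networks, parametric and non-parametric bootstrapping give nearly identical results: the number of times a citation relation is drawn when sampling with replacement is approximately Poisson distributed with mean 1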

• We use parametric bootstrapping
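The parametric variant can be sketched with the standard library alone. The Poisson sampler below uses Knuth's multiplication method, chosen here only to keep the example dependency-free. The function names and the toy network are assumptions, not the authors' code.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson-distributed integer (Knuth's method, small means)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def parametric_bootstrap(citations, rng):
    """Give every original edge an independent Poisson(1) integer weight.

    Edges that draw weight 0 are left out of the bootstrap network.
    """
    weights = {}
    for edge in citations:
        weight = poisson_sample(1.0, rng)
        if weight > 0:
            weights[edge] = weight
    return weights


# Toy network: a chain of 10,000 citation relations, fixed seed.
citations = [(i, i + 1) for i in range(10_000)]
rng = random.Random(0)
network = parametric_bootstrap(citations, rng)
total_weight = sum(network.values())
kept_share = len(network) / len(citations)
```

With 10,000 edges, `total_weight` should land close to 10,000 and `kept_share` close to 1 - e^(-1) ≈ 0.63, since a Poisson(1) draw is zero with probability e^(-1). If NumPy is available, `numpy.random.Generator.poisson` could replace the hand-written sampler.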

Data

• Library & Information Sciences (LIS):

– Time period: 1996-2013

– Publications: 31,534

– Citation links: 131,266

• Astrophysics (Berlin dataset):

– Time period: 2003-2010

– Publications: 101,828

– Citation links: 924,171

Cluster stability LIS

Stable clusters LIS (resolution 2)

Stable clusters LIS (resolution 2)

Cluster stability Berlin

Cluster stability: LIS and Berlin

Conclusions

• What is a good clustering of publications?
– High accuracy: Publications in the same cluster are topically related
– High level of detail: It is possible to have a large number of clusters
– Comprehensiveness: The clustering includes all publications
– Uniformity in cluster size: Clusters are of roughly the same size
• It seems impossible to obtain a clustering that has all properties listed above
• At least one property needs to be given up

Conclusions

• Why can't we have an accurate and detailed clustering that includes all publications?
– Consider the field of scientometrics
– We would expect an accurate and detailed clustering to have clusters dealing with topics such as indicators, science mapping, collaboration, patents, etc.
– However, many publications in scientometrics (e.g., case studies) do not neatly belong to one of these topics and therefore cannot be accurately assigned to a cluster
• If we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications
• The clustering covers only publications related to the main topics in the fields

Conclusions

• Analysis of cluster stability offers an approach to distinguish between meaningful and non-meaningful assignments of publications to clusters
• Clustering based on direct citations is computationally attractive but ignores relevant information (e.g., bibliographic coupling)
• A post-processing procedure can be developed to try to assign ‘isolated publications’ to stable clusters based on additional information
• Cluster stability is a general idea that can also be applied to other clustering approaches

References

Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694

Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748

Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0
