Cluster stability

Cluster stability Nees Jan van Eck and Ludo Waltman Centre for Science and Technology Studies (CWTS), Leiden University Workshop “Comparison of Algorithms”, Amsterdam April 20, 2015


Page 1: Cluster stability

Cluster stability

Nees Jan van Eck and Ludo Waltman

Centre for Science and Technology Studies (CWTS), Leiden University

Workshop “Comparison of Algorithms”, Amsterdam

April 20, 2015

Page 2: Cluster stability

Problem statement

• A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)

• A clustering technique can be used to force each publication to be assigned to a cluster

• However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?

• The assignment of a publication to a cluster may be based on very little information (e.g., a single citation relation)


Page 3: Cluster stability

Example: Waltman and Van Eck (2012)


Page 4: Cluster stability

Cluster stability

• To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters

• Essentially, a cluster is stable if it is insensitive to small changes in the underlying data

• Bootstrapping is used to make small changes in the data

• There is no formal statistical framework

• To some extent, this resembles the stability intervals in the CWTS Leiden Ranking


Page 5: Cluster stability

Identification of stable clusters: Step 1

• Collect the citation network of publications

• Create a large number (e.g., 100) of bootstrap citation networks

• In each bootstrap citation network, perform clustering:

– Clustering technique of Waltman and Van Eck (2012)

– User-defined resolution parameter

– Smart local moving algorithm of Waltman and Van Eck (2013)

• For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
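The last bullet of Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `co_clustering_proportions` and the toy clustering results are assumptions, and the clustering itself (Waltman and Van Eck, 2012, 2013) is taken as given, with each bootstrap run supplied as a mapping from publication to cluster label.

```python
from collections import Counter
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Share of bootstrap runs in which each pair of publications co-occurs.

    `clusterings` is a list of dicts mapping publication -> cluster label,
    one dict per bootstrap citation network (hypothetical input format).
    Pairs that never share a cluster are omitted (their proportion is 0).
    """
    together = Counter()
    for labels in clusterings:
        # Group publications by cluster label within this run.
        members = {}
        for pub, cluster in labels.items():
            members.setdefault(cluster, []).append(pub)
        # Count every pair inside each cluster once for this run.
        for pubs in members.values():
            for pair in combinations(sorted(pubs), 2):
                together[pair] += 1
    return {pair: count / len(clusterings) for pair, count in together.items()}


# Three made-up bootstrap clustering results for four publications.
# Cluster labels need not match across runs; only co-membership matters.
runs = [
    {"p1": 0, "p2": 0, "p3": 1, "p4": 1},
    {"p1": 0, "p2": 0, "p3": 0, "p4": 1},
    {"p1": 5, "p2": 5, "p3": 7, "p4": 7},
]
proportions = co_clustering_proportions(runs)
```

Enumerating pairs only within clusters avoids touching the full set of publication pairs, but the cost is still quadratic in the size of the largest cluster, so a real dataset would likely need a more economical representation.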


Page 6: Cluster stability

Identification of stable clusters: Step 2

• Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results

• Identify connected components in the newly created network

• Each connected component represents a stable cluster
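Step 2 amounts to thresholding the pairwise proportions and finding connected components. The sketch below uses a small union-find structure for the components; the function name `stable_clusters`, the input format, and the toy proportions are assumptions made for illustration only.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Connected components of the network that links two publications
    whenever their co-clustering proportion is at least `threshold`.

    `proportions` maps (pub_a, pub_b) -> proportion (hypothetical format).
    """
    parent = {pub: pub for pub in publications}

    def find(pub):
        # Follow parent links to the component root, halving the path.
        while parent[pub] != pub:
            parent[pub] = parent[parent[pub]]
            pub = parent[pub]
        return pub

    for (a, b), share in proportions.items():
        if share >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    components = {}
    for pub in publications:
        components.setdefault(find(pub), set()).add(pub)
    return list(components.values())


# Made-up proportions: p1-p2 and p2-p3 pass the 0.9 threshold, p3-p4 does not.
proportions = {("p1", "p2"): 1.0, ("p2", "p3"): 0.95, ("p3", "p4"): 0.4}
clusters = stable_clusters(["p1", "p2", "p3", "p4", "p5"], proportions)
```

In this sketch, publications without any sufficiently strong tie (here p4 and p5) end up as singleton components. Note also that p1 and p3 land in the same component through p2 even though no direct p1-p3 proportion is given, which is a property of using connected components.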


Page 7: Cluster stability

Non-parametric bootstrapping

• Sample with replacement from the set of all citation relations between publications

• Make sure to obtain a sample that is of the same size as the original set of citation relations

• Some citation relations will occur multiple times in the sample, others won’t occur in it at all

• Based on the sampled citation relations, create a bootstrap citation network

• Edges have integer weights in this network
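The resampling described above can be sketched with the Python standard library. The function name `nonparametric_bootstrap` and the toy edge list are assumptions; citation relations are represented as (citing, cited) pairs purely for illustration.

```python
import random
from collections import Counter


def nonparametric_bootstrap(edges, rng):
    """Resample citation relations with replacement.

    Returns a mapping edge -> integer weight, where the weight is the
    number of times the edge was drawn. Edges that were never drawn are
    absent from the result.
    """
    sample = rng.choices(edges, k=len(edges))  # same size as the original set
    return Counter(sample)


# Toy citation relations; not taken from the datasets in this talk.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
weights = nonparametric_bootstrap(edges, random.Random(0))
```

By construction the weights always sum to the number of original citation relations, which is the "same size" requirement in the second bullet.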


Page 8: Cluster stability

Parametric bootstrapping

• A bootstrap citation network is a weighted variant of the original citation network, with each edge having an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)

• Total edge weight in the bootstrap citation network will be approximately equal to the number of edges in the original network

• For large networks, parametric and non-parametric bootstrapping coincide

• We use parametric bootstrapping
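A minimal sketch of the parametric variant follows. The Python standard library has no Poisson sampler, so the sketch draws variates with Knuth's multiplication method; the names `poisson` and `parametric_bootstrap`, the synthetic chain of 10,000 edges, and the choice to drop zero-weight edges are all assumptions for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson variate with mean `lam` (Knuth's method, small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def parametric_bootstrap(edges, rng):
    """Give every original edge an independent Poisson(1) integer weight.

    Edges that draw weight 0 are left out of the bootstrap network.
    """
    weights = {}
    for edge in edges:
        w = poisson(1.0, rng)
        if w > 0:
            weights[edge] = w
    return weights


# Synthetic chain of citation relations, used only to check the totals.
edges = [(i, i + 1) for i in range(10_000)]
weights = parametric_bootstrap(edges, random.Random(0))
total_weight = sum(weights.values())
dropped_share = 1 - len(weights) / len(edges)
```

Here `total_weight` should land close to 10,000 and `dropped_share` close to exp(-1), about 0.37. The link to the non-parametric scheme is that, when m edges are resampled with replacement, each edge's count follows a Binomial(m, 1/m) distribution, which approaches Poisson(1) as m grows.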


Page 9: Cluster stability

Data

• Library & Information Sciences (LIS):

– Time period: 1996-2013

– Publications: 31,534

– Citation links: 131,266

• Astrophysics (Berlin dataset):

– Time period: 2003-2010

– Publications: 101,828

– Citation links: 924,171


Page 10: Cluster stability

Cluster stability LIS


Page 11: Cluster stability

Stable clusters LIS (resolution 2)


Page 12: Cluster stability

Stable clusters LIS (resolution 2)


Page 13: Cluster stability

Cluster stability Berlin


Page 14: Cluster stability

Cluster stability

[Figure panels: LIS, Berlin]

Page 15: Cluster stability

Conclusions

• What is a good clustering of publications?

– High accuracy: Publications in the same cluster are topically related

– High level of detail: It is possible to have a large number of clusters

– Comprehensiveness: The clustering includes all publications

– Uniformity in cluster size: Clusters are of roughly the same size

• It seems impossible to obtain a clustering that has all properties listed above

• At least one property needs to be given up


Page 16: Cluster stability

Conclusions

• Why can we not have an accurate and detailed clustering that includes all publications?

– Consider the field of scientometrics

– We would expect an accurate and detailed clustering to have clusters dealing with topics such as indicators, science mapping, collaboration, patents, etc.

– However, many publications in scientometrics (e.g., case studies) do not neatly belong to one of these topics and therefore cannot be accurately assigned to a cluster

• If we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications

• The clustering covers only publications related to the main topics in the fields

Page 17: Cluster stability

Conclusions

• Analysis of cluster stability offers an approach to distinguish between meaningful and non-meaningful assignments of publications to clusters

• Clustering based on direct citations is computationally attractive but ignores relevant information (e.g., bibliographic coupling)

• A post-processing procedure can be developed to try to assign ‘isolated publications’ to stable clusters based on additional information

• Cluster stability is a general idea that can also be applied to other clustering approaches


Page 18: Cluster stability

References

Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694

Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748

Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0
