Cluster stability
Nees Jan van Eck and Ludo Waltman
Centre for Science and Technology Studies (CWTS), Leiden University
Workshop “Comparison of Algorithms”, Amsterdam
April 20, 2015
Problem statement
• A clustering technique can be used to obtain highly
detailed clustering results (i.e., a large number of
clusters)
• A clustering technique can be used to force each
publication to be assigned to a cluster
• However, in a highly detailed clustering, is the
assignment of publications to clusters still
meaningful?
• The assignment of a publication to a cluster may be
based on very little information (e.g., a single
citation relation)
Example: Waltman and Van Eck (2012)
Cluster stability
• To ensure that publications are assigned to clusters
in a meaningful way, we introduce the notion of
stable clusters
• Essentially, a cluster is stable if it is insensitive to
small changes in the underlying data
• Bootstrapping is used to make small changes in the
data
• There is no formal statistical framework
• To some extent, this resembles the stability
intervals in the CWTS Leiden Ranking
Identification of stable clusters:
Step 1
• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap
citation networks
• In each bootstrap citation network, perform
clustering:
– Clustering technique of Waltman and Van Eck (2012)
– User-defined resolution parameter
– Smart local moving algorithm of Waltman and Van Eck (2013)
• For each pair of publications, calculate the
proportion of the bootstrap clustering results in
which the publications are in the same cluster
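
The last bullet of Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bootstrap clustering results are already available as dictionaries mapping each publication to a cluster label, and the function name `co_clustering_proportions` is hypothetical. The clustering itself (Waltman and Van Eck, 2012, 2013) is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Map each pair (a, b) to the share of clusterings in which a and b co-occur.

    `clusterings` is a list of dicts {publication: cluster label}, one dict
    per bootstrap clustering result (assumed input format).
    Pairs that never share a cluster are omitted (their proportion is 0).
    """
    counts = Counter()
    for assignment in clusterings:
        # Group the publications of this bootstrap result by cluster label.
        members = {}
        for publication, label in assignment.items():
            members.setdefault(label, []).append(publication)
        # Count every unordered pair inside each cluster once.
        for pubs in members.values():
            for a, b in combinations(sorted(pubs), 2):
                counts[(a, b)] += 1
    n = len(clusterings)
    return {pair: count / n for pair, count in counts.items()}
```

Enumerating all pairs within a cluster is quadratic in the cluster size, so this sketch is only practical for small examples or for highly detailed clusterings with small clusters.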
Identification of stable clusters:
Step 2
• Create a network of publications with an edge
between two publications if the publications are in
the same cluster in at least a certain proportion
(e.g., 0.9) of the bootstrap clustering results
• Identify connected components in the newly
created network
• Each connected component represents a stable
cluster
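
Step 2 can be sketched with a union-find structure. This is an illustration under stated assumptions: the pairwise proportions are given as a dictionary keyed by publication pairs, the threshold of 0.9 is the example value mentioned above, and the function name `stable_clusters` is hypothetical.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Return the connected components of the thresholded co-clustering network.

    `proportions` maps pairs (a, b) to the share of bootstrap clustering
    results in which a and b are in the same cluster (assumed input format).
    """
    parent = {p: p for p in publications}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # An edge exists only if the pair co-occurs often enough.
    for (a, b), proportion in proportions.items():
        if proportion >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect publications by the root of their component.
    components = {}
    for p in publications:
        components.setdefault(find(p), set()).add(p)
    return list(components.values())
```

In this sketch a publication without any edge above the threshold ends up in a component of size one.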
Non-parametric bootstrapping
• Sample with replacement from the set of all citation
relations between publications
• Make sure to obtain a sample that is of the same
size as the original set of citation relations
• Some citation relations will occur multiple times in
the sample, others won’t occur in it at all
• Based on the sampled citation relations, create a
bootstrap citation network
• Edges have integer weights in this network
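
A minimal sketch of the resampling described above, assuming the citation relations are given as a list of (citing, cited) pairs; the function name `nonparametric_bootstrap` and the input format are illustrative assumptions.

```python
import random
from collections import Counter


def nonparametric_bootstrap(citations, rng=None):
    """Resample citation relations with replacement, keeping the sample size.

    Returns a Counter mapping each sampled citation relation to its integer
    weight (how often it was drawn). Relations that were never drawn are
    absent, which corresponds to weight 0.
    """
    if rng is None:
        rng = random.Random()
    # Draw as many relations as the original set contains.
    sample = rng.choices(citations, k=len(citations))
    return Counter(sample)
```

Passing a seeded generator, for example `random.Random(0)`, makes a bootstrap network reproducible.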
Parametric bootstrapping
• A bootstrap citation network is a weighted variant
of the original citation network, with each edge
having an integer weight drawn from a Poisson
distribution with mean 1 (cf. Rosvall & Bergstrom,
2009)
• Total edge weight in the bootstrap citation network
will be approximately equal to the number of edges
in the original network
• For large networks, parametric and non-parametric
bootstrapping coincide
• We use parametric bootstrapping
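
A minimal sketch of the parametric variant, again assuming the citation relations are given as a list of (citing, cited) pairs. The function names are hypothetical, and the Poisson sampler uses Knuth's multiplication method so that only the standard library is needed; it is not taken from the slides.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def parametric_bootstrap(citations, rng=None):
    """Give every original edge an independent Poisson(1) integer weight.

    Edges that draw weight 0 are left out of the returned dict.
    """
    if rng is None:
        rng = random.Random()
    weights = {}
    for edge in citations:
        weight = poisson(1.0, rng)
        if weight > 0:
            weights[edge] = weight
    return weights
```

The link with non-parametric bootstrapping is that, when sampling m relations with replacement from m relations, the number of times a given relation is drawn follows a Binomial(m, 1/m) distribution, which approaches a Poisson distribution with mean 1 as m grows.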
Data
• Library & Information Sciences (LIS):
– Time period: 1996-2013
– Publications: 31,534
– Citation links: 131,266
• Astrophysics (Berlin dataset):
– Time period: 2003-2010
– Publications: 101,828
– Citation links: 924,171
Cluster stability LIS (figure)
Stable clusters LIS (resolution 2) (two figures)
Cluster stability Berlin (figure)
Cluster stability: LIS and Berlin (figure)
Conclusions
• What is a good clustering of publications?
– High accuracy: Publications in the same cluster are topically
related
– High level of detail: It is possible to have a large number of
clusters
– Comprehensiveness: The clustering includes all publications
– Uniformity in cluster size: Clusters are of roughly the same size
• It seems impossible to obtain a clustering that has
all properties listed above
• At least one property needs to be given up
Conclusions
• Why can't we have an accurate and detailed
clustering that includes all publications?
– Consider the field of scientometrics
– We would expect an accurate and detailed clustering to have
clusters dealing with topics such as indicators, science mapping,
collaboration, patents, etc.
– However, many publications in scientometrics (e.g., case studies)
do not neatly belong to one of these topics and therefore cannot
be accurately assigned to a cluster
• If we want to have an accurate and detailed
clustering, we need to be satisfied with a clustering
that doesn’t comprehensively cover all publications
• The clustering covers only publications related to
the main topics in the fields
Conclusions
• Analysis of cluster stability offers an approach to
distinguish between meaningful and
non-meaningful assignments of publications to clusters
• Clustering based on direct citations is
computationally attractive but ignores relevant
information (e.g., bibliographic coupling)
• A post-processing procedure can be developed to
try to assign ‘isolated publications’ to stable
clusters based on additional information
• Cluster stability is a general idea that can also be
applied to other clustering approaches
References
Rosvall, M., & Bergstrom, C.T. (2009). Mapping change
in large networks. PLoS ONE, 5(1), e8694.
http://dx.doi.org/10.1371/journal.pone.0008694
Waltman, L., & Van Eck, N.J. (2012). A new methodology
for constructing a publication-level classification
system of science. JASIST, 63(12), 2378-2392.
http://dx.doi.org/10.1002/asi.22748
Waltman, L., & Van Eck, N.J. (2013). A smart local moving
algorithm for large-scale modularity-based community
detection. European Physical Journal B, 86(11), 471.
http://dx.doi.org/10.1140/epjb/e2013-40829-0