Miloš Savić Department of Mathematics and Informatics,
Faculty of Sciences, University of Novi Sad, Serbia
Analysis of Annotated Social and Information Networks: Methods
and Applications
Outline
● Introduction
● Fundamentals of complex network analysis
● Methods for annotated networks
● Case study 1 — analysis of enriched co-authorship networks
● Case study 2 — analysis of enriched ontology networks
● Conclusions
Network
● Network: a graph representing interactions or relations among constituent entities of a complex system
● Terminology for entities and interactions across fields:
  Entities   Interactions     Field
  vertex     edge, arc        math
  node       link             computer science
  site       bond             physics
  actor      tie, relation    sociology
Newman’s classification of complex networks
● Technological networks
  ○ Networks representing engineered man-made systems
● Social networks
  ○ Interactions and relationships among social entities
● Information networks
  ○ Connections between data items
● Biological networks
  ○ Networks representing biological systems and processes
Social networks
● Social network: network-structured data describing interactions or relations among social entities
● Social entities
  ○ individuals, social groups, institutions, organizations, companies, political parties, nations
● Social links
  ○ opinions on other individuals (signed social networks)
  ○ transfers of material resources
  ○ links denoting collaboration, cooperation and coalition
  ○ links resulting from behavioral interactions
  ○ links imposed by formal relations within formally organized social groups
  ○ links on social networking sites
  ○ …
Information networks
● Networks depicting relations/dependencies between data items
● WWW networks
  ○ nodes: WWW pages
  ○ links: hyperlinks (directed links)
● Citation networks: references between documents
  ○ scientific papers, patents, legal documents
● Linguistic networks
  ○ Semantic: semantic relationships (e.g., synonyms or antonyms) between words or concepts
  ○ Structural: word co-occurrence networks and sentence similarity networks
● Recommender networks
  ○ bipartite graphs showing preferences of individuals towards some items
● Ontology networks (knowledge graphs)
  ○ relationships between ontological entities (concepts, roles, individuals)
  ○ dependencies between ontology modules of a modular ontology
● Tabular datasets can be transformed to information networks
  ○ Nodes: data items or features themselves
  ○ k-nearest neighbors networks: A → B if B is among the top k nearest data items to A
  ○ eps-radius networks: A and B are connected if distance(A, B) < Eps
  ○ feature correlation networks: two features are connected if there is a strong correlation between them
Savić et al. A Feature Selection Method Based on Feature Correlation Networks. In Proc. of MEDI’2017, pp. 248-261, 2017.
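The k-nearest neighbors and eps-radius constructions can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a small in-memory dataset; the function names and the toy points are hypothetical and not taken from the tutorial.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_network(items, k):
    """Directed links A -> B where B is among the k nearest items to A."""
    links = set()
    for a, pa in items.items():
        # Sort all other items by their distance to A (ties broken by name).
        others = sorted((euclidean(pa, pb), b) for b, pb in items.items() if b != a)
        for _, b in others[:k]:
            links.add((a, b))
    return links

def eps_radius_network(items, eps):
    """Undirected links (A, B), A < B, where distance(A, B) < eps."""
    names = sorted(items)
    return {
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if euclidean(items[a], items[b]) < eps
    }

# Hypothetical toy dataset: three points on a line.
points = {"p": (0.0, 0.0), "q": (1.0, 0.0), "r": (5.0, 0.0)}
knn = knn_network(points, 1)
eps = eps_radius_network(points, 2.0)
```

Note that the k-nearest neighbors network is directed (r points to q, but q does not point back to r), while the eps-radius network is symmetric by construction.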
Annotated networks
● Networks whose nodes are augmented with attributes
  ○ labels/categorical attributes: the value of an attribute is restricted to a set of specified categories
  ○ attributes with numerical values
  ○ free text
● In this tutorial: networks whose nodes are enriched with both domain-independent metrics used in complex network analysis and domain-dependent metrics
  ○ enriched co-authorship networks
    ○ metrics quantifying various determinants of research performance
  ○ enriched ontology networks
    ○ ontology metrics used to evaluate the complexity and design quality of ontologies
Complex network analysis
• Quantitative methods for studying the structure and evolution of complex networks
• Analysis of direct and indirect connectivity of nodes, identification of connectivity trends and patterns
• Centrality metrics and algorithms — identification of the most important nodes and links in a network
• Network comprehension — identification of cohesive subgraphs (clusters/communities), analysis of connectivity between and within clusters
• Identification of evolutionary trends and principles that can explain the evolution of a network at the microscopic, mesoscopic and macroscopic level
• …
Connected components in undirected networks
• Connected undirected graph: there is a path between any two nodes
• If a network is not a connected graph, then it consists of multiple connected components
• Components can be identified by BFS/DFS
• Giant connected component: a component encompassing a vast majority of nodes
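A minimal sketch of component identification by BFS, assuming the network is stored as a dictionary of adjacency sets. The toy network and the names used are hypothetical.

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy network: adjacency sets keyed by node.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
components = connected_components(adj)
# Fraction of nodes in the largest component; a value close to 1
# would indicate a giant connected component.
giant_fraction = len(components[0]) / len(adj)
```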
Components in directed networks
• Weakly connected components
  • connected components in the undirected projection of a directed network
• Strongly connected components
  • for every two nodes A and B in the component there is a directed path from A to B and a directed path from B to A
Weakly connected components: {A, B, C, D, E} {F, G, H}
Strongly connected components: {B, C, D, E} {G, H}
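The two notions can be sketched as follows. The edge list is hypothetical: it was chosen only so that the resulting components match the ones listed above, since the original figure is not available. Strongly connected components are found with Kosaraju's two-pass DFS (written recursively, so it is suitable only for small graphs).

```python
def weakly_connected_components(succ):
    """Connected components of the undirected projection of a digraph."""
    und = {v: set() for v in succ}
    for u, targets in succ.items():
        for v in targets:
            und[u].add(v)
            und[v].add(u)
    components, seen = [], set()
    for start in und:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for v in und[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)
    return components

def strongly_connected_components(succ):
    """Kosaraju's algorithm: DFS finish order, then DFS on the reversed graph."""
    pred = {v: set() for v in succ}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in succ[u]:
            if v not in seen:
                visit(v)
        order.append(u)  # u is finished

    for u in succ:
        if u not in seen:
            visit(u)
    components, assigned = [], set()

    def collect(u, component):
        assigned.add(u)
        component.add(u)
        for v in pred[u]:
            if v not in assigned:
                collect(v, component)

    for u in reversed(order):
        if u not in assigned:
            component = set()
            collect(u, component)
            components.append(component)
    return components

# Hypothetical digraph; every node must appear as a key.
graph = {"A": {"B"}, "B": {"C"}, "C": {"D"}, "D": {"E"}, "E": {"B"},
         "F": {"G"}, "G": {"H"}, "H": {"G"}}
wcc = {frozenset(c) for c in weakly_connected_components(graph)}
# Non-trivial SCCs only (single nodes are trivially strongly connected).
scc = {frozenset(c) for c in strongly_connected_components(graph) if len(c) > 1}
```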
Node degree
• degree(x) = the number of links incident with x = the number of x’s neighbors
• the most basic metric to assess node importance
  • e.g. in social networks, degree is a metric of social capital: a higher number of contacts → broader possibilities to spread ideas/opinions/interests and influence others
• Directed networks: in-degree and out-degree
• Isolated nodes and hubs
Example (node degrees from the figure):
  Node   Degree
  1      2
  2      2
  3      3
  4      4
  5      2
  6      1
Core-periphery structure
● Assortative networks with localized hubs
● k-core: the maximal subgraph S in which every node has a degree higher than or equal to k within S
    void identifyCore(int k) {
        while the network contains a node whose degree is < k:
            remove nodes whose degree is < k
        the remaining nodes constitute the k-core
    }
● Localized hubs: a k-core for a large k is a connected graph or has a giant connected component
k-core decomposition
• k-cores are nested
• shell-index(x) = k: x belongs to the k-core, but not to the (k+1)-core
• Hubs with
  • high shell-index: hubs connected to other hubs
  • low shell-index: hubs connected to low-degree nodes
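The pruning procedure and the shell index can be sketched as below. This is a straightforward, unoptimized illustration on a dictionary of adjacency sets; the function names and the toy network are assumptions made for the example.

```python
def k_core(adj, k):
    """Repeatedly remove nodes of degree < k; what remains is the k-core."""
    core = {v: set(nbs) for v, nbs in adj.items()}  # work on a copy
    while True:
        low = [v for v, nbs in core.items() if len(nbs) < k]
        if not low:
            return core
        for v in low:
            # Remove v and detach it from the neighbors that are still present.
            for nb in core.pop(v):
                if nb in core:
                    core[nb].discard(v)

def shell_indices(adj):
    """shell-index(x) = the largest k such that x belongs to the k-core."""
    shell = {v: 0 for v in adj}
    k = 1
    core = k_core(adj, k)
    while core:
        for v in core:
            shell[v] = k
        k += 1
        # k-cores are nested, so the next core can be pruned from the current one.
        core = k_core(core, k)
    return shell

# Hypothetical toy network: a triangle a-b-c, a pendant node d, an isolated node e.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}, "e": set()}
two_core = set(k_core(adj, 2))
shell = shell_indices(adj)
```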
Centrality metrics
• Metrics to rank and identify the most important nodes/links in the network
• Fundamental node centrality metrics originate from social network analysis
  • Betweenness centrality
  • Closeness centrality
  • Eigenvector centrality
• Information retrieval
  • centrality metrics for directed graphs inspired by eigenvector centrality
  • PageRank and HITS hub and authority scores
Betweenness centrality
● A node is important if it is located on a large number of shortest paths between other nodes
  ○ Such a node is in a position to control, maintain and influence information flow through the network
Closeness centrality
● A node is important if it is in proximity to a large number of other nodes
● Spreading/diffusion processes: information originating at nodes having a high closeness centrality quickly propagates through the network
Eigenvector centrality
● Recursively defined centrality: a node is important if it is directly connected to other important nodes
● EVC can be computed by successive approximations starting from a configuration in which all nodes have equal EVC
● PageRank and HITS hub/authority scores are variants of EVC for directed networks
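The successive-approximation scheme for EVC can be sketched as a power iteration. This is a minimal illustration, assuming an undirected connected network stored as adjacency sets. Each node keeps its own score in the update (iterating with A + I instead of A), a common trick that leaves the ranking unchanged but avoids oscillation on bipartite graphs. The names and the toy network are hypothetical.

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration starting from equal scores; scores are scaled so max = 1."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # A node's new score is its own score plus the scores of its neighbors.
        new = {v: score[v] + sum(score[nb] for nb in adj[v]) for v in adj}
        norm = max(new.values())
        score = {v: s / norm for v, s in new.items()}
    return score

# Hypothetical toy network: a triangle a-b-c with a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
evc = eigenvector_centrality(adj)
```

In this example d has the same single neighbor a that b and c also have, but b and c additionally support each other, so they end up with a higher score than d.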
Node similarity/distance
○ Applications: community detection (hierarchical agglomerative clustering), link prediction and identification of missing links (in the case of networks extracted from incomplete data)
● The length of the shortest path between two nodes
● Similarity based on random walks: the probability that a random walker reaches X from Y in k random walk steps
● The number of common neighbors
● The Jaccard coefficient
○ Other metrics: Adamic-Adar, Katz, personalized PageRank, cosine similarity, SimRank (recursively defined similarity)
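Three of the neighborhood-based similarity metrics listed above can be sketched directly on adjacency sets. The Adamic-Adar form used here is the usual one (sum of 1 / log(degree) over common neighbors); the toy network is hypothetical.

```python
import math

def common_neighbors(adj, x, y):
    """The number of nodes adjacent to both x and y."""
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    """Common neighbors divided by the size of the joint neighborhood."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def adamic_adar(adj, x, y):
    """Common neighbors weighted by 1 / log(degree): rare neighbors count more.

    For distinct x and y, a common neighbor has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

# Hypothetical toy network.
adj = {"a": {"c", "d"}, "b": {"c", "d", "e"}, "c": {"a", "b"},
       "d": {"a", "b"}, "e": {"b"}}
cn = common_neighbors(adj, "a", "b")
jc = jaccard(adj, "a", "b")
aa = adamic_adar(adj, "a", "b")
```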
Community structure
• Community (module, node cluster)
  • a subgraph that is more densely/strongly connected internally than with the rest of the network
• Automatic identification of communities: community detection algorithms
• Overlapping and non-overlapping community partitions
Network comprehension
• Santo Fortunato, 2009, “Community detection in graphs”
• Agglomerative algorithms
• Divisive algorithms
  • Repeatedly remove links that are likely to be inter-community links to form the dendrogram
  • Measures indicating inter-community links: edge betweenness centrality, edge clustering coefficient, edge information centrality
• Modularity-based algorithms
  • heuristics to maximize the modularity measure
  • X: a subgraph in the network
  • Q(X) = the fraction of links in X - the expected fraction of links in X under some null random network model
• Dynamic algorithms
  • Discovering communities by dynamical processes running on the network (e.g. label propagation)
• Methods based on statistical inference
  • fitting stochastic block models
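The modularity of a partition is the sum of Q(X) over its communities. The sketch below assumes an undirected, unweighted network and the configuration model as the null model, under which the expected fraction of links inside X is (degree sum of X / 2m)^2. This choice of null model and the toy network are assumptions made for the example.

```python
def modularity(adj, communities):
    """Sum over communities of (fraction of links inside) - (expected fraction)."""
    m = sum(len(nbs) for nbs in adj.values()) / 2  # total number of links
    q = 0.0
    for community in communities:
        # Each internal link is seen from both endpoints, hence the division by 2.
        internal = sum(1 for u in community for v in adj[u] if v in community) / 2
        degree_sum = sum(len(adj[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q

# Hypothetical toy network: two triangles joined by the single link c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
q_good = modularity(adj, [{"a", "b", "c"}, {"d", "e", "f"}])
q_trivial = modularity(adj, [set(adj)])  # everything in one community
```

Splitting the network into its two triangles yields a clearly positive modularity, while the trivial one-community partition scores zero.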
Analysis of annotated networks
● Analysis of categorically induced subgraphs
  ○ A: a categorical node attribute
  ○ subgraphs induced by nodes having the same value of A
● Are categorically induced subgraphs strong clusters in the network?
  ○ enriched co-authorship networks: do researchers from the same department form a strongly cohesive research community?
  ○ enriched ontology networks: do ontology modules conform to the “high cohesion” design principle (are concepts from a module strongly related)?
● Radicchi et al. notion of clusters in complex networks and graph clustering evaluation (GCE) metrics applied to categorically induced subgraphs in annotated networks
Radicchi et al. definitions of clusters
● C: a subgraph of a network, x: a node in C
● kint(x): the number of intra-subgraph links incident with x, i.e. links connecting x with other nodes in C
  ○ weighted networks: the total weight of intra-subgraph links
  ○ directed networks: the number of intra-subgraph links emanating from x
● kext(x): the number of inter-subgraph links incident with x, i.e. links connecting x with nodes that are not in C
GCE metrics
● C: a subgraph of a network with N nodes
● NC: the number of nodes in C
● GCE metrics based on edge-cut
  ○ EC (the size of the edge-cut of C): the total number/weight of (out-going) links connecting the nodes in C with nodes that are not in C
  ○ IC: the total number/weight of intra-subgraph links in C
  ○ Conductance(C) = EC / (EC + 2IC) [undirected networks]
  ○ Conductance(C) = EC / (EC + IC) [directed networks]
  ○ Expansion(C) = EC / NC
  ○ Cut-ratio(C) = EC / (NC (N - NC)) [only for unweighted networks]
● Lower values of conductance, expansion and cut-ratio indicate more cohesive subgraphs
● Conductance(C) < 0.5 → C is a Radicchi weak cluster
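The edge-cut metrics can be sketched for an undirected, unweighted network as follows. The cut-ratio uses the standard normalization EC / (NC (N - NC)), i.e. the fraction of possible boundary links that actually exist. The code assumes that C has at least one incident link and does not cover the whole network; the names and the toy network are hypothetical.

```python
def edge_cut_metrics(adj, cluster):
    """Conductance, expansion and cut-ratio of a node subset in an undirected graph."""
    n, nc = len(adj), len(cluster)
    # EC: links leaving the cluster, each seen once from its inner endpoint.
    ec = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    # IC: internal links, each seen from both endpoints.
    ic = sum(1 for u in cluster for v in adj[u] if v in cluster) // 2
    return {
        "conductance": ec / (ec + 2 * ic),
        "expansion": ec / nc,
        "cut_ratio": ec / (nc * (n - nc)),
    }

# Hypothetical toy network: two triangles joined by the single link c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
metrics = edge_cut_metrics(adj, {"a", "b", "c"})
is_radicchi_weak = metrics["conductance"] < 0.5
```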
GCE metrics
● C: a subgraph of a network, x: a node in C
● GCE metrics based on degree-fraction (DF)
  ○ kext(x): the number/weight of (out-going) inter-subgraph links incident with x
  ○ D(x): the (out-)degree/strength of x, D(x) = kint(x) + kext(x)
  ○ DF(x) = kext(x) / D(x)
  ○ Maximum-DF(C) = the maximum DF of nodes in C
  ○ Average-DF(C) = the average DF of nodes in C
  ○ Flake-DF(C) = the fraction of nodes in C for which kext(x) < D(x) / 2 (or, equivalently, kint(x) > kext(x))
● Lower values of Maximum-DF and Average-DF and higher values of Flake-DF indicate more cohesive subgraphs
● Flake-DF(C) = 1 → C is a Radicchi strong cluster
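A minimal sketch of the degree-fraction metrics for an undirected, unweighted network. The Flake-DF condition is evaluated as kint(x) > kext(x), which is the same as DF(x) < 0.5. The code assumes that every node in C has at least one link; the names and the toy network are hypothetical.

```python
def degree_fraction_metrics(adj, cluster):
    """Maximum-DF, Average-DF and Flake-DF of a node subset."""
    df = {}
    for x in cluster:
        k_ext = sum(1 for v in adj[x] if v not in cluster)
        df[x] = k_ext / len(adj[x])  # DF(x) = kext(x) / D(x)
    values = list(df.values())
    return {
        "max_df": max(values),
        "avg_df": sum(values) / len(values),
        # Fraction of nodes with more internal than external links.
        "flake_df": sum(1 for v in values if v < 0.5) / len(values),
    }

# Hypothetical toy network: two triangles joined by the single link c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
dfm = degree_fraction_metrics(adj, {"a", "b", "c"})
is_radicchi_strong = dfm["flake_df"] == 1.0
```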
Analysis of annotated networks
● Comparison of categorically induced subgraphs (CISs)
  ○ A: a categorical node attribute
  ○ CISs: subgraphs induced by nodes having the same value of A
  ○ M: a numerical node attribute
● Do nodes from a CIS X tend to have higher values of M compared to nodes from a CIS Y?
  ○ enriched co-authorship networks: do researchers from a department X tend to be more productive/more central in the co-authorship network than researchers from a department Y?
  ○ enriched ontology networks: are concepts from an ontology module X more important than concepts from an ontology module Y?
● Metric-based comparison test based on the MWU test and probabilities of superiority applied to two categorically induced subgraphs
Metric-based comparison test
● X and Y: two independent subsets of nodes in a network
● Thr: a probability threshold indicating a strong stochastic dominance (default Thr = 0.75)
● Metric-based-comparison-test(X, Y, Thr):
    for each numeric attribute M:
        M(X) = the set of M values for X
        M(Y) = the set of M values for Y
        p = apply the MWU (Mann-Whitney U) test to M(X) and M(Y)
        if the null hypothesis is rejected (p < 0.05):
            compute probabilities of superiority PS(X) and PS(Y):
                x = a randomly selected value from M(X)
                y = a randomly selected value from M(Y)
                PS(X) = P(x > y), PS(Y) = P(y > x)
            if PS(X) > Thr or PS(Y) > Thr:
                report not only a statistically significant difference between X and Y regarding M, but a strong tendency of superiority
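The test can be sketched without external libraries. The p-value below comes from the normal approximation of the Mann-Whitney U statistic with a tie correction and no continuity correction, which is an assumption of this sketch; an exact or library implementation may give slightly different p-values for small samples. The function names, metric names and data are hypothetical.

```python
import math

def probability_of_superiority(xs, ys):
    """P(x > y) for x drawn from xs and y drawn from ys, by exhaustive counting."""
    return sum(1 for x in xs for y in ys if x > y) / (len(xs) * len(ys))

def mwu_p_value(xs, ys):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    counts = {}
    for value in list(xs) + list(ys):
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0  # all values are identical
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))

def metric_based_comparison(x_metrics, y_metrics, thr=0.75, alpha=0.05):
    """Return {metric: 'X' or 'Y'} for metrics showing strong superiority."""
    findings = {}
    for metric, xs in x_metrics.items():
        ys = y_metrics[metric]
        if mwu_p_value(xs, ys) >= alpha:
            continue  # no statistically significant difference
        if probability_of_superiority(xs, ys) > thr:
            findings[metric] = "X"
        elif probability_of_superiority(ys, xs) > thr:
            findings[metric] = "Y"
    return findings

# Hypothetical metric values for two node groups X and Y.
x_metrics = {"papers": [10, 12, 14, 15, 18, 20, 22, 25],
             "degree": [3, 5, 4, 6, 5, 4, 3, 6]}
y_metrics = {"papers": [1, 2, 3, 4, 5, 6, 7, 8],
             "degree": [4, 5, 3, 6, 4, 5, 6, 3]}
findings = metric_based_comparison(x_metrics, y_metrics)
```

Here the groups differ strongly in the hypothetical "papers" metric, while their "degree" values are identically distributed, so only the first metric is reported.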
Metric-based comparison test
● The metric-based comparison test can also be applied to two independent sets of nodes determined by some structural criteria, e.g.
  ○ highly and lowly coupled nodes
    ○ highly coupled nodes: the minimal subset of nodes C such that ∑_{x∈C} degree(x) > ∑_{y∉C} degree(y)
  ○ core and periphery nodes when the network has a core-periphery structure
    ○ core nodes: the minimal subset of nodes C such that ∑_{x∈C} shell-index(x) > ∑_{y∉C} shell-index(y)
  ○ nodes belonging to non-trivial strongly connected components and nodes not involved in cyclic dependencies in directed networks
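The minimal subset C in both criteria can be found greedily: take nodes in decreasing order of the metric (degree or shell index) until their total exceeds the total of the remaining nodes. A small sketch, with a hypothetical function name and hypothetical degree values:

```python
def minimal_dominant_subset(metric):
    """Smallest node set whose metric total exceeds the total of all other nodes."""
    total = sum(metric.values())
    chosen, running = [], 0
    # Taking the largest values first minimizes the number of nodes needed.
    for node, value in sorted(metric.items(), key=lambda kv: kv[1], reverse=True):
        if running > total - running:
            break
        chosen.append(node)
        running += value
    return chosen

# Hypothetical node degrees.
degree = {"a": 5, "b": 3, "c": 2, "d": 1, "e": 1}
highly_coupled = minimal_dominant_subset(degree)
lowly_coupled = [v for v in degree if v not in highly_coupled]
```

The two resulting node sets can then serve as the groups X and Y of the metric-based comparison test.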
Analysis of block models of annotated networks
● P: a partition of the set of nodes into k node groups
● Block model corresponding to P
  ○ nodes: node groups
  ○ links: node groups A and B are connected if there is a node from A connected to a node from B
● Block models of annotated networks can be formed in two principal ways:
  ○ according to a categorical node attribute
    ○ e.g. a departmental collaboration network derived from an intra-institutional co-authorship network
  ○ according to a partition obtained after community detection
    ○ e.g. a network of research groups obtained after research groups were identified by a community detection algorithm applied to the co-authorship network
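Deriving a block model from a partition can be sketched as follows. In addition to the existence of a link between two groups, this version counts the underlying node-level links as a simple link weight, which is an addition made for the example. Node identifiers are assumed to be mutually comparable; the names, group labels and toy network are hypothetical.

```python
from collections import Counter

def block_model(adj, group_of):
    """Map each pair of distinct, connected groups to the number of links between them."""
    weights = Counter()
    for u, neighbors in adj.items():
        for v in neighbors:
            gu, gv = group_of[u], group_of[v]
            # u < v ensures each undirected link is counted once.
            if gu != gv and u < v:
                weights[tuple(sorted((gu, gv)))] += 1
    return dict(weights)

# Hypothetical co-authorship network and a partition of its nodes into groups.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
group_of = {"a": "G1", "b": "G1", "c": "G2", "d": "G3"}
blocks = block_model(adj, group_of)
```

The same function covers both ways of forming a block model: `group_of` can hold the values of a categorical attribute or the community labels produced by a community detection algorithm.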
Group superiority graphs of annotated networks
● P: a partition of the set of nodes into k node groups
● The block model corresponding to P shows connections among node groups
● Group superiority graphs (GSGs) corresponding to P are directed graphs reflecting stochastic dominance among node groups with respect to numerical node attributes
  ○ M: a numeric node attribute, A and B: two node groups
  ○ A → B in the GSG of M if nodes in A strongly tend to have higher values of M than nodes in B
  ○ GSGs: graphs derived from a block model according to the metric-based comparison test
Mining attachment preferences in annotated networks
● To which nodes do new nodes connect when joining a network?
● N: an annotated network with k numeric attributes M1, M2, …, Mk
● Na and Nb: two successive evolutionary snapshots of N
● Transition from Na to Nb
  ○ New nodes: nodes in Nb not present in Na
  ○ Nodes in Na can be divided into two categories
    ○ Preferential nodes: nodes to which new nodes attached
    ○ Non-preferential nodes: nodes that are not preferential
● Attachment preferences in the evolutionary transition from Na to Nb can be revealed by the metric-based comparison test
  ○ e.g. {(M3, PREF), (M8, NON-PREF), (M12, PREF)}
● An algorithm for mining frequent itemsets (e.g. Apriori) is applied to the set of attachment preferences of all evolutionary transitions
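Two steps of this procedure can be sketched independently of the comparison test: splitting the nodes of Na into preferential and non-preferential ones, and finding frequent itemsets over the attachment preferences of all transitions. The itemset step below is plain exhaustive counting of small itemsets, not Apriori itself, which is sufficient for a handful of attributes. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def split_by_attachment(adj_a, adj_b):
    """Split the nodes of snapshot Na by whether a new node of Nb attached to them."""
    new_nodes = set(adj_b) - set(adj_a)
    preferential = {v for n in new_nodes for v in adj_b[n] if v in adj_a}
    return preferential, set(adj_a) - preferential

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count all itemsets up to max_size and keep those seen >= min_support times."""
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_size + 1):
            for items in combinations(sorted(transaction), size):
                counts[items] += 1
    return {items: c for items, c in counts.items() if c >= min_support}

# Hypothetical snapshots: c and d are new; c attaches to a, d attaches to c.
adj_a = {"a": {"b"}, "b": {"a"}}
adj_b = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
pref, non_pref = split_by_attachment(adj_a, adj_b)

# Hypothetical attachment preferences of three evolutionary transitions.
transitions = [
    {("M3", "PREF"), ("M8", "NON-PREF")},
    {("M3", "PREF"), ("M12", "PREF")},
    {("M3", "PREF"), ("M8", "NON-PREF")},
]
frequent = frequent_itemsets(transitions, min_support=2)
```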
Research collaboration
• Collaboration: a key social feature of modern science
• Science from a social perspective: a complex self-organizing social system
• Katz: “scientific collaboration is a social process and probably there are as many reasons for researchers to collaborate as there are reasons for people to communicate”
• Research collaboration can be studied at various levels:
  • intra-institutional, inter-institutional, national, international, disciplinary, inter-disciplinary
• Major research questions:
  • How is research collaboration structured?
  • How does the structure of research collaboration evolve?
  • How is research collaboration related to research productivity and the impact of multi-authored publications?
Research collaboration
• Research collaboration may manifest in various formal and informal forms
• Co-authorship: the most visible and well-documented manifestation of scientific collaboration
  • availability of massive bibliographic databases
• Co-authorship networks: social networks encompassing researchers
  • Nodes: researchers
  • A and B are connected if A and B co-authored at least one publication (with or without other co-authors)
  • Link weights: the strength of research collaboration
Link weighting schemes
• Straight scheme
  • w(x, y) = the number of joint publications of x and y
• Salton’s scheme: a normalized variant of the straight scheme
  • h(x): the number of publications (co-)authored by x
  • h(y): the number of publications (co-)authored by y
  • h(x, y): the number of joint publications of x and y
• Newman’s scheme
  • The more authors a paper has, the less weight it should add to the total strength of research collaboration
  • J: the set of joint publications of x and y
  • n(k): the number of authors of publication k
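The three schemes can be sketched as follows. The formulas for the Salton and Newman schemes are not spelled out in this transcript, so the code assumes their commonly used forms: w(x, y) = h(x, y) / sqrt(h(x) · h(y)) for Salton and w(x, y) = sum over k in J of 1 / (n(k) - 1) for Newman. The function names and the publication list are hypothetical.

```python
import math

def straight_weight(publications, x, y):
    """The number of joint publications of x and y."""
    return sum(1 for authors in publications if x in authors and y in authors)

def salton_weight(publications, x, y):
    """Assumed form: h(x, y) / sqrt(h(x) * h(y))."""
    hx = sum(1 for authors in publications if x in authors)
    hy = sum(1 for authors in publications if y in authors)
    hxy = straight_weight(publications, x, y)
    return hxy / math.sqrt(hx * hy) if hxy else 0.0

def newman_weight(publications, x, y):
    """Assumed form: each joint paper k contributes 1 / (n(k) - 1)."""
    return sum(1 / (len(authors) - 1)
               for authors in publications if x in authors and y in authors)

# Hypothetical publication list: each publication is a set of author names.
publications = [{"x", "y"}, {"x", "y", "z"}, {"x", "z"}, {"y"}]
w_straight = straight_weight(publications, "x", "y")
w_salton = salton_weight(publications, "x", "y")
w_newman = newman_weight(publications, "x", "y")
```

Under the Newman form the two-author paper contributes 1 and the three-author paper contributes 1/2, so papers with many authors add less to the collaboration strength of any single pair.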
● Case study: the FS-UNS co-authorship network
  ○ The network reflects intra-institutional research collaboration at FS-UNS (423 FS-UNS researchers from 5 departments)
  ○ The network was extracted from bibliographic records contained in the institutional CRIS-UNS system
    ● No name disambiguation problems
    ● Categorization of publications by the rule book prescribed by the Serbian Ministry of Science → Serbian research competency index metric
  ○ The Newman scheme was used to assign link weights
  ○ Nodes are enriched with metrics quantifying different determinants of research performance
Cohesiveness of research departments
● All FS-UNS departments are Radicchi weak clusters and close to being Radicchi strong clusters in the network
  ○ Intra-department collaborations are stronger than inter-department collaborations for a large majority of researchers, but not for all of them
● The strongest intra-department collaborations: DP and DC
● The weakest intra-department collaborations: DMI
● The most closed department: DG (the highest internal density, the lowest conductance)
Inter-department collaborations
● Researchers involved in inter-department collaborations are drastically more productive, collaborative and institutionally important
● The departmental collaboration network of FS-UNS is a clique, but the strengths of inter-department collaborations are highly unbalanced (a lot of space to improve inter-department collaborations)
Metric-based comparison of departments
● Kruskal-Wallis ANOVA: statistically significant differences (SSD) present regarding SRCI and PRON, but absent regarding PROS and PROF
  ○ SRCI and PRON: biased measures of productivity
● SSD in both local and external collaboration
● No SSD regarding institutional importance
Post-hoc pairwise comparison
● DP and DC: superior regarding SRCI and PRON
● DMI: the lowest degree of both local and external research collaboration
● DC and DG: active stimulation of intra-institutional collaboration
K-core decomposition
• The FS-UNS co-authorship network has a strong and balanced nested core-periphery structure
  • 19 cores, all of them being connected subgraphs in the network
  • the density of cores increases exponentially
  • the fraction of nodes in k-cores decreases linearly with k
• Core researchers: shell-index >= 12 (32% of the total number)
Core vs. peripheral researchers
• Core researchers are drastically more productive, collaborative and institutionally important than peripheral researchers
• Core researchers have a more significant brokerage role within their ego-networks
Identification of research groups
• the best performing algorithm: LV (Louvain)
  • the highest modularity, the lowest ratio of w(inter) and w(intra)
• agglomerative clustering techniques perform better than divisive ones
Research groups identified by Louvain
Collaborations among research groups
• The block model formed according to the partition of nodes obtained by the Louvain algorithm
  • #nodes = 17 (research groups)
  • #links = 79 (collaborations between research groups), 9 strong links
• Expected: groups that have strong collaborations tend to be more institutionally important
• The importance of a research group and the strength of inter-group research collaboration are independent of group size
• Researchers involved in inter-group research collaborations are significantly more productive, collaborative and institutionally important
• Comparison of research groups by analyzing group superiority graphs corresponding to productivity and collaboration metrics
• PRON and SRCI: biased measures of research productivity
Mining attachment preferences
• New FS-UNS researchers tend to attach to highly productive FS-UNS researchers that have established a strong collaboration with their previous co-authors
• Ontology: a formal specification of shared and reusable knowledge
  • Description of concepts (classes) and roles (relationships) in a knowledge domain through a set of axioms in a description logic
  • Backbone of the Semantic Web, specified in OWL
• Monolithic and modular ontology designs
  • monolithic: all captured concepts, roles, axioms and assertions gathered together in one (large) OWL file
  • modular: an ontology that consists of multiple ontology modules connected through the OWL import feature
• Ontology networks: directed graphs showing dependencies between ontological entities
  • Ontology module networks (nodes: ontology modules, links: import relations between modules)
  • Ontology class networks (nodes: classes, links: relations between classes)
  • Ontology subsumption network (nodes: classes, links: subsumption relations between classes)
Modular design principles
• “Low coupling, high cohesion”
  • an ontology module should be loosely coupled to other ontology modules → a low average node degree and the absence of hubs in ontology module networks
  • concepts in an ontology module should be strongly coupled → concepts from the same module form highly cohesive subgraphs (strong clusters) in the ontology class network → GCE metrics as metrics of ontology module cohesion
    • classification of modules as Radicchi strong, Radicchi weak and poorly cohesive ontology modules
• Existing ontology cohesion metrics estimate the cohesiveness of ontology modules in isolation (dependencies to external classes are ignored)
• GCE metrics rely on external class dependencies, thus also taking into account the principle of “low coupling”
Modular design principles
• “Avoid cyclic dependencies”
  • Ontology modules belonging to a strongly connected component of the ontology module network are mutually (directly or indirectly) dependent
  • Large cyclic dependencies negatively impact the following quality attributes:
    • understandability
    • reusability
    • maintainability
• Analysis of strongly connected components in enriched ontology module networks in order to reveal characteristics of modules involved in cyclic dependencies
  • Metric-based comparison test
Case study: enriched ontology module and class networks of SWEET (Semantic Web for Earth and Environmental Terminology)
● Module network
  ○ metrics of internal complexity
  ○ adopted software metrics
  ○ centrality metrics
  ○ Orme et al. coupling metrics
  ○ Tartir et al. diversity metrics
  ○ GCE metrics
● Class networks
  ○ adopted software metrics
  ○ centrality metrics
Connected component analysis
○ SWEET exhibits a high degree of modular and conceptual cohesion
○ The SWEET OMN (ontology module network) is a weakly connected digraph: no isolated modules or small independent clusters of modules
○ The SWEET OCN (ontology class network) is a weakly connected digraph; there is a giant connected component in the taxonomy of concepts
○ All three examined networks are small-world networks
Cyclic dependencies
○ A giant SCC in the SWEET OMN
  ○ large cyclic dependencies among SWEET modules
  ○ small link reciprocity, but a large path reciprocity → cyclic dependencies among SWEET modules are mostly indirect
○ A large number of small-size SCCs in the SWEET OCN; large cyclic dependencies among SWEET classes are absent
  ○ Two classes involved in mutual subsumption relations
○ Metric-based comparison test to determine the differences between modules in the giant SCC and the rest of SWEET modules
  ○ SWEET has a strongly connected core encompassing the most reused and the most important SWEET modules
○ Degree distribution analysis: the SWEET ontology networks contain hubs (highly coupled nodes)
○ Metric-based comparison test: hubs tend to be more voluminous and more functionally important modules than non-hub modules
GCE metrics as ontology metrics
○ M: an ontology module within a modularized ontology
○ G(M): a graph showing dependencies among classes in M
  ○ G(M) is a subgraph of the ontology class network
○ Basic ontology module cohesion metrics ignoring external dependencies
  ○ DEN: the density of G(M)
  ○ COMP: the number of weakly connected components in G(M)
○ Weak correlations → ontology cohesion metrics based solely on internal class dependencies are unable to identify modules whose constituent classes form strong clusters in the OCN
Cohesion of SWEET modules
○ SWEET ontology modules have a satisfactory degree of cohesion
  ○ 18 modules (8.87%) are Radicchi strong clusters
  ○ 195 modules (96.08%) are Radicchi weak clusters
  ○ only 8 modules are poorly cohesive (non-Radicchi-weak clusters)
  ○ poorly cohesive modules have a low centrality in the OMN
Conclusions
○ Methods to analyze annotated social and information networks focused on categorically induced subgraphs, block models and attachment preferences
○ Case studies related to the analysis of annotated networks whose numeric attributes are domain-dependent metrics
  ○ analysis of enriched co-authorship networks
    ○ an in-depth evaluation of research collaboration and mutual relationships between collaboration and other determinants of research performance
  ○ analysis of enriched ontology networks
    ○ evaluation of the design quality of modular ontologies with respect to modular design principles originating from software engineering
○ More about the topics of the tutorial including presented case studies can be found in