Analysis of Annotated Social and Information Networks ...Information Networks: Methods and...

Miloš Savić Department of Mathematics and Informatics,

Faculty of Sciences, University of Novi Sad, Serbia

Analysis of Annotated Social and Information Networks: Methods

and Applications

Outline

● Introduction

● Fundamentals of complex network analysis

● Methods for annotated networks

● Case study 1 — analysis of enriched co-authorship networks

● Case study 2 — analysis of enriched ontology networks

● Conclusions

Network ● Network – a graph representing interactions or relations among

constituent entities of a complex system

entity

interaction

Entities Interactionsvertex edge, arc mathnode link computer sciencesite bond physicsactor tie, relation sociology

Newman’s classification of complex networks

● Technological networks ○ networks representing engineered man-made systems

● Social networks ○ Interactions and relationships among social entities

● Information networks ○ Connections between data items

● Biological networks ○ Networks representing biological systems and processes

Social networks● Social network - network-structured data describing

interactions or relations among social entities

● Social entities ○ individuals, social groups, institutions, organizations,

companies, political parties, nations

● Social links ○ opinions on other individuals (signed social networks) ○ transfers of material resources ○ links denoting collaboration, cooperation and coalition ○ links resulting from behavioral interactions ○ links imposed by formal relations within formally organized

social groups ○ links on social networking sites ○ …

Information networks● Networks depicting relations/dependencies between data items

● WWW networks ○ nodes: WWW pages ○ links: hyperlinks (directed links)

● Citation networks: references between documents ○ Scientific papers, patents, legal documents

● Linguistic networks ○ Semantic: semantic relationships (e.g., synonyms or antonyms)

between words or concepts ○ Structural: word co-occurrence networks and sentence similarity

networks ● Recommender networks

○ Bipartite graphs showing preferences of individuals towards some items

● Ontology networks (knowledge graphs) ○ relationships between ontological entities (concepts, roles,

individuals) ○ dependencies between ontology modules of a modular ontology

● Tabular datasets can be transformed to information networks ○ Nodes: data items or features themselves

○ k-nearest neighbors networks ○ A → B if B is among the top k nearest data items to A

○ eps-radius networks ○ A and B connected if distance(A, B) < Eps

○ feature correlation networks ○ Two features connected if there is a strong correlation between them

Savić et al. A Feature Selection Method Based on Feature Correlation Networks. In Proc. of MEDI’2017, pp. 248-261, 2017.

Annotated networks● Networks whose nodes are augmented with attributes

● labels/categorical attributes: the value of an attribute restricted to a set of specified categories

● attributes with numerical values ● free-text

● In this tutorial: networks whose nodes are enriched with both domain-independent metrics used in complex network analysis and domain-dependent metrics ● enriched co-authorship networks

● metrics quantifying various determinants of research performance

● enriched ontology networks ● ontology metrics used to evaluate the complexity and design

quality of ontologies

Outline

● Introduction





● Conclusions

Complex network analysis• Quantitative methods for studying the structure and

evolution of complex networks

• Analysis of direct and indirect connectivity of nodes, identification of connectivity trends and patterns

• Centrality metrics and algorithms — identification of the most important nodes and links in a network

• Network comprehension — identification of cohesive subgraphs (clusters/communities), analysis of connectivity between and within clusters

• Identification of evolutionary trends and principles that can explain the evolution of a network at the microscopic, mesoscopic and macroscopic level

• …

Connected components in undirected networks• Connected undirected graph — there is a path between any

two nodes • If a network is not a connected graph then it consists of

multiple connected components

• BFS/DFS • Giant connected component: a component encompassing a

vast majority of nodes

Components in directed networks• Weakly connected components

• connected components in the undirected projection of a directed network

• Strongly connected components • for every two nodes A and B

• there is a directed path from A to B, and • a directed path from B to A

Weakly connected components: {A, B, C, D, E} {F, G, H}

Strongly connected components: {B, C, D, E} {G, H}

Node degree• degree(x) = the number of links incident with x

= the number of x’s neighbors • the most basic metric to assess node importance

• e.g. in social networks: degree is a metric of social capitalhigher number of contacts → broader possibilities to spread ideas/opinions/interests and influence others

• Directed networks: in-degree and out-degree • Isolated nodes and hubs

Node Degree1 2 2 2

3 34 45 2

6 1

Core-periphery structure

● Assortative networks with localized hubs

● k-core — maximal sub-graph S containing nodes whose degree is higher than or equal to k in Svoid identifyCore(int k) { while network contains a node whose degree is < k: remove nodes whose degree is < k remaining nodes constitute k-core }

localized hubs: a k-core for a large k is a connected graph or has a giant connected component

k-core decomposition• k-cores are nested • shell-index(x) = k — x belongs to k-core, but not to (k+1)-core • Hubs with

• high shell-index: hubs connected to other hubs • low shell-index: hubs connected to low-degree nodes

Centrality metrics• Metrics to rank and identify the most important nodes/links

in the network

• Fundamental node centrality metrics originate from social network analysis • Betweenness centrality • Closeness centrality • Eigenvector centrality

• Information retrieval • centrality metrics for directed graphs inspired by

eigenvector centrality • Page rank and HITS hub and authority scores

Betweenness centrality

● A node is important if it is located on a large number of shortest paths between other nodes ○ Such node is in a position to control, maintain and influence

information flow through the network

Closeness centrality● A node is important if it is in proximity to a large number of

other nodes ● Spreading/diffusion processes: information originating at

nodes having a high closeness centrality quickly propagate through the network

Eigenvector centrality

● Recursively defined centrality: a node is important if it is directly connected to other important nodes

● EVC can be computed by successive approximations starting from a configuration in which all nodes have equal EVC

● PageRank and HITS hub/authority scores are variants of EVC for directed networks

Node similarity/distance○ Applications: community detection (hierarchical agglomerative

clustering), link prediction and identification of missing links (in the case of networks extracted from incomplete data)

● The length of the shortest path between two nodes

● Similarity based on random walks: the probability that a random walker reaches X from Y in k random walk steps

● The number of common neighbors

● The Jaccard coefficient

○ Other metrics: Adamic-Adar, Katz, personalized PageRank, cosine similarity, SimRank (recursively defined similarity)

Community structure• Community (module, node cluster)

• a subgraph that is more densely/strongly internally connected than with the rest of the network

• Automatic identification of communities — community detection algorithms

• Overlapping and non-overlapping community partitions

Network comprehension

• Santo Fortunato, 2009, “Community detection in graphs” • Agglomerative algorithms • Divisive algorithms

• Repeatedly remove links that are likely to be inter-communitarian links to form the dendrogram

• Measures indicating inter-communitarian links: edge betweenness centrality, edge clustering coefficient, edge information centrality

• Modularity-based algorithms • heuristics to maximize the modularity measure

• X — a subgraph in the network • Q(X) = the fraction of links in X - the expected fraction of links

in X under some null random network model • Dynamic algorithms

• Discovering communities by dynamical processes running on the network (e.g. label propagation)

• Method-based on statistical inference • fitting stochastic block models

Outline

● Introduction





● Conclusion

Analysis of annotated networks● Analysis of categorically induced subgraphs

● A - a categorical node attribute ● subgraphs induced by nodes having the same value of A

● Are categorically induced subgraphs strong clusters in the network? ● enriched co-authorship networks: do researchers from the same

department form a strongly cohesive research community?

● enriched ontology networks: do ontology modules conform to the “high cohesion” design principle (are concepts from a module strongly related)?

● Radicchi et al. notion of clusters in complex networks and graph clustering evaluation (GCE) metrics applied to categorically induced subgraphs in annotated networks

Radicchi et al. definitions of clusters● C - a subgraph of a network, x - a node in C ● kint(x) - the number of intra-subgraphs links incident with x, i.e.

links connecting x with other nodes in C ● weighted networks: the total weight of intra-subgraph links ● directed networks: the number intra-subgraph links emanating from x

● kext(x) - the number of inter-subgraph links incident with x, i.e. links connecting x with nodes that are not in C

GCE metrics● C - a subgraph of a network with N nodes ● NC - the number of nodes in C ● GCE metrics based on edge-cut

● EC (the size of the edge-cut of C) the total number/weight of (out-going) links connecting the nodes in C with nodes that are not in C

● IC — the total number/weight of intra-subgraph links in C

● Conductance(C) = EC / (EC + 2IC) [undirected networks] ● Conductance(C) = EC / (EC + IC) [directed networks] ● Expansion(C) = EC / NC ● Cut-ratio(C) = EC / N(N - NC) [only for unweighted networks]

● Lower values of conductance, expansion and cut-ratio indicate more cohesive subgraphs

● Conductance(C) < 0.5 → C is a Radicchi weak cluster

GCE metrics● C - a subgraph of a network, x - a node in C ● GCE metrics based on degree-fraction (DF)

● kext(x) - the number/weight of (out-going) inter-subgraph links incident with x

● D(x) - the (out-) degree/strength of x, D(x) = kint(x) + kext(x)

● DF(x) = kext(x) / D(x)

● Maximum-DF(C) = the maximum DF of nodes in C ● Average-DF(C) = the average DF of nodes in C ● Flake-DF(C) = the fraction of nodes in C for which DF(x) < D(x) / 2

(or, equivalently, kint(x) > kext(x))

● Lower values of Maximum-DF and Average-DF and higher values of Flake-DF indicate more cohesive subgraphs

● Flake-DF(C) = 1 → C is a Radicchi strong cluster

Analysis of annotated networks● Comparison of categorically induced subgraphs (CISs)

● A - a categorical node attribute ● CISs: subgraphs induced by nodes having the same value of A ● M - a numerical node attribute

● Do nodes from a CIS X tend to have higher values of M compared to nodes from a CISY? ● enriched co-authorship networks: do researchers from a

department X tend to be more productive/more central in the co-authorship network than researchers from a department Y?

● enriched ontology networks: are concepts from an ontology module X more important than concepts from an ontology module Y?

● Metric-based comparison test based on the MWU test and probabilities of superiority applied to two categorically induced subgraphs

Metric-based comparison test● X and Y - two independent subsets of nodes in a network ● Thr - a probability threshold indicating a strong stochastic dominance

● Metric-based-comparison-test(X, Y, Thr): ● for-each numeric attribute M:

● M(X) - the set of M values for X ● M(Y) - the set of M values for Y ● p = apply the MWU test to M(X) and M(Y) ● if the null hypothesis rejected (p < 0.05):

● compute probabilities of superiority PS(X) and PS(Y) ● x = a randomly selected value from M(X) ● y = a randomly selected value from M(Y) ● PS(X) = P(x > y), PS(Y) = P(y > x)

● if PS(X) > Thr or PS(Y) > Thr (default Thr = 0.75): ● report not only statistically significant differences

between X and Y regarding M, but a strong tendency of superiority

Metric-based comparison test● Metric-based comparison test can also be applied to two independent

sets of nodes determined by some structural criteria, e.g. ● highly and lowly coupled nodes

● highly coupled nodes: the minimal subset of nodes C such that

● core and periphery nodes when the network has a core-periphery structure ● core nodes: the minimal subset of nodes C such that

● nodes belonging to non-trivial strongly connected components and nodes not involved in cyclic dependencies in directed networks

∑x∈C

degree(x) > ∑y∉C

degree(y)

∑x∈C

shell-index(x) > ∑y∉C

shell-index(y)

Analysis of block models of annotated networks● P - a partition of the set of nodes into k node groups

● Block model corresponding to P ● nodes: node groups ● links: node groups A and B are connected if there is a node from A

connected to a node from B

● Block models of annotated network can be formed in two principal ways: ● according to a categorical node attribute

● e.g. a departmental collaboration network derived from an intra-institutional co-authorship network

● according to a partition obtained after community detection ● e.g. a network of research groups obtained after research

groups were identified by a community detection algorithm applied to the co-authorship network

Group superiority graphs of annotated networks

● P - a partition of the set of nodes into k node groups

● The block model corresponding to P shows connections among node groups

● Group superiority graphs (GSG) corresponding to P are directed graphs reflecting stochastic dominance among node groups with respect to numerical node attributes ● M - a numeric node attribute, A and B two node groups ● A → B in the GSG of M if nodes in A strongly tend to have

higher values of M than nodes in B ● GSGs: graphs derived from a block model according to

the metric-based comparison test

Mining attachment preferences in annotated networks● To which nodes new nodes connect when joining a network?

● N — an annotated network with k numeric attributes M1, M2, …, Mk ● Na and Nb — two successive evolutionary snapshots of N ● Transition from to Na to Nb

● New nodes — nodes in Nb not present in Na ● Nodes in Na can be divided into two categories

● Preferential nodes — nodes to which new nodes attached ● Non-preferential nodes — nodes that are not preferential

● Attachment preferences in the evolutionary transition from Na to Nb can be revealed by the metric-based comparison test ● e.g. {(M3, PREF), (M8, NON-PREF), (M12, PREF)}

● An algorithm for mining frequent itemsets (e.g. Apriori) applied to the set of attachment preferences of all evolutionary transitions

Outline

● Introduction





● Conclusions

❍❍

Research collaboration• Collaboration: key social feature of modern science • Science from a social perspective: complex self-organizing

social system • Katz: “scientific collaboration is a social process and probably

there are as many reasons for researchers to collaborate as there are reasons for people to communicate”

• Research collaboration can be studied at various levels: • intra-institutional, inter-institutional, national, international,

disciplinary, inter-disciplinary • Major research questions:

• how research collaboration is structured? • how the structure of research collaboration evolves? • how research collaboration is related to research productivity

and impact of multi-authored publications?

❍❍

Research collaboration• Research collaboration may manifest in various formal and

informal forms

• Co-authorship — the most visible and well-documented manifestation of scientific collaboration • availability of massive bibliographic databases

• Co-authorship networks — social networks encompassing researchers • Nodes — researchers • A and B are connected if A and B co-authored at least one

publication (with or without other co-authors) • Link weights — the strength of research collaboration

❍❍

Link weighting schemes• Straight scheme

• w(x, y) = the number of joint publications of x and y

• Salton’s scheme — a normalized variant of the straight scheme h(x) — the number of publications (co-)authored by x h(y) — the number of publications (co-)authored by y h(x, y) — the number of joint publications of x and y

• Newman’s scheme • More authors a paper has less weight should be added to

the total strength of research collaboration J — the set of joint publications of x and y n(k) — the number of authors of publication k

❍❍

● Case study: the FS-UNS co-authorship network ○ The network reflecting intra-institutional research collaboration at FS-

UNS (423 FS-UNS researchers from 5 departments) ○ The network extracted from bibliographic records contained in the

institutional CRIS-UNS system ● No name disambiguation problems ● Categorization of publications by the rule book prescribed by the

Serbian Ministry of Science ! Serbian research competency index metric

○ The Newman schema used to assign link weights ○ Nodes enriched with metrics quantifying different determinants of

research performance

❍❍

❍❍

Cohesiveness of research departments● All FS-UNS departments are Radicchi weak and close to Radicchi strong

clusters in the network ○ Intra-department collaborations are stronger than inter-department

collaborations for a large majority of researchers but not for all of them ● The strongest intra-department collaborations: DP and DC ● The weakest intra-department collaborations: DMI ● The most closed department: DG (the highest internal density, the lowest

conductance)

❍❍

Inter-department collaborations● Researchers involved in inter-department collaborations are

drastically more productive, collaborative and institutionally important

● The departmental collaboration networkof FS-UNS is a clique, but the strengths of inter-department collaborations are highly unbalanced(a lot of space to improve inter-departmentcollaborations)

❍❍

Metric-based comparison of departments● Kruskal-Wallis ANOVA: statistically significant differences (SSD) present

regarding SRCI and PRON, but absent regarding PROS and PROF ○ SRCI and PRON – biased measures of productivity

● SSD in both local and external collaboration

● No SSD regarding institutional importance

❍❍

Post-hoc pairwise comparison● DP and DC: superior regarding SRCI and PRON ● DMI: the lowest degree of both local and external research collaboration ● DC and DG: active stimulation of intra-institutional collaboration

❍❍

K-core decomposition• The FS-UNS co-authorship network has a strong and balanced

nested core-periphery structure • 19 cores, all of them being connected subgraphs in the network • the density of cores increases exponentially • the fraction of nodes in k-cores decreases linearly with k • Core researchers: shell-index >= 12 (32% of the total number)

❍❍

Core VS Peripheral Researchers• Core researchers are drastically more productive, collaborative and

institutionally important than peripheral researchers. • Core researchers have more significant brokerage role within their ego-

networks

❍❍

Identification or research groups

• the best performing algorithm: LV (Louvain) • the highest modularity, the lowest ratio of w(inter) and w(intra)

• agglomerative clustering techniques better than divisive

❍❍

❍❍

Research groups identified by Louvain

❍❍

Collaborations among research groups• The block model formed according to the partition of nodes obtained by

the Louvain algorithm • #nodes = 17 (research groups) • #links = 79 (collaborations between research groups), 9 strong links

❍❍

• Expected: groups that have strong collaborations tend to be more institutionally important

• The importance of a research group and the strength of inter-group research collaboration are independent of group size

❍❍

• Researchers involved in inter-group research collaborations are significantly more productive, collaborative and institutionally important

❍❍

• Comparison of research groups by analyzing group superiority graphs corresponding to productivity and collaboration metrics

• PRON and SRCI — biased measures of research productivity

❍❍

Mining attachment preferences• New FS-UNS researchers tend to attach to highly productive FS-UNS

researchers that have established a strong collaboration with their previous co-authors

Outline

● Introduction





● Conclusions

❍❍

• Ontology - a formal specification of shared and reusable knowledge • Description of concepts (classes) and roles (relationships) in a

knowledge domain through a set of axioms in a description logic • Backbone of the Semantic Web, specified in OWL

• Monolithic and modular ontology designs • monolithic — all captured concepts, roles, axioms and assertions gathered

together in one (large) OWL file • modular — an ontology that consists of multiple ontology modules, OWL

import feature

• Ontology networks — directed graphs showing dependencies between ontological entities • Ontology module networks (nodes: ontology modules, links: import relations

between modules) • Ontology class networks (nodes: classes, links: relations between classes) • Ontology subsumption network (nodes: classes, links: subsumption relations

between classes)

❍❍

Modular design principles• “Low coupling, high cohesion”

• an ontology module should be loosely coupled to other ontology modules → a low average node degree and the absence of hubs in ontology module networks

• concepts in an ontology module should be strongly coupled → concepts from the same module form highly cohesive subgraphs (strong clusters) in the ontology class network→ GCE metrics as metrics of ontology module cohesion • classification of modules as Radicchi strong, Radicchi weak and poorly

cohesive ontology modules • Existing ontology cohesion metrics estimate the cohesiveness of

ontology modules in isolation (dependencies to external classes are ignored)

• GCE metrics rely on external class dependencies taking into account also the principle of “low coupling”

❍❍

Modular design principles• “Avoid cyclic dependencies”

• Ontology modules belonging to a strongly connected component of the ontology module network are mutually (directly or indirectly) dependent

• Large cyclic dependencies negatively impact the following quality attributes: • understandability • reusability • maintainability

• Analysis of strongly connected components in enriched ontology modules networks in order to reveal characteristics of modules involved in cyclic dependencies • Metric-based comparison test

❍❍

Case study: enriched ontology module and class networks of SWEET (Semantic Web for Earth and Environmental Terminology)

❍ Module network ❍ metrics of internal complexity ❍ adopted software metrics ❍ centrality metrics ❍ Orme et al. coupling metrics ❍ Tartir et al. diversity metrics ❍ GCE metrics

❍ Class networks ❍ adopted software metrics ❍ centrality metrics

Connected component analysis○ SWEET exhibits a high degree of modular and conceptual

cohesion ○ The SWEET OMN is a weakly connected digraph

no isolated modules or small independent clusters of modules

○ The SWEET OCN is a weakly connected digraph, a giant connected component in the taxonomy of concept

○ All three examined networks are small-world networks

Cyclic dependencies○ A giant SCC in the SWEET ontology OMN

○ large cyclic dependencies among SWEET modules ○ small link reciprocity, but a large path reciprocity → cyclic dependencies among SWEET modules are mostly indirect

○ A large number of small-size SCCS in the SWEET OCN, large cyclic dependencies among SWEET classes are absent

○ Two classes involved in mutual subsumption relations

○ Metric-based comparison test to determine the differences between modules in the giant SCC and the rest of SWEET modules

○ SWEET has a strongly connected core encompassing the most reused and the most important SWEET modules

○ Degree distribution analysis: the SWEET ontology networks contain hubs (highly coupled nodes)

○ Metric-based comparison test: hubs tend to more voluminous and more functionally important modules than non-hub modules

GCE metrics as ontology metrics○ M - an ontology module within a modularized ontology ○ G(M) - a graph showing dependencies among classes in M

○ G(M) is a subgraph of the ontology class network

○ Basic ontology module cohesion metrics ignoring external dependencies ○ DEN - the density of G(M) ○ COMP - the number of weakly connected components in G(M)

○ Weak correlations → ontology cohesion metrics based solely on internal class dependencies are unable to identify modules whose constituent classes form strong clusters in the OCN

Cohesion of SWEET modules○ SWEET ontology modules has a satisfactory degree of

cohesion ○ 18 modules (8.87%) are Radicchi strong clusters ○ 195 modules (96.08%) are Radicchi weak clusters ○ only 8 modules are poorly cohesive (non-Radicchi-weak clusters) ○ poorly cohesive modules have a low centrality in the OMN

Outline

● Introduction





● Conclusions

Conclusions○ Methods to analyze annotated social and information networks

focused on categorically induced subgraphs, block models and attachment preferences

○ Case studies related to analysis annotated networks with numeric attributed being domain-dependent metrics

○ analysis of enriched co-authorship networks ○ an in-depth evaluation of research collaboration and mutual

relationships between collaboration and other determinants of research performance

○ analysis of enriched ontology networks ○ evaluation of design quality of modular ontologies with

respect to modular design principles originating from software engineering

○ More about the topics of the tutorial including presented case studies can be found in

Analysis of Annotated Social and Information Networks ...Information Networks: Methods and...

Documents

Transcript of Analysis of Annotated Social and Information Networks ...Information Networks: Methods and...