arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open...

12
The NOESIS Network-Oriented Exploration, Simulation, and Induction System ıctor Mart´ ınez, * Fernando Berzal, and Juan-Carlos Cubero Department of Computer Science and Artificial Intelligence, University of Granada, Spain Network data mining has become an important area of study due to the large number of problems it can be applied to. This paper presents NOESIS, an open source framework for network data mining that provides a large collection of network analysis techniques, including the analysis of network structural properties, community detection methods, link scoring, and link prediction, as well as network visualization algorithms. It also features a complete stand–alone graphical user interface that facilitates the use of all these techniques. The NOESIS framework has been designed using solid object–oriented design principles and structured parallel programming. As a lightweight library with minimal external dependencies and a permissive software license, NOESIS can be incorporated into other software projects. Released under a BSD license, it is available from http://noesis.ikor.org. Contents I. Introduction 1 II. The design of the NOESIS framework 2 A. System architecture 2 B. Core classes 2 C. Supported data formats 2 D. Graphical user interface 3 III. Network analysis tools 3 A. Network models 3 B. Structural properties of networks 4 C. Network visualization techniques 5 IV. Network data mining techniques 6 A. Community detection 6 B. Link scoring and prediction 8 V. Conclusion 11 Acknowledgments 11 References 11 * Electronic address: [email protected] Electronic address: [email protected] Electronic address: [email protected] I. INTRODUCTION Data mining techniques are intended to extract infor- mation from large volumes of data (Tan et al., 2006). Data mining includes tasks such as classification, regres- sion, clustering, or anomaly detection, among others. Traditional data mining techniques are typically applied to tabulated data. Novel techniques have also been de- vised for semi-structured or structured data, since ex- ploiting the relationships among instances from a dataset leads to new research and development opportunities (Getoor and Diehl, 2005). For example, network data mining has been used to predict previously unknown protein interactions in protein-protein interaction networks (Mart´ ınez et al., 2014). It has also been used to study and predict future author collaborations and tendencies in co-authorship networks (Pavlov and Ichise, 2007). Different network mining techniques are used by popular internet search engines to rank the most relevant websites (Page et al., 1999). These are only some examples of the large number of applications of network data mining. There are many software tools that facilitate the anal- ysis of networked data. Some tools provide closed so- lutions for end users who need to work with their own network data sets, whereas other tools cater to software developers as software libraries that collect network anal- ysis algorithms. Most tools allow the analysis of net- work topology and the computation of different struc- tural properties of networks having thousands or even millions of nodes. Some of them also include implemen- tations of specific network data mining techniques, such as community detection algorithms or predictive mod- els, including link prediction (L¨ u and Zhou, 2011) and epidemic models (Keeling and Eames, 2005). For instance, Pajek (Batagelj and Mrvar, 1998), NodeXL (Smith et al., 2009), Gephi (Bastian et al., 2009), and UCINET (Borgatti et al., 2002) are widely used for social network analysis (SNA). Graphviz (Ell- son et al., 2002) and Cytoscape (Shannon et al., 2003) are two well–known alternatives for network visualiza- tion. Finally, igraph (Csardi and Nepusz, 2006) and NetworkX (Schult and Swart, 2008) are two popular soft- ware libraries of network algorithms. A more comprehen- sive and up–to–date list of available software tools can be found at Wikipedia: https://en.wikipedia.org/ wiki/Social_network_analysis_software. NOESIS, whose name stands for Network–Oriented Exploration, Simulation, and Induction System, is a soft- ware framework for network analysis and mining. It tries to combine the best features of closed social network anal- ysis tools and extensible algorithm libraries, while provid- arXiv:1611.04810v2 [cs.SI] 23 Jun 2017

Transcript of arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open...

Page 1: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

The NOESIS Network-Oriented Exploration, Simulation, and InductionSystem

Vıctor Martınez,∗ Fernando Berzal,† and Juan-Carlos Cubero‡

Department of Computer Science and Artificial Intelligence, University of Granada, Spain

Network data mining has become an important area of study due to the large number of problemsit can be applied to. This paper presents NOESIS, an open source framework for network datamining that provides a large collection of network analysis techniques, including the analysis ofnetwork structural properties, community detection methods, link scoring, and link prediction,as well as network visualization algorithms. It also features a complete stand–alone graphicaluser interface that facilitates the use of all these techniques. The NOESIS framework has beendesigned using solid object–oriented design principles and structured parallel programming. As alightweight library with minimal external dependencies and a permissive software license, NOESIScan be incorporated into other software projects. Released under a BSD license, it is availablefrom http://noesis.ikor.org.

Contents

I. Introduction 1

II. The design of the NOESIS framework 2A. System architecture 2B. Core classes 2C. Supported data formats 2D. Graphical user interface 3

III. Network analysis tools 3A. Network models 3

B. Structural properties of networks 4C. Network visualization techniques 5

IV. Network data mining techniques 6A. Community detection 6B. Link scoring and prediction 8

V. Conclusion 11

Acknowledgments 11

References 11∗Electronic address: [email protected]†Electronic address: [email protected]

‡Electronic address: [email protected]

I. INTRODUCTION

Data mining techniques are intended to extract infor-mation from large volumes of data (Tan et al., 2006).Data mining includes tasks such as classification, regres-sion, clustering, or anomaly detection, among others.Traditional data mining techniques are typically appliedto tabulated data. Novel techniques have also been de-vised for semi-structured or structured data, since ex-ploiting the relationships among instances from a datasetleads to new research and development opportunities(Getoor and Diehl, 2005).

For example, network data mining has been usedto predict previously unknown protein interactions inprotein-protein interaction networks (Martınez et al.,2014). It has also been used to study and predict futureauthor collaborations and tendencies in co-authorshipnetworks (Pavlov and Ichise, 2007). Different networkmining techniques are used by popular internet searchengines to rank the most relevant websites (Page et al.,1999). These are only some examples of the large numberof applications of network data mining.

There are many software tools that facilitate the anal-ysis of networked data. Some tools provide closed so-lutions for end users who need to work with their ownnetwork data sets, whereas other tools cater to software

developers as software libraries that collect network anal-ysis algorithms. Most tools allow the analysis of net-work topology and the computation of different struc-tural properties of networks having thousands or evenmillions of nodes. Some of them also include implemen-tations of specific network data mining techniques, suchas community detection algorithms or predictive mod-els, including link prediction (Lu and Zhou, 2011) andepidemic models (Keeling and Eames, 2005).

For instance, Pajek (Batagelj and Mrvar, 1998),NodeXL (Smith et al., 2009), Gephi (Bastian et al.,2009), and UCINET (Borgatti et al., 2002) are widelyused for social network analysis (SNA). Graphviz (Ell-son et al., 2002) and Cytoscape (Shannon et al., 2003)are two well–known alternatives for network visualiza-tion. Finally, igraph (Csardi and Nepusz, 2006) andNetworkX (Schult and Swart, 2008) are two popular soft-ware libraries of network algorithms. A more comprehen-sive and up–to–date list of available software tools canbe found at Wikipedia: https://en.wikipedia.org/wiki/Social_network_analysis_software.

NOESIS, whose name stands for Network–OrientedExploration, Simulation, and Induction System, is a soft-ware framework for network analysis and mining. It triesto combine the best features of closed social network anal-ysis tools and extensible algorithm libraries, while provid-

arX

iv:1

611.

0481

0v2

[cs

.SI]

23

Jun

2017

Page 2: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

2 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

3rdDpartyapplications

GraphicalDUserDInterface

DDHAL:DHardwareDAbstractionDLayer

NOESISDAPI

Datasources

Applicationgenerator

DAL:DataDAccessDLayer

NOESISReflectiveDKernel

FIG. 1 The NOESIS framework architecture and its core sub-systems.

ing support for parallel execution, something that mostlisted tools lack. NOESIS is completely written in Javaand its source code has been released under a permissiveBSD open source license.

Our paper is structured as follows. In Section 2, theNOESIS architectural design principles are briefly de-scribed. Section 3 covers the network analysis techniquesincluded in NOESIS. Network data mining techniques aresurveyed in Section 4. Finally, Section 5 describes theNOESIS project current status and roadmap.

II. THE DESIGN OF THE NOESIS FRAMEWORK

NOESIS has been designed to be an easily–extensibleframework whose architecture provides the basis for theimplementation of network data mining techniques. Inorder to achieve this, NOESIS is designed around ab-stract interfaces and a set of core classes that provideessential functions, which allows the implementation ofdifferent features in independent components with strongcohesion and loose coupling. NOESIS components aredesigned to be maintainable and reusable.

A. System architecture

The NOESIS framework architecture and its core sub-systems are displayed in Figure 1. These subsystems aredescribed below.

The lowest-level component is the hardware abstrac-tion layer (HAL), which provides support for the execu-tion of algorithms in a parallel environment and hides im-plementation details and much of the underlying techni-cal complexity. This component provides different build-ing blocks for implementing well-studied parallel pro-gramming design patterns, such as MapReduce (Deanand Ghemawat, 2008). For example, we would just

write result = (double) Parallel.reduce( index ->x[index]* y[index], ADD, 0, SIZE-1) to compute the dot productof two vectors in parallel. The HAL does not only imple-ment structured parallel programming design patterns,but it is also responsible for task scheduling and parallelexecution. It allows the adjustment of parallel executionparameters, including the task scheduling algorithm.

The reflective kernel is at the core of NOESIS andprovides its main features. The reflective kernel pro-vides the base models (data structures) and tasks (al-gorithms) needed to perform network data mining, aswell as the corresponding meta-objects and meta-models,which can be manipulated at run time. It is the under-lying layer that supports a large collection of networkanalysis algorithms and data mining techniques, whichare described in the following section. Different typesof networks are dealt with using an unified interface, al-lowing us to choose the particular implementation thatis the most adequate for the spatial and computationalrequirements of each application. Algorithms providedby this subsystem are built on top of the HAL buildingblocks, allowing the parallelized execution of algorithmswhenever possible.

The data access layer (DAL) provides an unified in-terface to access external data sources. This subsystemallows reading and writing networks in different file for-mats, providing implementations for some of the mostimportant standardized network file formats. This mod-ule also enables the development of data access compo-nents for other kinds of data sources, such as networkstreaming.

Finally, an application generator is used to builda complete graphical user interface following a modeldriven software development (MDSD) approach. Thiscomponent provides a friendly user interface that al-lows users without programming skills to use most of theNOESIS framework features.

B. Core classes

The core classes and interfaces shown in Figure 2 pro-vide the foundation for the implementation of differenttypes of networks with specific spatial and computa-tional requirements. Basic network operations includeadding and removing nodes, adding and removing links,or querying a node neighborhood. More complex opera-tions are provided through specialized components.

NOESIS supports networks with attributes both intheir nodes and their links. These attributes are definedaccording to predefined data models, including categori-cal and numerical values, among others.

C. Supported data formats

Different file formats have been proposed for networkdatasets. Some data formats are more space efficient,

Page 3: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

II The design of the NOESIS framework 3

AttributeNetwork

Attribute

LinkAttribute

Network

NetworkReaderNetworkWriter NetworkRenderer

FIG. 2 Some of the NOESIS core classes and interfaces rep-resented as an UML class diagram.

whereas others are more easily parseable.

NOESIS supports reading and writing network datasets using the most common data formats. For example,the GDF file format is a CSV-like format used by somesoftware tools such as GUESS and Gephi. It supportsattributes in both nodes and links. Another supportedfile format is GML, which stands for Graph ModelingLanguage. GML is a hierarchical ASCII-based file for-mat. GraphML is another hierarchical file format basedon XML, the ubiquitous eXtensible Markup Languagedeveloped by the W3C.

Other file formats are supported by NOESIS, such asthe Pajek file format, which is similar to GDF, or thefile format of the datasets from the Stanford NetworkAnalysis Platform (SNAP) (Leskovec and Krevl, 2014).

D. Graphical user interface

In order to allow users without programming knowl-edge to use most of the NOESIS features, a lightweighteasy–to–use graphical user interface is included with thestandard NOESIS framework distribution. The NOESISGUI allows non–technical end users loading, visualizing,and analyzing their own network data sets by applyingall the techniques provided with NOESIS.

Some screenshots of this GUI are shown in Figure 3.A canvas is used to display the network in every mo-ment. The network can be manipulated by clicking ordragging nodes. At the top of the window, a menu givesaccess to different options and data mining algorithms.The Network menu allows loading a network from anexternal source and exporting the results using differentfile formats, as well as creating images of the currentnetwork visualization both as raster and vector graphicsimage. The View menu allows the customization of thenetwork appearance by setting specific layout algorithmsand custom visualization styles. In addition, this menuallows binding the visual properties of nodes and links totheir attributes. The Data menu allows the explorationof attributes for each node and link. Finally, the Analysismenu gives access to most of the techniques that will bedescribed in the following sections.

III. NETWORK ANALYSIS TOOLS

NOESIS is designed to ease the implementation of net-work analysis tools. It also includes reusable implemen-tations of a large collection of popular network–relatedtechniques, from graph visualization (Tamassia, 2013)and common graph algorithms, to network structuralproperties (Newman, 2010) and network formation mod-els (Jackson, 2008). The network analysis tools includedin NOESIS and the modules that implement them areintroduced in this section.

A. Network models

NOESIS implements a number of popular random net-work generation models, which are described by probabil-ity distributions or random processes. Such models havebeen found to be useful in the study and understandingof certain properties or behaviors observed in real-worldnetworks.

Among the models included in NOESIS, the Erdos-Renyi model (Erdos and Renyi, 1959) is one of the sim-plest ones. The Gilbert model (Gilbert, 1959) is similarbut a probability of existence is given for links instead.The anchored network model is also similar to the twoprevious models, with the advantage of reducing the oc-currence of isolated nodes, but at the cost of being lessthan perfectly random. Finally, the connected randommodel is a variation of the anchored model that avoidsisolated nodes.

Other models included in NOESIS exhibit specificproperties often found in real-world networks. For ex-ample, the Watts–Strogatz model (Watts and Strogatz,1998) generates networks with small-world properties,that is, low diameter and high clustering. This modelstarts by creating a ring lattice with a given number ofnodes and a given mean degree, where each node is con-nected to its nearest neighbors on both sides. In thefollowing steps, each link is rewired to a new target nodewith a given probability, avoiding self-loops and link du-plication.

Despite the small-world properties exhibited by net-works generated by the Watts–Strogatz model are closerto real world networks than those generated by modelsbased on the Erdos-Renyi approach, they still lack someimportant properties observed in real networks. TheBarabasi–Albert model (Albert and Barabasi, 2002) isanother well-known model that generates networks whosenode degree distribution follows a power law, which leadsto scale-free networks. This model is driven by a pref-erential attachment process, where new nodes are addedand connected to existing nodes with a probability pro-portional to their current degree. Another model withvery similar properties to Barabasi–Albert’s model is thePrice’s citation model (Newman, 2003).

In addition to random network models, a number ofregular network models are included in NOESIS. These

Page 4: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

4 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

FIG. 3 Different screenshots of the NOESIS graphical user interface.

FIG. 4 Random networks generated using the Erdos-Renyi model (left), the Watts–Strogatz model (center), and the Barabasi–Albert model (right).

models generate synthetic networks that are useful in theprocess of testing new algorithms. The networks regularmodels include complete networks, where all nodes areinterconnected; star networks, where all nodes are con-nected to a single hub node; ring networks, where eachnode is connected to its closest two neighbors along aring; tandem networks, like ring model but without clos-ing the loop; mesh network, where nodes are arranged inrows and columns, and connected only to their adjacentnodes; toruses, meshes where nodes in the extremes ofthe mesh are connected; hypercubes; binary trees; andisolates, a network without links.

B. Structural properties of networks

Network structural properties allow the quantificationof features or behaviors present in the network. Theycan be used, for instance, to measure network robustnessor reveal important nodes and links. NOESIS consid-ers three types of structural properties: node properties,node pair properties (for pairs both with and withoutlinks among them), and global properties.

NOESIS provides a large number of techniques for an-alyzing network structural properties. Many structuralproperties can be computed for nodes. For example, in-degree and out-degree, indicate the number of incomingand outgoing links, respectively. Related to node degree,two techniques to measure node degree assortativity have

Page 5: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

III Network analysis tools 5

been included: biased (Piraveenan et al., 2008) and unbi-ased (Piraveenan et al., 2010) node degree assortativity.Node assortativity is a score between −1 and 1 that mea-sures the degree correlation between pairs of connectednodes. The clustering coefficient can also be computedfor nodes. The clustering coefficient of a node is the frac-tion of its neighbors that are also connected among them.

Reachability scores are centrality measures that allowthe analysis of how easy it is to reach a node from othernodes. The eccentricity of a node is defined as the max-imum distance to any other node (Hage and Harary,1995). The closeness, however, is the inverse of the sumof the distance from a given node to all others (Bavelas,1950). An adjusted closeness value that normalizes thecloseness according to the number of reachable nodes canalso be used. Inversely to closeness, average path lengthis defined as the mean distance of all shortest paths toany other node. Decay is yet another reachability score,computed as the summation of a delta factor powered bythe path length to any other node (Jackson, 2008). It isinteresting to note that with a delta factor close to 0, themeasure becomes the degree of the node, whereas with adelta close to 1, the measure becomes the component sizeof the component the node is located at. A normalizeddecay score is also available.

Betweenness, as reachability, is another way to mea-sure node centrality. Betweenness, also known as Free-man’s betweenness, is a score computed as the count ofshortest paths the node is involved in (Freeman, 1977).Since this score ranges from 2n − 1 to n2 − (n − 1) forn the number of nodes in strongly-connected networks, anormalized variant is typically used.

Finally influence algorithms provide a different per-spective on node centrality. These techniques measurethe power of each node to affect others. The most popu-lar influence algorithm is PageRank (Page et al., 1999),since it is used by the Google search engine. PageR-ank computes a probability distribution based on thelikelihood of reaching a node starting from any othernode. The algorithm works by iteratively updating nodeprobability based on direct neighbors probabilities, whichleads to convergence if the network satisfies certain prop-erties. A similar algorithm is HITS (Kleinberg, 1999),which stands for hyperlink-induced topic search. It fol-lows an iterative approach, as PageRank, but computestwo scores per node: the hub, which is a score relatedto how many nodes a particular node links, and the au-thority, which is a score related to how many hubs linka particular node. Both scores are connected by an it-erative updating process: authority is updated accord-ing to the hub scores of nodes connected by incominglinks and hub is updated according to authority scores ofnodes connected by outgoing links. Eigenvector central-ity is another iterative method closely related to PageR-ank, where nodes are assigned a centrality score basedon the summation of the centrality of their neighborsnodes. Katz centrality considers all possible paths, butpenalizes long ones using a given damping factor (Katz,

1953). Finally, diffusion centrality (Kang et al., 2012)is another influence algorithm based on Katz centrality.The main difference is that, while Katz considers infinitelength paths, diffusion centrality considers only paths ofa given limited length.

In the following example, we show how to load a net-work from a data file and compute its structural proper-ties using NOESIS, its PageRank scores in particular:

FileReader fileReader =new FileReader("karate.gml");

NetworkReader reader =new GMLNetworkReader(fileReader);

Network network = reader.read();PageRank task = new PageRank(network);NodeScore score = task.call();

Different structural properties for links can also becomputed by NOESIS. For example, link betweenness,which is the count of shortest paths the link is involvedin, or link rays, which is the number of possible pathsbetween two nodes that cross a given link. Some of theseproperties are used by different network data mining al-gorithms.

C. Network visualization techniques

Humans are still better than machines at the recogni-tion of certain patterns when analyzing data in a visualway. Network visualization is a complex task since net-works tend to be huge, with thousands nodes and links.NOESIS enables the visualization of networks by provid-ing the functionality needed to render the network andexport the resulting visualization using different imagefile formats.

NOESIS provides different automatic graph lay-out techniques, such as the well–known Fruchterman–Reingold (Fruchterman and Reingold, 1991) andKamada–Kawai (Kamada and Kawai, 1989) force–basedlayout algorithms. Force–based layout algorithms assignforces among pairs of nodes and solve the system to reachan equilibrium point, which usually leads to an aestheticvisualization.

Hierarchical layouts (Tamassia, 2013), which arrangenodes in layers trying to minimize edge crossing, are alsoincluded. Different radial layout algorithms are includedas well (Wills, 1999). These layouts are similar to the hi-erarchical ones, but arrange nodes in concentric circles.Finally, several regular layouts are included. These lay-outs are common for visualizing regular networks, suchas meshes or stars.

NOESIS allows tuning the network visualization lookand feel. The visual properties of nodes and links canbe customized, including color, size, borders, and so on.In addition, visual properties can be bound to static ordynamic properties of the network. For example, nodesizes can be bound to a specific centrality score, allowingthe visual display of quantitative information.

Page 6: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

6 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

FIG. 5 A dolphin social network (Lusseau et al., 2003) represented using different network visualization algorithms: randomlayout (top left), Kamada–Kawai layout (top right), Fruchterman–Reingold layout (bottom left), and circular layout usingaverage path length (bottom right).

IV. NETWORK DATA MINING TECHNIQUES

Network data mining techniques exist for both un-supervised and supervised settings. NOESIS includesa wide array of community detection methods (Lanci-chinetti and Fortunato, 2009) and link prediction tech-niques (Liben-Nowell and Kleinberg, 2007). These algo-rithms are briefly described below.

A. Community detection

Community detection can be defined as the task offinding groups of densely connected nodes. A wide rangeof community detection algorithms have been proposed,exhibiting different pros and cons. NOESIS features dif-ferent families of community detection techniques andimplements more than ten popular community detectionalgorithms. The included algorithms, their time com-plexity, and their bibliographic references are shown inTable I.

NOESIS provides hierarchical clustering algorithms.Agglomerative hierarchical clustering treats each node

Page 7: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

IV Network data mining techniques 7

FIG. 6 Community detection methods applied to Zachary’s karate club network (Zachary, 1977): Fast greedy partitioning(top left), Kernighan-Lin bi-partitioning (top right), average-link hierarchical partitioning (bottom left), and complete-linkhierarchical partitioning (bottom right).

as a cluster, and then iteratively merges clusters until allnodes are in the same cluster (Fortunato, 2010). Differentstrategies for the selection of clusters to merge have beenimplemented, including single-link (Sibson, 1973), whichselects the two clusters with the smallest minimum pair-wise distance; complete-link (Defays, 1977), which selectsthe two clusters with the smallest maximum pairwise dis-tance; and average-link (Liu, 2011), which selects the twoclusters with the smallest average pairwise distance.

Modularity-based techniques are also available in ourframework. Modularity is a score that measures thestrength of particular division into modules of a givennetwork. Modularity–based techniques search for com-

munities by attempting to maximize their modularityscore (Newman and Girvan, 2004). Different greedystrategies, including fast greedy (Newman, 2004) andmulti-step greedy (Schuetz and Caflisch, 2008), are avail-able. These greedy algorithms merge pairs of clustersthat maximize the resulting modularity, until all possi-ble merges would reduce the network modularity.

Partitional clustering is another common approach.Partitioning clustering decomposes the network and per-forms an iterative relocation of nodes between clusters.For example, Kernighan-Lin bi-partitioning (Kernighanand Lin, 1970) starts with an arbitrary partition in twoclusters. Then, iteratively exchanges nodes between both

Page 8: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

8 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

Type Name Time complexity Reference

HierarchicalSingle-link (SLINK) O(v2) (Sibson, 1973)

Complete-link (CLINK) O(v2 log v) (Defays, 1977)

Average-link (UMPGA) O(v2 log v) (Liu, 2011)

ModularityFast greedy O(kvd log v) (Newman, 2004)

Multi-step greedy O(kvd log v) (Schuetz and Caflisch, 2008)

PartitionalKernighan-Lin bi-partitioning O(v2 log v) (Kernighan and Lin, 1970)

K-means O(kvd) (MacQueen et al., 1967)

SpectralRatio cut algorithm (EIG1) O(v3) (Hagen and Kahng, 1992)

Jordan and Weiss NG algorithm (KNSC1) O(v3) (Ng et al., 2002)

Spectral k-means O(v3) (Shi and Malik, 2000)

Overlapping BigClam O(v2) (Yang and Leskovec, 2013)

TABLE I

Computational time complexity and bibliographic references for the community detection techniques provided by NOESIS. Inthe time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and k is the desired

number of clusters.

clusters to minimize the number of links between them.This approach can be applied multiple times to subdi-vide the obtained clusters. K-means community detec-tion (MacQueen et al., 1967) is an application of the tra-ditional k-means clustering algorithm to networks andanother prominent example of partitioning communitydetection.

Spectral community detection (Fortunato, 2010) is an-other family of community detection techniques includedin NOESIS. These techniques use the Laplacian represen-tation of the network, which is a network representationcomputed by subtracting the adjacency matrix of the net-work to a diagonal matrix where each diagonal elementis equal to the degree of the corresponding node. Then,the eigenvectors of the Laplacian representation of thenetwork are computed. NOESIS includes the ratio cutalgorithm (EIG1) (Hagen and Kahng, 1992), the Jordanand Weiss NG algorithm (KNSC1) (Ng et al., 2002), andspectral k-means (Shi and Malik, 2000).

Finally, the BigClam overlapping community detectoris also available in NOESIS (Yang and Leskovec, 2013).In this algorithm, each node has a profile, which consistsin a score between 0 and 1 for each cluster that is pro-portional to the likelihood of the node belonging to thatcluster. Also, a score between pairs of nodes is definedyielding values proportional to their clustering assign-ment overlap. The algorithm iteratively optimizes eachnode profile to maximize the value between connectednodes and minimize the value among unconnected nodes.

In the following example, we show how to load a net-work from a data file and detect communities with theKNSC1 algorithm using NOESIS:

FileReader fileReader =new FileReader("mynetwork.net");

NetworkReader reader =new PajekNetworkReader(fileReader);

Network network = reader.read();CommunityDetector task =

new NJWCommunityDetector(network);Matrix results = task.call();

B. Link scoring and prediction

Link scoring and link prediction are two closely relatedtasks. On the one hand, link scoring aims to compute avalue or weight for a link according to a specific criteria.Most link scoring techniques obtain this value by consid-ering the overlap or relationship between the neighbor-hood of the nodes at both ends of the link. On the otherhand, link prediction computes a value, weight, or prob-ability proportional to the likelihood of the existence of acertain link according to a given model of link formation.

The NOESIS framework provides a large collection ofmethods for link scoring and link prediction, from localmethods, which only consider the direct neighborhood ofnodes, to global methods, which consider the whole net-work topology. As the amount of information consideredis increased, the computational and spatial complexity ofthe techniques also increases. The link scoring and pre-diction methods available in NOESIS are shown in TableII.

Among local methods, the most basic technique is thecommon neighbors score (Newman, 2001), which is equalto the number of shared neighbors between a pair ofnodes. Most techniques are variations of the commonneighbors score. For example, the Adamic–Adar score(Adamic and Adar, 2003) is the sum of one divided bythe logarithm of the degree of each shared node. Theresource–allocation index (Zhou et al., 2009) follows thesame expression, but directly considers the degree in-stead of the logarithm of the degree. The adaptive de-gree penalization score (Martınez et al., 2016) also followsthe same approach, yet automatically determines an ad-equate degree penalization by considering properties ofthe network topology. Other local measures consider the

Page 9: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

IV Network data mining techniques 9

Type Name Time complexity Reference

Local

Common Neighbors count O(vd3) (Newman, 2001)

Adamic–Adar score O(vd3) (Adamic and Adar, 2003)

Resource–allocation index O(vd3) (Zhou et al., 2009)

Adaptive degree penalization score O(vd3) (Martınez et al., 2016)

Jaccard score O(vd3) (Jaccard, 1901)

Leicht-Holme-Newman score O(vd3) (Leicht et al., 2006)

Salton score O(vd3) (Salton and McGill, 1986)

Sorensen score O(vd3) (Sørensen, 1948)

Hub promoted index O(vd3) (Ravasz et al., 2002)

hub depressed index O(vd3) (Ravasz et al., 2002)

Preferential attachment score O(vd2) (Barabasi and Albert, 1999)

Global

Katz index O(v3) (Katz, 1953)

Leicht-Holme-Newman score O(cv2d) (Leicht et al., 2006)

Random walk O(cv2d) (Pearson, 1905)

Random walk with restart O(cv2d) (Tong et al., 2006)

Flow propagation O(cv2d) (Vanunu and Sharan, 2008)

Pseudoinverse Laplacian score O(v3) (Fouss et al., 2007)

Average commute time score O(v3) (Fouss et al., 2007)

Random forest kernel index O(v3) (Chebotarev and Shamis, 2006)

TABLE II

Computational time complexity and bibliographic references for the link scoring and prediction methods provided byNOESIS. In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and c

refers to the number of iterations required by iterative global link prediction methods.

number of shared neighbors, but normalize their valueaccording to certain criteria. For example, the Jaccardscore (Jaccard, 1901) normalizes the number of sharedneighbors by the total number of neighbors. The localLeicht-Holme-Newman score (Leicht et al., 2006) normal-izes the count of shared neighbors by the product of bothneighborhoods sizes. Similarly, the Salton score (Saltonand McGill, 1986) also normalizes, this time using thesquare root of the product of both node degrees. TheSorensen score (Sørensen, 1948) considers the double ofthe count of shared neighbors normalized by the sumof both neighbors size. The hub promoted and hub de-pressed scores (Ravasz et al., 2002) normalize the countof shared neighbors by the minimum and the maximumof both nodes degree respectively. Finally, the preferen-tial attachment score (Barabasi and Albert, 1999) onlyconsiders the product of both node degrees.

Global link scoring and prediction methods are morecomplex than local methods. For example, the Katz score(Katz, 1953) sums the influence of all possible paths be-tween two nodes, incrementally penalizing paths by theirlength according to a given damping factor. The globalLeicht-Holme-Newman score (Leicht et al., 2006) is quitesimilar to the Katz score, but resorts to the dominanteigenvalue to compute the final result.

Random walk techniques simulate a Markov chain ofrandomly-selected nodes (Pearson, 1905). The idea isthat, starting from a seed node and randomly movingthrough links, we can obtain a probability vector where

each element corresponds to the probability of reachingeach node. The classical random walk iteratively mul-tiplies the probability vector by the transition matrix,which is the row-normalized version of the adjacency ma-trix, until convergence. An interesting variant is the ran-dom walk with restart (Tong et al., 2006), which mod-els the possibility of returning to the seed node with agiven probability. Flow propagation is another variantof random walk (Vanunu and Sharan, 2008), where thetransition matrix is computed by performing both rowand column normalization of the adjacency matrix.

Finally, some spectral techniques are also available inNOESIS. Spectral techniques, as we mentioned when dis-cussing community detection methods, are based on theLaplacian matrix. The pseudoinverse Laplacian score(Fouss et al., 2007) is the inner product of the rows ofthe corresponding pair of nodes from the Laplacian ma-trix. Other spectral technique is the average commutetime (Fouss et al., 2007), which is defined as the averagenumber of steps that a random walker starting from aparticular node takes to reach another node for the firsttime and go back to the initial node. Despite it modelsa random walk process, it is considered to be a spectraltechnique because it is usually computed in terms of theLaplacian matrix. Given the Laplacian matrix, it can becomputed as the diagonal element of the starting nodeplus the diagonal element of the ending node, minus twotimes the element located in the row of the first node andthe column of the second one.

Page 10: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

10 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

FIG. 7 Different link scoring techniques applied to Les Miserables coappearance network (Knuth, 2009): common neighbors(top left), preferential attachment score (top right), Sorensen score (bottom left), and Katz index (bottom right). Link widthin the figure is proportional to the link score.

Finally, the random forest kernel score (Chebotarevand Shamis, 2006) is a global technique based on theconcept of spanning tree, i.e. a connected undirectedsub-network with no cycles that includes all the nodesand some or all the links of the network. The matrix-treetheorem states that the number of spanning trees in thenetwork is equal to any cofactor, which is a determinantobtained by removing the row and column of the givennode, of an entry of its Laplacian representation. As aresult of this, the inverse of the sum of the identity matrixand the Laplacian matrix gives us a matrix that can beinterpreted as a measure of accessibility between pairs ofnodes.

Using network data mining algorithms in NOESIS issimple. In the following code snippet, we show how togenerate a Barabsi-Albert preferential attachment net-work with 100 nodes and 1000 links, and then computethe Resource Allocation score for each pair of nodes usingNOESIS:

Network network =new BarabasiAlbertNetwork(100, 1000);

LinkPredictionScore method =new ResourceAllocationScore(network);

Matrix result = method.call();

Page 11: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

V Conclusion 11

V. CONCLUSION

Currently, the NOESIS project comprises more thanthirty five thousand lines of code organized in hundredsof classes and dozens of packages. NOESIS relies on a li-brary of reusable components that, with more than fortythousand lines of Java code, provide a customizable col-lection framework, support for the execution of parallelalgorithms, mathematical routines, and the model-drivenapplication generator used to build the NOESIS graphi-cal user interface.

NOESIS can ease the development of applications thatinvolve the analysis of any kind of data susceptible ofbeing represented as a network. NOESIS provides anefficient, scalable, and developer–friendly framework fornetwork data mining, released under a permissive Berke-ley Software Distribution free software license. Ourframework can be downloaded from its official websiteat http://noesis.ikor.org.

Acknowledgments

The NOESIS project is partially supported by theSpanish Ministry of Economy and the European RegionalDevelopment Fund (FEDER), under grant TIN2012–36951, and the Spanish Ministry of Education under theprogram “Ayudas para contratos predoctorales para laformacion de doctores 2013” (grant BES–2013–064699).We are grateful to Aaron Rosas, Francisco–Javier Gijon,and Julio–Omar Palacio for their contributions to the im-plementation of community detection methods in NOE-SIS.

References

Adamic, L. A. and Adar, E. (2003). Friends and neighborson the web. Social Networks, 25(3):211–230.

Albert, R. and Barabasi, A.-L. (2002). Statistical mechanicsof complex networks. Reviews of Modern Physics, 74(1):47.

Barabasi, A.-L. and Albert, R. (1999). Emergence of scalingin random networks. Science, 286(5439):509–512.

Bastian, M., Heymann, S., Jacomy, M., et al. (2009). Gephi:an open source software for exploring and manipulatingnetworks. International AAAI Conference on Weblogs andSocial Media, 8:361–362.

Batagelj, V. and Mrvar, A. (1998). Pajek-program for largenetwork analysis. Connections, 21(2):47–57.

Bavelas, A. (1950). Communication patterns in task-orientedgroups. The Journal of the Acoustical Society of America,22(6):725–730.

Borgatti, S. P., Everett, M. G., and Freeman, L. C. (2002).UCINET for Windows: Software for social network analy-sis. Technical report, Analytic Technologies.

Chebotarev, P. and Shamis, E. (2006). Matrix-forest theo-rems. arXiv preprint math/0602575.

Csardi, G. and Nepusz, T. (2006). The igraph software pack-age for complex network research. International Journal ofComplex Systems, 1695(5):1–9.

Dean, J. and Ghemawat, S. (2008). MapReduce: simplifieddata processing on large clusters. Communications of theACM, 51(1):107–113.

Defays, D. (1977). An efficient algorithm for a complete linkmethod. The Computer Journal, 20(4):364–366.

Ellson, J., Gansner, E., Koutsofios, L., North, S. C., andWoodhull, G. (2002). Graphviz - open source graph draw-ing tools. In Graph Drawing, pages 483–484. Springer.

Erdos, P. and Renyi, A. (1959). On random graphs. Publica-tiones Mathematicae Debrecen, 6:290–297.

Fortunato, S. (2010). Community detection in graphs. PhysicsReports, 486(3):75–174.

Fouss, F., Pirotte, A., Renders, J.-M., and Saerens, M. (2007).Random-walk computation of similarities between nodes ofa graph with application to collaborative recommendation.IEEE Transactions on Knowledge and Data Engineering,19(3):355–369.

Freeman, L. C. (1977). A set of measures of centrality basedon betweenness. Sociometry, pages 35–41.

Fruchterman, T. M. and Reingold, E. M. (1991). Graph draw-ing by force-directed placement. Software: Practice andExperience, 21(11):1129–1164.

Getoor, L. and Diehl, C. P. (2005). Link mining: a survey.ACM SIGKDD Explorations Newsletter, 7(2):3–12.

Gilbert, E. N. (1959). Random graphs. The Annals of Math-ematical Statistics, 30(4):1141–1144.

Hage, P. and Harary, F. (1995). Eccentricity and centralityin networks. Social Networks, 17(1):57–63.

Hagen, L. and Kahng, A. B. (1992). New spectral methods forratio cut partitioning and clustering. IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems,11(9):1074–1085.

Jaccard, P. (1901). Etude comparative de la distribution flo-rale dans une portion des alpes et des jura. Bulletin de laSociete Vaudoise des Sciences Naturelles, 37:547–579.

Jackson, M. O. (2008). Social and Economic Networks.Princeton University Press, Princeton, NJ, USA.

Kamada, T. and Kawai, S. (1989). An algorithm for drawinggeneral undirected graphs. Information Processing Letters,31(1):7–15.

Kang, C., Molinaro, C., Kraus, S., Shavitt, Y., and Subrah-manian, V. (2012). Diffusion centrality in social networks.In Proceedings of the 2012 International Conference on Ad-vances in Social Networks Analysis and Mining (ASONAM2012), pages 558–564. IEEE Computer Society.

Katz, L. (1953). A new status index derived from sociometricanalysis. Psychometrika, 18(1):39–43.

Keeling, M. J. and Eames, K. T. (2005). Networks andepidemic models. Journal of the Royal Society Interface,2(4):295–307.

Kernighan, B. W. and Lin, S. (1970). An efficient heuristicprocedure for partitioning graphs. Bell System TechnicalJournal, 49(2):291–307.

Kleinberg, J. M. (1999). Authoritative sources in a hy-perlinked environment. Journal of the ACM (JACM),46(5):604–632.

Knuth, D. E. (2009). The Stanford GraphBase: A Plat-form for Combinatorial Computing. Addison-Wesley Pro-fessional, 1st edition.

Lancichinetti, A. and Fortunato, S. (2009). Community detec-tion algorithms: a comparative analysis. Physical ReviewE, 80(5):056117.

Leicht, E. A., Holme, P., and Newman, M. E. (2006). Vertex

Page 12: arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 · it can be applied to. This paper presents NOESIS, an open source framework for network data ... designed using solid object{oriented design

12 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

similarity in networks. Physical Review E, 73(2):026120.Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford

large network dataset collection. http://snap.stanford.

edu/data.Liben-Nowell, D. and Kleinberg, J. (2007). The link-

prediction problem for social networks. Journal of theAmerican Society for Information Science and Technology,58(7):1019–1031.

Liu, B. (2011). Web Data Mining, 2 ed. Berlin, Germany:Springer Berlin Heidelberg.

Lu, L. and Zhou, T. (2011). Link prediction in complex net-works: A survey. Physica A: Statistical Mechanics and itsApplications, 390(6):1150–1170.

Lusseau, D., Schneider, K., Boisseau, O. J., Haase, P.,Slooten, E., and Dawson, S. M. (2003). The bottlenosedolphin community of doubtful sound features a large pro-portion of long-lasting associations. Behavioral Ecology andSociobiology, 54(4):396–405.

MacQueen, J. et al. (1967). Some methods for classificationand analysis of multivariate observations. In Proceedings ofthe Fifth Berkeley Symposium on Mathematical Statisticsand Probability, number 14 in 1, pages 281–297. Oakland,CA, USA.

Martınez, V., Berzal, F., and Cubero, J.-C. (2016). Adaptivedegree penalization for link prediction. Journal of Compu-tational Science, 13:1–9.

Martınez, V., Cano, C., and Blanco, A. (2014). Prophnet: Ageneric prioritization method through propagation of infor-mation. BMC Bioinformatics, 15(Suppl 1):S5.

Newman, M. (2010). Networks: An Introduction. OxfordUniversity Press.

Newman, M. E. (2001). Clustering and preferential at-tachment in growing networks. Physical Review E,64(2):025102.

Newman, M. E. (2003). The structure and function of com-plex networks. Society for Industrial and Applied Mathe-matics Review, 45(2):167–256.

Newman, M. E. (2004). Fast algorithm for detectingcommunity structure in networks. Physical Review E,69(6):066133.

Newman, M. E. and Girvan, M. (2004). Finding and evaluat-ing community structure in networks. Physical Review E,69(2):026113.

Ng, A. Y., Jordan, M. I., Weiss, Y., et al. (2002). On spectralclustering: Analysis and an algorithm. Advances in NeuralInformation Processing Systems, 2:849–856.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999).The pagerank citation ranking: Bringing order to the web.Technical report, Stanford InfoLab.

Pavlov, M. and Ichise, R. (2007). Finding experts by linkprediction in co-authorship networks. Finding Experts onthe Web with Semantics Workshop, 290:42–55.

Pearson, K. (1905). The problem of the random walk. Nature,72:342.

Piraveenan, M., Prokopenko, M., and Zomaya, A. (2008). Lo-cal assortativeness in scale-free networks. Europhysics Let-ters, 84(2):28002.

Piraveenan, M., Prokopenko, M., and Zomaya, A. Y. (2010).Classifying complex networks using unbiased local assorta-tivity. In ALIFE, pages 329–336.

Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and

Barabasi, A.-L. (2002). Hierarchical organization of mod-ularity in metabolic networks. Science, 297(5586):1551–1555.

Salton, G. and McGill, M. J. (1986). Introduction to ModernInformation Retrieval. McGraw-Hill, Inc.

Schuetz, P. and Caflisch, A. (2008). Efficient modularity op-timization by multistep greedy algorithm and vertex moverrefinement. Physical Review E, 77(4):046112.

Schult, D. A. and Swart, P. (2008). Exploring network struc-ture, dynamics, and function using NetworkX. In Proceed-ings of the 7th Python in Science Conferences (SciPy 2008),volume 2008, pages 11–16.

Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang,J. T., Ramage, D., Amin, N., Schwikowski, B., andIdeker, T. (2003). Cytoscape: a software environmentfor integrated models of biomolecular interaction networks.Genome Research, 13(11):2498–2504.

Shi, J. and Malik, J. (2000). Normalized cuts and image seg-mentation. Transactions on Pattern Analysis and MachineIntelligence, 22(8):888–905.

Sibson, R. (1973). Slink: an optimally efficient algorithmfor the single-link cluster method. The Computer Journal,16(1):30–34.

Smith, M. A., Shneiderman, B., Milic-Frayling, N.,Mendes Rodrigues, E., Barash, V., Dunne, C., Capone, T.,Perer, A., and Gleave, E. (2009). Analyzing (social media)networks with NodeXL. In Proceedings of the Fourth In-ternational Conference on Communities and Technologies,pages 255–264. ACM.

Sørensen, T. (1948). A method of establishing groups of equalamplitude in plant sociology based on similarity of speciesand its application to analyses of the vegetation on danishcommons. Biologiske Skrifter, 5:1–34.

Tamassia, R. (2013). Handbook of Graph Drawing and Visu-alization. CRC Press.

Tan, P.-N., Steinbach, M., Kumar, V., et al. (2006). Intro-duction to data mining. Pearson, Addison Wesley, Boston.

Tong, H., Faloutsos, C., and Pan, J.-Y. (2006). Fast randomwalk with restart and its applications. In Proceedings ofthe Sixth International Conference on Data Mining, ICDM’06, pages 613–622. IEEE Computer Society.

Vanunu, O. and Sharan, R. (2008). A propagation-based al-gorithm for inferring gene-disease assocations. In GermanConference on Bioinformatics, pages 54–52.

Watts, D. J. and Strogatz, S. H. (1998). Collective dynamicsof ‘small-world’ networks. Nature, 393(6684):440–442.

Wills, G. J. (1999). Nicheworksinteractive visualization ofvery large graphs. Journal of Computational and GraphicalStatistics, 8(2):190–212.

Yang, J. and Leskovec, J. (2013). Overlapping communitydetection at scale: a nonnegative matrix factorization ap-proach. In Proceedings of the Sixth ACM InternationalConference on Web Search and Data Mining, pages 587–596. ACM.

Zachary, W. W. (1977). An information flow model for con-flict and fission in small groups. Journal of AnthropologicalResearch, pages 452–473.

Zhou, T., Lu, L., and Zhang, Y.-C. (2009). Predicting missinglinks via local information. The European Physical JournalB, 71(4):623–630.