Network analysis

Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576
sroy@biostat.wisc.edu
Dec 3rd, 2013


Key concepts

• Network measures
  – Degree
  – Degree distribution
  – Average path length and shortest path length
  – Clustering coefficient
  – Modularity
  – Network motifs
  – Centrality measures
• Network models
  – Random networks
  – Scale-free networks


Directed and undirected networks

[Figure: an undirected network and a directed network, each drawn on nodes A to F, with labels marking a vertex/node, an edge, and a directed edge.]


Node degree

• Undirected network
  – Degree, k: number of neighbors of a node
• Directed network
  – In-degree, k_in: number of incoming edges
  – Out-degree, k_out: number of outgoing edges
• Average degree (undirected network)

[Figure: a directed network on nodes A to F. The in-degree of F is 4; the out-degree of E is 1.]
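A minimal sketch of counting in-degree and out-degree from a directed edge list. The edge list is a made-up example, not the network in the figure, and `degree_counts` is a hypothetical helper name.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for directed (source, target) edges."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1  # the edge leaves `source`
        in_degree[target] += 1   # the edge enters `target`
    return in_degree, out_degree

# Hypothetical directed edges (not taken from the slide's figure).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
in_degree, out_degree = degree_counts(edges)
print("in-degree of D:", in_degree["D"])
print("out-degree of A:", out_degree["A"])
```

A `Counter` returns 0 for nodes that never appear as a source or target, so nodes with no incoming (or outgoing) edges need no special handling.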


Average degree

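• Consider an undirected network with N nodes and L edges
• Let $k_i$ denote the degree of node i
• Average degree is $\langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i$
• Average degree is equivalently defined as $\langle k \rangle = \frac{2L}{N}$, since each edge adds one to the degree of two nodes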
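A small check that the two definitions of average degree agree, sketched on a made-up undirected edge list. `average_degree` is a hypothetical helper; it only sees nodes that appear in at least one edge, so isolated nodes would have to be added separately.

```python
from collections import defaultdict

def average_degree(edges):
    """Average degree of an undirected network given as a list of (u, v) edges."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Sum of node degrees divided by the number of nodes.
    return sum(len(nbrs) for nbrs in neighbors.values()) / len(neighbors)

# Hypothetical network: N = 4 nodes, L = 4 edges.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
N, L = 4, 4
avg_from_degrees = average_degree(edges)  # (2 + 2 + 3 + 1) / 4
avg_from_edges = 2 * L / N
print(avg_from_degrees, avg_from_edges)
```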


Degree distribution

• P(k) gives the probability that a randomly selected node has degree k (that is, k edges)

• Different networks can have different degree distributions

• A fundamental property that can be used to characterize a network
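A minimal sketch of the empirical degree distribution: P(k) is estimated as the fraction of nodes with degree k. The degree list is invented for illustration and `degree_distribution` is a hypothetical helper name.

```python
from collections import Counter

def degree_distribution(degrees):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: count / n for k, count in sorted(counts.items())}

# Hypothetical degrees of 8 nodes.
degrees = [1, 1, 2, 2, 2, 3, 5, 1]
p_k = degree_distribution(degrees)
for k, prob in p_k.items():
    print(k, prob)
```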


Different degree distributions

• Poisson distribution
  – The mean is a good representation of the degree $k_i$ of all nodes
  – Exhibited in Erdős-Rényi networks
• Power law distribution
  – Also called scale-free
  – There is no “typical” node that captures the degree of nodes.


Poisson distribution

• A discrete distribution: $P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$
• The Poisson is parameterized by $\lambda$, which can be easily estimated by maximum likelihood

[Figure: plot of P(X=k) against k.]
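A minimal sketch of the Poisson probability mass function and its maximum likelihood estimate, which for the Poisson is the sample mean. The degree sample is invented; `poisson_pmf` and `poisson_mle` are hypothetical helper names.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def poisson_mle(samples):
    """Maximum likelihood estimate of the Poisson rate: the sample mean."""
    return sum(samples) / len(samples)

# Hypothetical node degrees.
degrees = [2, 3, 3, 4, 4, 4, 5, 5, 6]
lam_hat = poisson_mle(degrees)
print("estimated rate:", lam_hat)
print("P(X = 4):", poisson_pmf(4, lam_hat))
```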


Power law distribution

• Used to capture the degree distribution of many biological and other real-world networks: $P(k) \propto k^{-\gamma}$
• Typical value of $\gamma$ is between 2 and 3.
• An MLE exists for $\gamma$ but is more complicated
  – See “Power-Law Distributions in Empirical Data”, Clauset, Shalizi and Newman, 2009 for details

[Figure: plot of P(k).]
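A rough sketch of estimating the power law exponent. It uses the closed-form approximation for discrete data, $\hat{\gamma} \approx 1 + n \left[ \sum_i \ln \frac{k_i}{k_{min} - 1/2} \right]^{-1}$, which is how the estimator in the Clauset, Shalizi and Newman paper is usually quoted; treat the formula and the choice of `k_min` as assumptions to check against the paper. `power_law_exponent` is a hypothetical helper name.

```python
import math

def power_law_exponent(degrees, k_min=1):
    """Approximate MLE of gamma for P(k) ~ k^-gamma, using degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    # Sum of log ratios; the 0.5 shift is the discrete-data correction.
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical heavy-tailed degree sample.
degrees = [1] * 20 + [2] * 5 + [3] * 2 + [10, 30]
gamma_hat = power_law_exponent(degrees, k_min=1)
print("estimated exponent:", gamma_hat)
```

The approximation is generally considered more reliable for larger `k_min`; the exact discrete MLE requires numerical optimization.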


Erdos Renyi random graphs

• Dates back to 1960 and is due to two mathematicians, Paul Erdős and Alfréd Rényi.
• Provides a probabilistic model to generate a graph
• Starts with N nodes and connects each pair of nodes with probability p
• Node degrees follow a Poisson distribution (approximately, for large N)
• The tail falls off exponentially, suggesting that nodes with degrees far from the mean are very rare


Generating a graph using the ER model

• Input
  – p: probability of an edge
  – N: number of nodes in the network
• Output: an ER network of N nodes with $pN(N-1)/2$ edges on average
• For each possible edge, add it with probability p
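The procedure above can be sketched in a few lines. This is a minimal illustration for an undirected network; `erdos_renyi` is a hypothetical function name and the `seed` argument is only there to make runs repeatable.

```python
import itertools
import random

def erdos_renyi(n, p, seed=None):
    """Return the edge list of an undirected ER network on nodes 0..n-1."""
    rng = random.Random(seed)
    edges = []
    # Visit each of the n(n-1)/2 possible edges once and keep it with probability p.
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            edges.append((u, v))
    return edges

edges = erdos_renyi(n=200, p=0.05, seed=0)
expected_edges = 0.05 * 200 * 199 / 2
print("edges:", len(edges), "expected on average:", expected_edges)
```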


Scale free networks

• Degree distribution is captured by a power law distribution
• Such networks are ubiquitous in nature
• Scale-free networks can be generated by the Barabási-Albert preferential attachment model
• A “rich gets richer” model


Generating a Scale free network with the preferential attachment model

• Input:
  – N: number of nodes
  – m: number of existing nodes that each new node connects to
• Output: a scale-free network
• At each iteration
  – Add a node with m connections
  – Select an existing node i as one of the m neighbors with probability $p_i = \frac{k_i}{\sum_j k_j}$
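A minimal sketch of preferential attachment. The starting network (a complete graph on m + 1 nodes) is an assumption, since the slide does not say how to initialize; `preferential_attachment` is a hypothetical function name. Drawing repeatedly until m distinct targets are found is a simple stand-in for sampling without replacement, so the selection probabilities are only approximately proportional to degree.

```python
import random

def preferential_attachment(n, m, seed=None):
    """Return the edge list of a network grown by preferential attachment."""
    rng = random.Random(seed)
    # Assumed seed network: a complete graph on nodes 0..m.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks node i with probability k_i / sum_j k_j.
    stubs = [node for edge in edges for node in edge]
    for new_node in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.append((new_node, target))
            stubs.extend((new_node, target))
    return edges

edges = preferential_attachment(n=100, m=2, seed=1)
print("number of edges:", len(edges))
```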


Poisson versus Scale free

Barabasi & Oltvai


Path lengths

• The shortest path length between two nodes A and B:
  – The smallest number of edges that need to be traversed to get from A to B
• Mean path length is the average of the shortest path lengths over all pairs of nodes
• Diameter of a graph is the longest of all shortest paths in the network
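In an unweighted network these quantities can be computed with breadth-first search. A minimal sketch on a made-up four-node path; `bfs_distances` and `path_statistics` are hypothetical helper names, and unreachable pairs are simply skipped.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_statistics(adjacency):
    """Return (mean shortest path length, diameter) over connected pairs."""
    lengths = []
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                lengths.append(d)
    return sum(lengths) / len(lengths), max(lengths)

# Hypothetical undirected path A - B - C - D.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
mean_length, diameter = path_statistics(adjacency)
print(mean_length, diameter)
```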


Scale-free networks are ultra-small

• In a scale-free network the average path length grows as $\log \log N$
• In a random network (Erdős-Rényi network) the average path length grows as $\log N$


Clustering coefficient

• Measure of transitivity in the network
  – If A is connected to B, and B is connected to C, how often is A connected to C?
• Clustering coefficient $C_i$ for each node i is $C_i = \frac{2 n_i}{k_i (k_i - 1)}$
• $n_i$ is the number of edges among neighbors of i, and $k_i$ is the degree of i
• The ratio of the number of edges connecting i’s neighbors to the max possible
• Average clustering coefficient gives a measure of the tendency of nodes to form clusters

[Figure: nodes A, B and C, with A connected to B and B connected to C; is A connected to C?]
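A minimal sketch of the clustering coefficient, taken as the number of edges among a node's neighbors divided by the maximum possible, k(k-1)/2. The network is a made-up example; `clustering_coefficient` is a hypothetical helper name, and nodes with fewer than two neighbors are assigned 0 by convention here.

```python
def clustering_coefficient(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed for nodes with fewer than two neighbors
    # Count each edge among the neighbors once (u < v avoids double counting).
    links = sum(1 for u in neighbors for v in neighbors
                if u < v and v in adjacency[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical undirected network: triangle A-B-C plus a pendant node D on A.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
coefficients = {v: clustering_coefficient(adjacency, v) for v in adjacency}
average = sum(coefficients.values()) / len(coefficients)
print(coefficients, average)
```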


Clustering coefficient example

[Figure: example network on nodes A, B, C, D and G.]


Let’s look at some large networks

• We will consider networks of 800-1000 nodes
• One is generated using the preferential attachment model
• One is generated using the ER model


Networks generated from the different models

[Figure panels: Preferential attachment; ER random network]


Degree distributions of the two networks

[Figure panels: Preferential attachment; ER random network]


Comparing other properties of the networks


Relationship between clustering coefficient and degree

• Define C(k) as the average clustering coefficient of all nodes with degree k

• In some networks, $C(k) \sim k^{-1}$

• If this is true, the networks are said to have a hierarchical organization

• Smaller node sets are linked together to form larger modules.


Hierarchical network

Barabasi & Oltvai, 2004

A hierarchical network generated by replicating the current set of nodes

Scale-free distribution of degrees

Inverse relationship between C(k) and degree


Hierarchical organization is seen also among nodes

• Regulators are hierarchically organized with different roles per level
  – Top: Master regulators influence many genes
  – Middle: Bottlenecks directly targeting most genes
  – Bottom: Essential regulators

Hierarchical structure of S. cerevisiae regulatory network

Yu & Gerstein 2006, Jothi et al. 2009


Given a network how can we test what degree distribution it follows?

• Compute the empirical degree distribution
• The degree distribution can be Poisson or power law
• Estimate the parameters of each distribution from the data
• Pick the distribution that fits the data better.
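One way to make "fits the data better" concrete is to compare maximized log-likelihoods. The sketch below is an illustration under several assumptions: all degrees are at least 1, the power law is normalized over k = 1 up to the largest observed degree, and its exponent is found by a simple grid search rather than a proper optimizer. All function names are hypothetical.

```python
import math

def poisson_log_likelihood(degrees):
    """Log-likelihood of the degrees under a Poisson with the MLE rate (the mean)."""
    lam = sum(degrees) / len(degrees)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in degrees)

def power_law_log_likelihood(degrees):
    """Best log-likelihood of P(k) = k^-gamma / Z over a grid of gamma values."""
    k_max = max(degrees)
    log_k_sum = sum(math.log(k) for k in degrees)
    best_ll, best_gamma = -math.inf, None
    for step in range(1, 400):
        gamma = 1.0 + 0.01 * step  # grid from 1.01 to 4.99
        # Z normalizes over k = 1..k_max (truncation assumed for this sketch).
        z = sum(k ** -gamma for k in range(1, k_max + 1))
        ll = -gamma * log_k_sum - len(degrees) * math.log(z)
        if ll > best_ll:
            best_ll, best_gamma = ll, gamma
    return best_ll, best_gamma

def better_fit(degrees):
    """Name the distribution with the higher maximized log-likelihood."""
    ll_power, _ = power_law_log_likelihood(degrees)
    return "poisson" if poisson_log_likelihood(degrees) > ll_power else "power law"

# Two hypothetical degree samples.
narrow = [4, 5, 5, 6, 6, 6, 7, 7, 8]
heavy_tailed = [1] * 20 + [2] * 5 + [3] * 2 + [10, 30]
print(better_fit(narrow), "/", better_fit(heavy_tailed))
```

Both models have one free parameter here, so comparing raw log-likelihoods is reasonable for a first look; a formal comparison would use a likelihood ratio test or a goodness-of-fit test.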


Properties of scale free networks

• Degree distribution is best captured by a power law distribution

• Average clustering coefficient is higher than expected from a random network

• Average path length is smaller than expected from a random network


Centrality measures in networks

• A measure of how important a network node is
• Four types of centrality measures defined for each node
  – Degree centrality
    • The degree of a node
  – Betweenness centrality
    • The number of shortest paths between pairs of other nodes that pass through the node of interest
  – Closeness centrality
    • Based on the sum of distances to all other nodes (usually taken as its reciprocal, so that closer nodes score higher)
  – Eigenvector centrality
    • Given by the eigenvector of the adjacency matrix associated with the largest eigenvalue


Eigenvector centrality

• Based on the idea that neighbors with a high score should contribute more to the importance of a node
• Given by $x_v = \frac{1}{\lambda} \sum_{t \in M(v)} x_t$, where $M(v)$ is the set of neighbors of v and $\lambda$ is the largest eigenvalue of the adjacency matrix
• The centrality measures are given by the entries of the first eigenvector
• Google’s PageRank algorithm makes use of a type of eigenvector centrality
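A minimal sketch of eigenvector centrality by power iteration: repeatedly replace each node's score with a sum over its neighbors' scores and rescale. Adding the node's own score at each step (iterating with A + I instead of A) is a choice made here so the iteration also settles on bipartite networks; it leaves the eigenvectors unchanged. `eigenvector_centrality` is a hypothetical function name and the star network is invented.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-12):
    """Scores proportional to the leading eigenvector, scaled so the max is 1."""
    scores = {v: 1.0 for v in adjacency}
    for _ in range(iterations):
        # Own score plus the neighbors' scores: one multiplication by (A + I).
        updated = {v: scores[v] + sum(scores[t] for t in adjacency[v])
                   for v in adjacency}
        largest = max(updated.values())
        updated = {v: s / largest for v, s in updated.items()}
        change = max(abs(updated[v] - scores[v]) for v in adjacency)
        scores = updated
        if change < tol:
            break
    return scores

# Hypothetical star network: hub C connected to four leaves.
adjacency = {
    "C": ["L1", "L2", "L3", "L4"],
    "L1": ["C"], "L2": ["C"], "L3": ["C"], "L4": ["C"],
}
scores = eigenvector_centrality(adjacency)
print(scores)
```

For this star the largest eigenvalue of the adjacency matrix is 2, so the hub's score should be twice that of each leaf.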


Degree centrality of a node is correlated with functional importance of a node

• Red nodes, on deletion, cause the organism to die
• Red nodes are also among the most degree-central

Yeast protein-protein interaction network


Network motifs

• Degree distributions capture important global properties of the network

• Can we say something about more local properties of the network?

• Network motifs are defined as small recurring subnetworks that occur much more often than in a randomized network

• A subgraph is called a network motif of a network if it occurs significantly less often in randomized networks than in the original network.

• Some motifs are associated with specific network dynamics

Milo et al., Science 2002


Network motifs of size three in a directed network


Finding network motifs

• Enumerating motifs
  – Subgraph enumeration
• Calculating the number of occurrences in randomized networks

Milo et al. 2002
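A minimal sketch of both steps for one motif, the feed-forward loop (X→Y, Y→Z, X→Z). Randomized networks are produced by swapping edge targets, which keeps every node's in-degree and out-degree; this is one common choice, not necessarily the exact procedure of the cited paper. All function names and the tiny example network are hypothetical.

```python
import random
import statistics

def count_feed_forward_loops(edges):
    """Count triples (x, y, z) with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    successors = {}
    for u, v in edge_set:
        successors.setdefault(u, set()).add(v)
    count = 0
    for x, targets in successors.items():
        for y in targets:
            if y == x:
                continue  # ignore self-loops
            for z in successors.get(y, ()):
                if z != x and z != y and (x, z) in edge_set:
                    count += 1
    return count

def randomize_edges(edges, n_swaps, seed=None):
    """Degree-preserving randomization: swap targets of random edge pairs."""
    rng = random.Random(seed)
    edges = sorted(set(edges))
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        # Skip swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or (a, d) in present or (c, b) in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def compare_to_random(edges, n_random=50, seed=0):
    """Return (observed count, mean and std. dev. of counts in randomized networks)."""
    observed = count_feed_forward_loops(edges)
    random_counts = [
        count_feed_forward_loops(randomize_edges(edges, 10 * len(edges), seed + r))
        for r in range(n_random)
    ]
    return observed, statistics.mean(random_counts), statistics.pstdev(random_counts)

edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
observed, random_mean, random_sd = compare_to_random(edges)
print(observed, random_mean, random_sd)
```

A subgraph would be reported as a motif when its observed count is much larger than the randomized mean relative to the randomized spread, for example through a z-score.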


Network motifs found in many complex networks

The occurrence of the feed-forward loop in both networks suggests a fundamental similarity in the design of these networks


Common structural motifs seen in the yeast regulatory network

Lee et al. 2002, Mangan & Alon, 2003

[Figure panels: Auto-regulation; Multi-component; Feed-forward loop; Single Input; Multi Input; Regulatory Chain]

Feed-forward loops are involved in speeding up the response of the target gene


Modularity in networks

• Modularity “refers to a group of physically or functionally linked nodes that work together to achieve a distinct function”

-- Barabasi & Oltvai

• A similar idea is captured by the “community structure” in networks

• Two questions
  – Given a network, is it modular?
  – Given a network, what are the modules in the network?


A modular network

[Figure: a network with three densely connected groups of nodes, labeled Module 1, Module 2 and Module 3.]


Assessing the modularity of a network

• Modularity of a network can be assessed in two ways:
  – Recall the average clustering coefficient
  – A modular network is one that has a significantly higher clustering coefficient than a network with an equivalent number of nodes and degree distribution
• If we know an existing grouping of nodes, we can compute modularity (Q) as
  – the difference between within-group (community) connections and expected connections within a group

Q defined as in: Finding and evaluating community structure in networks, http://arxiv.org/abs/cond-mat/0308217v1
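A minimal sketch of Q for a given grouping, using the form $Q = \sum_c (e_{cc} - a_c^2)$, where $e_{cc}$ is the fraction of edges inside group c and $a_c$ is the fraction of edge ends attached to group c. This is the definition as commonly quoted from the cited paper; the network, the grouping and the name `modularity` are illustrative assumptions.

```python
def modularity(edges, community):
    """Q for an undirected edge list and a {node: group} assignment."""
    m = len(edges)
    inside = {}  # number of edges with both ends in the group
    ends = {}    # number of edge ends attached to the group
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    # Observed within-group fraction minus the fraction expected by chance.
    return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Hypothetical network: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_group = {node: "all" for node in range(1, 7)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Placing every node in a single group gives Q = 0, while the grouping that follows the two triangles gives a clearly positive value.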


Finding modules in a graph

• Given a graph, find the densely connected subgraphs
• Graph clustering algorithms are applicable here
  – Hierarchical clustering using the edge weight as a distance
    • How to define weight?
  – Markov clustering algorithm
  – Girvan-Newman algorithm


Girvan-Newman algorithm

• Initialize
  – Compute betweenness for all edges
• Repeat until a convergence criterion is met
  1. Remove the edge with the highest betweenness
  2. Recompute betweenness of affected edges
• Convergence criteria can be
  – No more edges
  – Desired modularity.
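A compact sketch of the edge-removal loop. Two simplifications are assumed: betweenness is recomputed for all edges after each removal rather than only the affected ones, and the loop stops at a requested number of groups (or when no edges remain) instead of a target modularity. Edge betweenness is computed with a breadth-first accumulation in the style of Brandes' algorithm. All names and the example network are hypothetical.

```python
from collections import deque

def edge_betweenness(adjacency):
    """Shortest-path betweenness of each undirected edge (each pair counted twice)."""
    betweenness = {}
    for s in adjacency:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist, sigma = {s: 0}, {s: 1.0}
        preds = {v: [] for v in adjacency}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest nodes, splitting credit over predecessors.
        delta = {v: 0.0 for v in order}
        for v in reversed(order):
            for u in preds[v]:
                credit = sigma[u] / sigma[v] * (1.0 + delta[v])
                edge = frozenset((u, v))
                betweenness[edge] = betweenness.get(edge, 0.0) + credit
                delta[u] += credit
    return betweenness

def components(adjacency):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in group:
                group.add(u)
                stack.extend(adjacency[u] - group)
        seen |= group
        result.append(group)
    return result

def girvan_newman(adjacency, n_groups):
    """Remove highest-betweenness edges until the network has n_groups components."""
    adjacency = {u: set(vs) for u, vs in adjacency.items()}  # work on a copy
    while len(components(adjacency)) < n_groups:
        scores = edge_betweenness(adjacency)
        if not scores:
            break  # no more edges
        u, v = max(scores, key=scores.get)
        adjacency[u].discard(v)
        adjacency[v].discard(u)
    return components(adjacency)

# Hypothetical network: two triangles joined by the edge 3-4.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
groups = girvan_newman(adjacency, n_groups=2)
print(groups)
```

In this example the bridging edge 3-4 carries every shortest path between the two triangles, so it is removed first and the two triangles are returned as the groups.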


Zachary’s karate club study

Each node is an individual and edges represent social interactions among individuals. The shape and colors represent different groups.

Node grouping based on betweenness


Summary of network analysis

• Given a network, its topology can be characterized using different measures
  – Degree distribution
  – Average path length
  – Clustering coefficient
• Centrality measures
  – Allow us to assess the importance of different nodes
• Network motifs
  – Overrepresentation of subgraphs of specific types

• Network modularity