CS 6293 Advanced Topics: Translational Bioinformatics


Page 1: CS 6293 Advanced Topics: Translational Bioinformatics

CS 6293 Advanced Topics: Translational Bioinformatics

Biological networks: Theory and applications

Page 2: CS 6293 Advanced Topics: Translational Bioinformatics

Lecture outline

• Basic terminology and concepts in networks

• Some interesting results between network properties and biological functions

• Network clustering / community discovery
• Applications of network clustering methods

Page 3: CS 6293 Advanced Topics: Translational Bioinformatics

Network

• A network refers to a graph
• A useful concept for analyzing the interactions among the components of a system

Page 4: CS 6293 Advanced Topics: Translational Bioinformatics

Biological networks
• An abstraction of the complex relationships among molecules in the cell
• Many types:
  – Protein-protein interaction networks
  – Protein-DNA (RNA) interaction networks
  – Genetic interaction networks
  – Metabolic networks
  – Signal transduction networks
  – (Real) neural networks
  – Many others
• In some networks, edges have a precise meaning; in others, the meaning of an edge is obscure

Page 5: CS 6293 Advanced Topics: Translational Bioinformatics

Protein-protein interaction networks

• Yeast PPI network
• Nodes – proteins
• Edges – interactions

The color of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).

Page 6: CS 6293 Advanced Topics: Translational Bioinformatics

Obtaining biological networks
• Direct experimental methods
  – Protein-protein interaction networks
    • Yeast two-hybrid
    • Tandem affinity purification
    • Co-immunoprecipitation
  – Protein-DNA interaction
    • Chromatin immunoprecipitation followed by microarray or sequencing (ChIP-chip, ChIP-seq)
  – High level of noise (false positives and false negatives)
• Computational prediction methods
  – Often cannot differentiate direct from indirect interactions

Page 7: CS 6293 Advanced Topics: Translational Bioinformatics

Why networks?
• Studying genes/proteins at the network level allows us to:
  – Assess the role of individual genes/proteins in the overall pathway
  – Evaluate redundancy of network components
  – Identify candidate genes involved in genetic diseases
  – Set up the framework for mathematical models

For complex systems, the actual output may not be predictable by looking at only individual components:

The whole is greater than the sum of its parts

Page 8: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs

• A graph G = (V, E)
  – V = set of vertices
  – E = set of edges = subset of V × V
  – Thus |E| = O(|V|^2)

[Figure: example graph with four vertices]

Vertices: {1, 2, 3, 4}

Edges: {(1, 2), (2, 3), (1, 3), (4, 3)}

Page 9: CS 6293 Advanced Topics: Translational Bioinformatics

Graph Variations (1)

• Directed / undirected:
  – In an undirected graph:
    • Edge (u, v) ∈ E implies edge (v, u) ∈ E
    • Example: road networks between cities
  – In a directed graph:
    • Edge (u, v): u → v does not imply v → u
    • Example: street networks in downtown
  – Degree of vertex v:
    • The number of edges incident to v
    • For a directed graph, there are in-degree and out-degree

Page 10: CS 6293 Advanced Topics: Translational Bioinformatics

[Figure: the example graph drawn as a directed graph and as an undirected graph. For vertex 3: degree = 3 (undirected); in-degree = 3, out-degree = 0 (directed).]

Page 11: CS 6293 Advanced Topics: Translational Bioinformatics

Graph Variations (2)
• Weighted / unweighted:
  – In a weighted graph, each edge or vertex has an associated weight (numerical value)
    • E.g., a road map: edges might be weighted with distance

[Figure: the example graph shown unweighted and weighted, with edge weights 0.3, 0.4, 1.2, and 1.9]

Page 12: CS 6293 Advanced Topics: Translational Bioinformatics

Graph Variations (3)

• Connected / disconnected:
  – A connected graph has a path from every vertex to every other vertex
  – A directed graph is strongly connected if there is a directed path from every vertex to every other vertex

[Figure: the directed example graph, which is connected but not strongly connected]

Page 13: CS 6293 Advanced Topics: Translational Bioinformatics

Graph Variations (4)

• Dense / sparse:
  – Graphs are sparse when the number of edges is linear in the number of vertices
    • |E| ∈ O(|V|)
  – Graphs are dense when the number of edges is quadratic in the number of vertices
    • |E| ∈ Θ(|V|^2)
  – Most graphs of interest are sparse
  – If you know you are dealing with dense or sparse graphs, different data structures may make sense

Page 14: CS 6293 Advanced Topics: Translational Bioinformatics

Representing Graphs
• Assume V = {1, 2, …, n}
• An adjacency matrix represents the graph as an n × n matrix A:
  – A[i, j] = 1 if edge (i, j) ∈ E
            = 0 if edge (i, j) ∉ E
• For a weighted graph
  – A[i, j] = w_ij if edge (i, j) ∈ E
            = 0 if edge (i, j) ∉ E
• For an undirected graph
  – Matrix is symmetric: A[i, j] = A[j, i]
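The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `adjacency_matrix` and its parameters are made up for this example, and the edge list is the four-vertex example graph from the earlier slide.

```python
def adjacency_matrix(n, edges, directed=True, weights=None):
    """Build an n x n adjacency matrix (list of lists) for vertices 1..n.

    `weights`, if given, is a list parallel to `edges`; otherwise every
    edge gets the value 1. Names here are illustrative only.
    """
    A = [[0] * n for _ in range(n)]
    for idx, (i, j) in enumerate(edges):
        w = 1 if weights is None else weights[idx]
        A[i - 1][j - 1] = w
        if not directed:
            A[j - 1][i - 1] = w  # undirected graphs give a symmetric matrix
    return A


edges = [(1, 2), (2, 3), (1, 3), (4, 3)]
A_directed = adjacency_matrix(4, edges)
A_undirected = adjacency_matrix(4, edges, directed=False)
```

Testing whether an edge exists is then a single lookup such as `A_directed[0][1]`, which is why the matrix answers edge queries in constant time.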

Page 15: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency Matrix

• Example:

[Figure: the directed example graph next to an empty 4 × 4 adjacency matrix A to be filled in]

Page 16: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency Matrix

• Example:

[Figure: the directed example graph]

A   1  2  3  4
1   0  1  1  0
2   0  0  1  0
3   0  0  0  0
4   0  0  1  0

How much storage does the adjacency matrix require?
A: O(V^2)

Page 17: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency Matrix

• Example:

[Figure: the example graph drawn as an undirected graph]

A   1  2  3  4
1   0  1  1  0
2   1  0  1  0
3   1  1  0  1
4   0  0  1  0

Undirected graph

Page 18: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency Matrix

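• Example:

[Figure: the undirected example graph with edge weights (1,2) = 5, (1,3) = 6, (2,3) = 9, (3,4) = 4]

A   1  2  3  4
1   0  5  6  0
2   5  0  9  0
3   6  9  0  4
4   0  0  4  0

Weighted graph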

Page 19: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency Matrix

• Time to answer whether there is an edge between vertices u and v: Θ(1)
• Memory required: Θ(n^2) regardless of |E|
  – Usually too much storage for large graphs
  – But can be very efficient for small graphs
• Most large interesting graphs are sparse
  – E.g., road networks (due to the limit on junctions)
  – For this reason the adjacency list is often a more appropriate representation

Page 20: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency List

• Adjacency list: for each vertex v ∈ V, store a list of vertices adjacent to v
• Example:
  – Adj[1] = {2, 3}
  – Adj[2] = {3}
  – Adj[3] = {}
  – Adj[4] = {3}
• Variation: can also keep a list of edges coming into each vertex

[Figure: the directed example graph]
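As a rough sketch, the adjacency list of the example graph can be built with a plain dictionary. The helper name `adjacency_list` is hypothetical, and the two totals at the end count the stored list items for the directed and the undirected version of the graph.

```python
def adjacency_list(vertices, edges, directed=True):
    """Map each vertex to the list of vertices adjacent to it."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)  # store the edge in both directions
    return adj


vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (1, 3), (4, 3)]
adj = adjacency_list(vertices, edges)
adj_undirected = adjacency_list(vertices, edges, directed=False)

# Number of stored items: |E| for the directed graph, 2|E| for the undirected one.
items_directed = sum(len(nbrs) for nbrs in adj.values())
items_undirected = sum(len(nbrs) for nbrs in adj_undirected.values())
```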

Page 21: CS 6293 Advanced Topics: Translational Bioinformatics

Graph representations
• Adjacency list

[Figure: the directed example graph with its adjacency lists: 1 → 2, 3; 2 → 3; 3 → (empty); 4 → 3]

How much storage does the adjacency list require?
A: O(V+E)

Page 22: CS 6293 Advanced Topics: Translational Bioinformatics

Graph representations

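• Undirected graph

[Figure: the undirected example graph with both representations]

A   1  2  3  4
1   0  1  1  0
2   1  0  1  0
3   1  1  0  1
4   0  0  1  0

Adj[1] = {2, 3}
Adj[2] = {1, 3}
Adj[3] = {1, 2, 4}
Adj[4] = {3}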

Page 23: CS 6293 Advanced Topics: Translational Bioinformatics

Graph representations

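• Weighted graph

[Figure: the weighted undirected example graph with both representations]

A   1  2  3  4
1   0  5  6  0
2   5  0  9  0
3   6  9  0  4
4   0  0  4  0

Adjacency list entries are (neighbor, weight) pairs:
Adj[1] = {(2, 5), (3, 6)}
Adj[2] = {(1, 5), (3, 9)}
Adj[3] = {(1, 6), (2, 9), (4, 4)}
Adj[4] = {(3, 4)}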

Page 24: CS 6293 Advanced Topics: Translational Bioinformatics

Graphs: Adjacency List
• How much storage is required?
• For directed graphs
  – |adj[v]| = out-degree(v)
  – Total # of items in adjacency lists is Σ out-degree(v) = |E|
• For undirected graphs
  – |adj[v]| = degree(v)
  – # of items in adjacency lists is Σ degree(v) = 2|E|
• So: adjacency lists take Θ(V+E) storage
• Time needed to test if edge (u, v) ∈ E is O(n)

Page 25: CS 6293 Advanced Topics: Translational Bioinformatics

Tradeoffs between the two representations

                    Adj Matrix   Adj List
Test (u, v) ∈ E     Θ(1)         O(n)
Degree(u)           Θ(n)         O(n)
Memory              Θ(n^2)       Θ(n+m)
Edge insertion      Θ(1)         Θ(1)
Edge deletion       Θ(1)         O(n)
Graph traversal     Θ(n^2)       Θ(n+m)

|V| = n, |E| = m

Both representations are very useful and have different properties, although adjacency lists are probably better for most problems

Page 26: CS 6293 Advanced Topics: Translational Bioinformatics

Structural properties of networks

• Degree distribution
• Average shortest path length
• Clustering coefficient
• Community structure
• Degree correlation
• Motivation to study structural properties:
  – Structure determines function
  – Functional structural properties may be shared by different types of real networks (bio or non-bio)

Page 27: CS 6293 Advanced Topics: Translational Bioinformatics

Degree distribution P(k)
• The probability that a selected node has exactly (or approximately) k links
  – P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, … links and dividing by the total number of nodes N
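The counting procedure can be sketched as follows. The function name and the small undirected example graph (stored as an adjacency list) are illustrative assumptions.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)} with P(k) = N(k) / N for an undirected graph.

    `adj` maps each vertex to the list of its neighbours.
    """
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())  # N(k)
    return {k: c / n for k, c in sorted(counts.items())}


# Undirected four-vertex example: degrees are 2, 2, 3, 1.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
P = degree_distribution(adj)
```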

Page 28: CS 6293 Advanced Topics: Translational Bioinformatics

Erdos-Renyi model

• Each pair of nodes forms an edge independently with probability p

• Most nodes have about the same number of connections

• Degree distribution is binomial (approximately Poisson for large, sparse networks)
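A minimal sketch of the model, assuming vertices are numbered 0..n-1; the function name `erdos_renyi` and the parameter values are chosen for illustration only. Each of the n(n-1)/2 vertex pairs is tested once, so the expected degree of every node is (n-1)p.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Random undirected graph: each pair is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


g = erdos_renyi(200, 0.05, seed=0)
mean_degree = sum(len(nbrs) for nbrs in g.values()) / len(g)  # close to (n-1)p = 9.95
```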

Page 29: CS 6293 Advanced Topics: Translational Bioinformatics

Real networks: scale-free

• Heavy-tailed distribution
  – Power-law distribution
    • P(k) ∝ k^(−r)

[Figure: histogram of number of genes (y-axis) versus number of connections (x-axis)]

Page 30: CS 6293 Advanced Topics: Translational Bioinformatics

Comparing Random and Scale-free distribution

• In the random network, the five nodes with the most links (in red) are connected to only 27% of all nodes (green). In the scale-free network, the five most connected nodes (red) are connected to 60% of all nodes (green) (source: Nature)

Page 31: CS 6293 Advanced Topics: Translational Bioinformatics

Robust yet fragile nature of networks

Page 32: CS 6293 Advanced Topics: Translational Bioinformatics

Shortest and mean path length
• Distance in networks is measured by the path length
• As there are many alternative paths between two nodes, the shortest path between the selected nodes has a special role
• In directed networks:
  – The distance from A to B is often different from the distance from B to A
  – Often there is no directed path between two nodes
• The average path length between all pairs of nodes offers a measure of a network's overall navigability
• Most pairs of vertices in a biological network seem to be connected by a short path: the small-world property
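For unweighted networks, shortest path lengths can be obtained with breadth-first search. The sketch below is one possible way to compute the average; it assumes the average is taken over ordered pairs that are actually connected by a path, which is a choice made here because unreachable pairs have no finite distance.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length (in edges) from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Mean shortest path length over ordered pairs (s, t) with t reachable from s."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


directed = {1: [2, 3], 2: [3], 3: [], 4: [3]}
undirected = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
```

In the directed example, vertex 4 is not reachable from vertex 1 at all, which illustrates the point about missing directed paths.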

Page 33: CS 6293 Advanced Topics: Translational Bioinformatics

Clustering coefficient

• Your clustering coefficient: the probability that two of your friends are also friends
  – You have m friends
  – Among your m friends, there are n pairs of friends
    • The maximum is m(m−1)/2
    • C = 2n / (m^2 − m)
• Clustering coefficient of a network: the average clustering coefficient of all individuals

Page 34: CS 6293 Advanced Topics: Translational Bioinformatics

Clustering Coefficient

C_i = 2E_i / (k_i(k_i − 1)) (= 2/9 in the pictured example)

• The i-th node has k_i neighbors linked to it

• E_i is the actual number of links among the k_i neighbors

• The maximal number of links among the k_i neighbors is k_i(k_i − 1)/2

The probability that two of your friends are also friends

• Clustering coefficient of a network: average clustering coefficient of all nodes
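The formula translates directly into code. In this sketch the graph is an undirected adjacency structure (vertex to set of neighbours), and nodes with fewer than two neighbours are given C_i = 0, which is a convention assumed here since the formula is undefined for them.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """C_i = 2 E_i / (k_i (k_i - 1)) for node i of an undirected graph."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbours exists
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])  # E_i
    return 2 * links / (k * (k - 1))


def network_clustering(adj):
    """Average clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)


adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

Node 3 has three neighbours (1, 2, 4) of which only the pair (1, 2) is linked, so its coefficient is 1/3.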

Page 35: CS 6293 Advanced Topics: Translational Bioinformatics
Page 36: CS 6293 Advanced Topics: Translational Bioinformatics
Page 37: CS 6293 Advanced Topics: Translational Bioinformatics

Degree correlation

• Do rich people tend to hang together with rich people (rich-club)?

• Or do they tend to interact with less wealthy people?

• Do high degree nodes tend to connect to low degree nodes or high degree ones?

Page 38: CS 6293 Advanced Topics: Translational Bioinformatics

Some interesting findings from biological networks

• Jeong et al., Lethality and centrality in protein networks. Nature 411, 41-42 (3 May 2001)

• Roger Guimerà and Luís A. Nunes Amaral, Functional cartography of complex metabolic networks. Nature 433, 895-900 (24 February 2005)

• Han et al., Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430, 88-93 (1 July 2004)

Page 39: CS 6293 Advanced Topics: Translational Bioinformatics

Connectivity vs essentiality

[Figure: % of essential proteins (y-axis) versus number of connections (x-axis)]

Jeong et al., Nature 2001

Page 40: CS 6293 Advanced Topics: Translational Bioinformatics

Community role vs essentiality

• The effect of a perturbation cannot depend on the node’s degree only!
• Many hub genes are not essential
• Some non-hub genes are essential
• Maybe a gene’s role in its community is also important
  – Local leader? Global leader? Ambassador?
  – Guimerà and Amaral, Nature 433, 2005

Page 41: CS 6293 Advanced Topics: Translational Bioinformatics

Community structure

Page 42: CS 6293 Advanced Topics: Translational Bioinformatics

• Role 1, 2, 3: non-hubs with increasing participation indices

• Role 5, 6: hubs with increasing participation indices

Page 43: CS 6293 Advanced Topics: Translational Bioinformatics

Dynamically organized modularity in the yeast PPI network

• Protein interaction networks are static
• Two proteins cannot interact if one is not expressed
• We should look at the gene expression level
• Han et al., Nature 430, 2004

Page 44: CS 6293 Advanced Topics: Translational Bioinformatics

Obtaining Data

Page 45: CS 6293 Advanced Topics: Translational Bioinformatics

Distinguish party hubs from date hubs

Red curve – hubs
Cyan curve – nonhubs
Black curve – randomized
• Partners of date hubs are significantly more diverse in spatial distribution than partners of party hubs

Page 46: CS 6293 Advanced Topics: Translational Bioinformatics

Effect of removal of nodes on average geodesic distance

Green – nonhub nodes
Brown – hubs
Red – date hubs
Blue – party hubs
The ‘breakdown point’ is the threshold after which the main component of the network starts disintegrating.

[Figure panels: original network; on removal of date hubs; on removal of party hubs]

Page 47: CS 6293 Advanced Topics: Translational Bioinformatics

Dynamically organized modularity

Red circles – date hubs
Blue squares – modules

Page 48: CS 6293 Advanced Topics: Translational Bioinformatics

Han-Yu Chuang, Eunjung Lee, Yu-Tseung Liu, Doheon Lee, Trey Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007; 3: 140.

Page 49: CS 6293 Advanced Topics: Translational Bioinformatics

Challenge: Predict Metastasis
• If metastasis is likely => aggressive adjuvant therapy
  – How to decide the likelihood?
• Traditional predictive factors are not good

Page 50: CS 6293 Advanced Topics: Translational Bioinformatics

Recently: Gene Marker Sets
• Examine genome-wide expression profiles
  – Score individual genes for how well they discriminate between different classes of disease
• Establish gene expression signature
  – Problem: # genes >> # patients

Page 51: CS 6293 Advanced Topics: Translational Bioinformatics

Pathway Expression vs. PPI Subnetwork as Marker

• Score known pathways for coherence of gene expression changes?
  – Majority of human genes not yet assigned to a definitive pathway
• Large protein-protein interaction networks recently became available
  – Extract subnetworks from PPI networks as markers

Page 52: CS 6293 Advanced Topics: Translational Bioinformatics

Subnetwork Marker Identification: Data Used

• 2 separate cohorts of breast cancer patients
  – van 't Veer et al. and Wang et al.
  – Roughly half had developed metastasis
• Used a protein-protein interaction network obtained by assembling a pooled dataset
  – 57,235 interactions among 11,203 proteins

Page 53: CS 6293 Advanced Topics: Translational Bioinformatics

Goal: Find Significantly Discriminative Subnetworks

• Use a scoring system to search for subnetworks highly discriminative of metastasis

Page 54: CS 6293 Advanced Topics: Translational Bioinformatics

Discriminative Score Function S

Page 55: CS 6293 Advanced Topics: Translational Bioinformatics

Step 1: Assign activity scores to a subnetwork of genes

Page 56: CS 6293 Advanced Topics: Translational Bioinformatics

Step 2: Assign discriminative score S to the subnetwork

• Score(subnetwork) = mutual information between a subnetwork’s activity score vector and the phenotype vector over all patients
  – S(k) = MI(a, c)
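A rough sketch of such a score is given below. The slide only states S = MI(a, c); the discretization of the continuous activity scores into equal-width bins, the number of bins, and all function names are assumptions made for this illustration. The result is in nats (natural logarithm).

```python
import math
from collections import Counter


def discretize(values, bins):
    """Assumed preprocessing: map continuous scores to equal-width bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """MI of two discrete vectors, estimated from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def discriminative_score(activity, phenotype, bins=4):
    """S = MI(a', c), where a' is the discretized activity vector."""
    return mutual_information(discretize(activity, bins), phenotype)


# Toy data: activity separates the two phenotype classes perfectly.
activity = [0.1, 0.2, 0.15, 0.9, 0.8, 0.95]
phenotype = [0, 0, 0, 1, 1, 1]
score = discriminative_score(activity, phenotype, bins=2)
```

A subnetwork whose activity is unrelated to the phenotype scores close to 0, while a perfectly separating one reaches the entropy of the phenotype vector (log 2 for two balanced classes).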

Page 57: CS 6293 Advanced Topics: Translational Bioinformatics

Find Candidate Subnetworks using S and Greedy Search

• Use a single PPI node as seed
  – At each iteration, add the neighbor resulting in the highest score improvement
  – Stop when no addition increases the score by rate r = 0.05, or distance from seed > 2
  – Report candidate subnetwork and repeat with next node as seed
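The greedy procedure might be sketched as follows. The `score` argument is a hypothetical stand-in for the discriminative score S (here a toy sum of per-gene weights), and "increases the score by rate r" is interpreted as requiring the new score to be at least (1 + r) times the current one; both are assumptions of this sketch.

```python
from collections import deque


def within_distance(adj, seed, max_dist):
    """Vertices whose shortest-path distance from `seed` is at most max_dist."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def greedy_subnetwork(adj, seed, score, r=0.05, max_dist=2):
    """Grow a subnetwork from `seed`, adding the best-scoring neighbour each round."""
    allowed = within_distance(adj, seed, max_dist)
    members = {seed}
    best = score(members)
    while True:
        frontier = {v for u in members for v in adj[u]}
        frontier = (frontier & allowed) - members
        if not frontier:
            break
        new_score, v = max(((score(members | {v}), v) for v in sorted(frontier)),
                           key=lambda t: t[0])
        if new_score <= best or new_score < best * (1 + r):
            break  # no addition improves the score by rate r
        members.add(v)
        best = new_score
    return members, best


# Toy path network a-b-c-d-e with a toy additive score.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
weight = {"a": 1, "b": 2, "c": 0.01, "d": 5, "e": 5}
members, best = greedy_subnetwork(adj, "a", lambda s: sum(weight[g] for g in s))
```

Repeating the call with every node as seed yields one candidate subnetwork per seed.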

Page 58: CS 6293 Advanced Topics: Translational Bioinformatics

Identify Significant Subnets from 3 Null Distributions

• p1: 100 expression permutation trials, p < 0.05
  – Expression vectors of individual genes randomly permuted on the network
• p2: 100 random subnetworks seeded at protein i, p < 0.05
• p3: 20,000 phenotype permutation trials, p < 0.00005
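The phenotype permutation test (p3) can be illustrated with a generic sketch: shuffle the class labels, recompute the score, and report the fraction of shuffles that score at least as high as the real labels. The function names and the toy score (absolute difference of group means) are assumptions; in the study the score would be the subnetwork score S.

```python
import random


def permutation_p_value(score_fn, labels, trials=1000, seed=0):
    """Fraction of label permutations scoring at least as high as the real labels."""
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            hits += 1
    return hits / trials


# Toy example: six patients, one activity value each, two phenotype classes.
values = [1, 2, 3, 10, 11, 12]


def mean_gap(labels):
    g0 = [v for v, l in zip(values, labels) if l == 0]
    g1 = [v for v, l in zip(values, labels) if l == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


p = permutation_p_value(mean_gap, [0, 0, 0, 1, 1, 1], trials=2000)
```

With 20,000 trials the smallest nonzero value this estimate can take is 1/20,000 = 0.00005, so p < 0.00005 corresponds to no permutation reaching the observed score.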

Page 59: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Correspondence to hallmarks of cancer
• For two datasets of 295 and 286 patients, 149 and 243 (resp.) discriminative subnets found
• 47% and 65% of subnets enriched for a common biological process
• 66 and 153 subnets were enriched for processes involved in major events of cancer progression

Page 60: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Reproducibility
• Subnetwork markers significantly more reproducible between datasets than individual gene markers

Page 61: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Reproducibility

Dataset 1 Dataset 2

Page 62: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Reproducibility
• Shared network motifs with differences in differential expression
• Left-hand side is from Dataset 1 and right-hand side is from Dataset 2

Page 63: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Subnetwork Markers as Classifiers
• Averaged expression values for each subnetwork were used as features for a classifier based on logistic regression
• For comparison, the top individual gene markers were instead used as features
• Markers from one dataset were used as predictors of metastasis on the other dataset

Page 64: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Subnetwork Markers as Classifiers

Dataset 1 markers tested on Dataset 2, and vice versa

Page 65: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Informative of Non-discriminative Disease Genes

• Network analyses can identify proteins that are not differentially expressed but are required to connect higher-scoring proteins in a significant subnetwork
• 85.9% and 96.7% of the significant subnetworks contained at least one protein that was not significantly differentially expressed in metastasis

Page 66: CS 6293 Advanced Topics: Translational Bioinformatics

Results: Informative of Non-discriminative Disease Genes
• Several established prognostic markers were not present in individual gene expression markers, but played a central, interconnecting role in discriminative subnetworks
  – MYC, ERBB2

Page 67: CS 6293 Advanced Topics: Translational Bioinformatics

Community discovery: motivations

• Biological networks are modular
  – Metabolic pathways
  – Protein complexes
  – Transcriptional regulatory modules
• Provide a high-level overview of the networks
• Predict gene functions based on communities

Page 68: CS 6293 Advanced Topics: Translational Bioinformatics

Community discovery problem

• Divide a network into relatively densely connected sub-networks

[Figure: vertex reorder]

Page 69: CS 6293 Advanced Topics: Translational Bioinformatics

Challenges

• How many communities?
• Is there any community at all?

Page 70: CS 6293 Advanced Topics: Translational Bioinformatics

Community structures
• Also known as modules
• Relatively densely connected sub-networks
• Quite common in real networks
  – Social networks
  – Internet
  – Biological networks
  – Transportation
  – Power grid


Page 72: CS 6293 Advanced Topics: Translational Bioinformatics

History

• Social science: clustering
  – Based on affinities / similarities
  – Need to give # of clusters
  – Can always find clusters
• Computer science: graph partitioning
  – Minimizing cut / cut ratio
  – Need to give # of partitions
  – Can always produce partitions
• Preferred approach: natural division
  – Automatically determine # of communities
  – Do not partition if no community

Page 73: CS 6293 Advanced Topics: Translational Bioinformatics

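Modularity function (Q)

• Measures the strength of community structure
  – Newman, Phys Rev E, 2003

Example with two communities:

    | e11  e12 |      a1 = e11 + e12
    | e21  e22 |      a2 = e21 + e22      M = a1 + a2

$$Q = \sum_{i=1}^{k} \left[ \frac{e_{ii}}{M} - \left( \frac{a_i}{M} \right)^2 \right]$$

• k: number of communities
• e_ii / M: observed fraction of edges falling in community i
• (a_i / M)^2: expected fraction of edges falling in community i
• −1 < Q < 1
• Q = 0 if k = 1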
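To make the definition concrete, the sketch below computes Q for an undirected, unweighted graph. It assumes that e_ij is obtained by summing adjacency-matrix entries between communities i and j (so e_ii counts every internal edge twice), a_i is the row sum of e, and M is the sum of all a_i, which matches M = a1 + a2 in the two-community example. The function name and the test graph (two triangles joined by one edge) are illustrative.

```python
def modularity(adj, communities):
    """Q = sum_i [ e_ii / M - (a_i / M)^2 ].

    adj: dict vertex -> set of neighbours (undirected, no self-loops).
    communities: list of disjoint vertex sets covering all vertices.
    """
    label = {v: i for i, c in enumerate(communities) for v in c}
    k = len(communities)
    e_in = [0] * k  # e_ii: adjacency entries with both endpoints in community i
    a = [0] * k     # a_i: total degree of the vertices in community i
    for u, nbrs in adj.items():
        for v in nbrs:
            a[label[u]] += 1
            if label[u] == label[v]:
                e_in[label[u]] += 1
    M = sum(a)
    return sum(e_in[i] / M - (a[i] / M) ** 2 for i in range(k))


# Two triangles (1,2,3) and (4,5,6) joined by the edge (3,4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
q_two = modularity(adj, [{1, 2, 3}, {4, 5, 6}])
q_one = modularity(adj, [{1, 2, 3, 4, 5, 6}])
```

Splitting the graph into its two triangles gives Q = 5/14, while putting everything in one community gives Q = 0, in line with the k = 1 case.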

Page 74: CS 6293 Advanced Topics: Translational Bioinformatics

[Figure: alternative partitions of an example network, with Q = 0, 0.40, 0.45, 0.54, and 0.56]

Goal: find the partition that has the highest Q value
But: optimizing Q is NP-hard (Brandes et al., 2006)

Page 75: CS 6293 Advanced Topics: Translational Bioinformatics

Heuristic algorithms

• k-way spectral partitioning approximately optimizes Q if k is known
  – White & Smyth, SDM 2005
• k is unknown: test all possible k’s

[Figure: spectral partitioning pipeline, with steps labeled eig and kmeans]

Page 76: CS 6293 Advanced Topics: Translational Bioinformatics

k-way spectral partitioning

[Figure: partitions for k = 2, k = 3, and k = 4; Q values shown in the figure: 0.56, 0.40, 0.54]

• Good accuracy
• ~O(n^3) time complexity; n: # of vertices

Page 77: CS 6293 Advanced Topics: Translational Bioinformatics

Recursive bi-partitioning

[Figure: recursive bi-partitioning example; Q values shown in the figure: 0.56, 0.40, 0.54]

• ~O(m log n) time complexity; m: # of edges
• Accuracy worse than k-way partitioning

Page 78: CS 6293 Advanced Topics: Translational Bioinformatics

Can we do better?

• Objectives
  – Efficiency of the recursive algorithm
  – Accuracy of the k-way algorithm (or even better)
• Ideas
  – Flexible l-way recursive partition (l = 2-5)
    • As efficient as recursive bi-partition
    • Accuracy similar to k-way algorithm
    • Ruan and Zhang, ICDM 2007
  – Take the results of the recursive algorithm as the starting point, do local improvement
    • Ruan and Zhang, Physical Review E 2008

Page 79: CS 6293 Advanced Topics: Translational Bioinformatics

Algorithm Qcut

1. Recursive partitioning until local maximum of Q
2. Refine solution by greedy search
  – Consider two types of operations
    • Move a vertex to a different community
    • Merge two communities
  – Take the one with the largest improvement of Q
  – Repeat until no improvement of Q can be made
  – Go back to step 1 if necessary
• Key: quickly find the operation that gives the largest improvement of Q

Page 80: CS 6293 Advanced Topics: Translational Bioinformatics

Identifying candidate moves

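• If vertex v moves from community i to j:

$$\Delta Q = \frac{2\,(x_j - x_i)}{M} + \frac{2\,x\,(a_i - a_j - x)}{M^2}$$

  – x_i: degree of v in community i (number of links from v into community i)
  – x: degree of v
  – a_i: total degree of the vertices in community i
  – M: total degree of the network (M = Σ a_i)
• Compute all potential ΔQ from the initial state
• Update is almost constant time for scale-free networks
• Additional heuristics to improve efficiency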
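The change in Q for a single vertex move can be checked numerically. The sketch below assumes the normalization in which M is the total degree of the network (the sum of all a_i) and Q = Σ [e_ii/M − (a_i/M)^2], with e_ii counting each internal edge twice; under that assumption the change is 2(x_j − x_i)/M + 2x(a_i − a_j − x)/M^2. The function names and the test graph are made up for illustration, and the incremental value is compared against a full recomputation of Q.

```python
def modularity(adj, label):
    """Q with M = total degree; label maps vertex -> community id."""
    M = sum(len(nbrs) for nbrs in adj.values())
    e, a = {}, {}
    for u, nbrs in adj.items():
        c = label[u]
        a[c] = a.get(c, 0) + len(nbrs)
        e[c] = e.get(c, 0) + sum(1 for v in nbrs if label[v] == c)
    return sum(e[c] / M - (a[c] / M) ** 2 for c in a)


def delta_q(adj, label, v, j):
    """Change in Q if vertex v moves from its community i to community j."""
    i = label[v]
    if i == j:
        return 0.0
    M = sum(len(nbrs) for nbrs in adj.values())
    x = len(adj[v])                                   # degree of v
    x_i = sum(1 for u in adj[v] if label[u] == i)     # links of v into i
    x_j = sum(1 for u in adj[v] if label[u] == j)     # links of v into j
    a_i = sum(len(adj[u]) for u in adj if label[u] == i)
    a_j = sum(len(adj[u]) for u in adj if label[u] == j)
    return 2 * (x_j - x_i) / M + 2 * x * (a_i - a_j - x) / M ** 2


# Two triangles joined by the edge (3,4); move vertex 3 to the other community.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
before = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
after = dict(before)
after[3] = 1
predicted = delta_q(adj, before, 3, 1)
actual = modularity(adj, after) - modularity(adj, before)
```

This sketch recomputes a_i and a_j from scratch for clarity; an efficient implementation would cache them and update only the two affected communities after each move.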

Page 81: CS 6293 Advanced Topics: Translational Bioinformatics

Results on synthetic networks

• Relative Q = Q_found − Q_true

[Figure: relative Q versus N_out, and accuracy versus N_out]

• State of the art: Newman, PNAS 2006

Page 82: CS 6293 Advanced Topics: Translational Bioinformatics

An example

[Figure panels: real structure; vertex reordered; result of Qcut (accuracy: 99%); result of Newman (accuracy: 77%)]

Page 83: CS 6293 Advanced Topics: Translational Bioinformatics

Results on real-world networks

                                      Modularity
Network      #Vertices   #Edges    Newman   SA      Qcut
Social       67          142       0.573    0.608   0.587
Neuron       297         2359      0.396    0.408   0.398
Ecoli Reg    418         519       0.766    0.752   0.776
Circuit      512         819       0.804    0.670   0.815
Yeast Reg    688         1079      0.759    0.740   0.766
Ecoli PPI    1440        5871      0.367    0.387   0.387
Internet     3015        5156      0.611    0.624   0.632
Physicists   27519       116181    --       --      0.744

SA: Simulated annealing, Guimera & Amaral, Nature 2005

Page 84: CS 6293 Advanced Topics: Translational Bioinformatics

Running time (seconds)

                                      Running time
Network      #Vertices   #Edges    Newman   SA      Qcut
Social       67          142       0.0      5.4     2.0
Neuron       297         2359      0.4      139     1.9
Ecoli Reg    418         519       0.7      147     12.7
Circuit      512         819       1.8      143     6.1
Yeast Reg    688         1079      3.0      1350    13.4
Ecoli PPI    1440        5871      33.2     5868    41.5
Internet     3015        5156      253.7    11040   43.0
Physicists   27519       116181    --       --      2852

Page 85: CS 6293 Advanced Topics: Translational Bioinformatics

Graphical user interface for biologists

Page 86: CS 6293 Advanced Topics: Translational Bioinformatics

A real-world example
• A classic social network: Karate club
• Node – club member; edge – friendship
• Club was split due to a dispute
• Can we predict the split given the network?

Page 87: CS 6293 Advanced Topics: Translational Bioinformatics

Network of football teams

• Vertices: football teams in NCAA Division I-A

• Edges: games played in year 2000

• 110 teams
• 11 conferences (excluding independents)
• Most games are within conferences

[Figure: network of teams; labeled conferences include Big 12 and Big East]

Page 88: CS 6293 Advanced Topics: Translational Bioinformatics

Conference vs. Community

[Figure panels: conferences; communities discovered by Qcut / Newman. Labeled conferences: Mountain West, Pacific Ten]

Page 89: CS 6293 Advanced Topics: Translational Bioinformatics

Whose fault is it?
• Communities discovered by Qcut / Newman: Q = 0.6239
• Force the two conferences to be separated: Q = 0.6237

Page 90: CS 6293 Advanced Topics: Translational Bioinformatics

Resolution limit of the Q function

• C1 and C2 are separable only if Q2 − Q1 > 0
• Q2 − Q1 ∝ a1a2/2M − e12
  – a1a2/2M: expected # of edges between C1 and C2
  – e12: actual # of edges between C1 and C2
• If C1 and C2 are small relative to the network
  – Expected # of edges < 1
  – C1 and C2 are non-separable even if connected by only one edge
  – But the edge may be due to noise in the data

[Figure: a large network with C1 and C2 merged into one community (Q1) versus kept as separate communities (Q2)]
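The proportionality can be derived in a few lines. The sketch below assumes that M on this slide counts edges (so the total degree is 2M), which is what makes a1a2/2M an expected number of edges. It writes m_1 and m_2 (symbols introduced here) for the numbers of edges inside C1 and C2, takes Q2 as the partition with C1 and C2 separate and Q1 as the one with them merged, and omits the contributions of all other communities since they cancel.

```latex
\begin{align*}
Q_2 \text{ (separate)}: &\quad \frac{m_1}{M} - \left(\frac{a_1}{2M}\right)^2
                           + \frac{m_2}{M} - \left(\frac{a_2}{2M}\right)^2 \\
Q_1 \text{ (merged)}:   &\quad \frac{m_1 + m_2 + e_{12}}{M}
                           - \left(\frac{a_1 + a_2}{2M}\right)^2 \\
Q_2 - Q_1 &= \frac{2 a_1 a_2}{4M^2} - \frac{e_{12}}{M}
           = \frac{1}{M}\left(\frac{a_1 a_2}{2M} - e_{12}\right)
\end{align*}
```

As a hypothetical numeric example, with M = 1000 edges and a1 = a2 = 20 the expected number of edges is 400/2000 = 0.2, so a single connecting edge (e12 = 1) already gives Q2 − Q1 = (0.2 − 1)/1000 < 0 and the two communities are merged.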

Page 91: CS 6293 Advanced Topics: Translational Bioinformatics

Resolution limit
• Optimizing Q
  – may miss small communities
  – is sensitive to false-positive edges
  – cannot reveal hierarchical structures
    • A community containing some sub-communities
• Real-world networks
  – contain both large and small communities
  – may have false-positive edges
    • Biological data are extremely noisy
  – have hierarchies

Page 92: CS 6293 Advanced Topics: Translational Bioinformatics

A solution: HQcut

• Ruan & Zhang, Physical Review E 2008
• Apply Qcut to get communities with largest Q
• Recursively search for sub-communities within each community
• When to stop?
  – Q value of sub-network is small, or
  – Q is not statistically significant
    • Estimated by Monte-Carlo method

Page 93: CS 6293 Advanced Topics: Translational Bioinformatics

[Figure: three sub-networks, each compared with its randomized counterparts]

• Q = 0.49, randQ = 0.15 ± 0.016: Z-score = (0.49 − 0.15) / 0.016 = 21
• Q = 0.18, randQ = 0.15 ± 0.016: Z-score = (0.18 − 0.15) / 0.016 = 1.9
• Q = 0.49, randQ = 0.52 ± 0.031: Z-score = (0.49 − 0.52) / 0.031 = −1.3

Page 94: CS 6293 Advanced Topics: Translational Bioinformatics

Large network

• Q = 0.49, Z-score = −1.3
• Q = 0.49, Z-score = 21
• Q = 0.18, Z-score = 1.9

Page 95: CS 6293 Advanced Topics: Translational Bioinformatics

Test on synthetic networks
• Network: 1000 vertices
• Community sizes vary from 15 to 100

Page 96: CS 6293 Advanced Topics: Translational Bioinformatics

Accuracy

Page 97: CS 6293 Advanced Topics: Translational Bioinformatics

Example communities

[Figure panels: discovered by Qcut; discovered by HQcut]

Page 98: CS 6293 Advanced Topics: Translational Bioinformatics

Results for the NCAA teams

[Figure panels: communities by Qcut/Newman; communities by HQcut. Labeled conferences: Mountain West, Pacific Ten]

Page 99: CS 6293 Advanced Topics: Translational Bioinformatics

Applications to a PPI network

• Protein-protein interaction (PPI) network
  – Vertices: proteins
  – Edges: interactions detected by experiments
• Motivation:
  – Community = protein complex?
• Protein complex
  – Group of proteins associated via interactions
  – Elementary functional unit in the cell
  – Prediction from PPI network is important

Page 100: CS 6293 Advanced Topics: Translational Bioinformatics

Experiments

• Data set
  – A yeast protein-protein interaction network
    • Krogan et al., Nature 2006
  – 2708 proteins, 7123 interactions
• Algorithms:
  – Qcut, HQcut, Newman
• Evaluation
  – ~300 known protein complexes in MIPS
  – How well does a community match a known protein complex?

Page 101: CS 6293 Advanced Topics: Translational Bioinformatics

Results

                                         Newman    Qcut       HQcut
# of communities                         56        93         316
Max community size                       312       264        60
# of matched communities                 53        52         216
Communities with matching score = 1      5 (9%)    7 (13%)    43 (20%)
Average matching score                   0.56      0.55       0.70
# of novel predictions                   3         41         100

Page 102: CS 6293 Advanced Topics: Translational Bioinformatics

Communities found by HQcut
• Small ribosomal subunit (90%)
• RNA poly II mediator (83%)
• Proteasome core (90%)
• Exosome (94%)
• gamma-tubulin (77%)
• respiratory chain complex IV (82%)

Page 103: CS 6293 Advanced Topics: Translational Bioinformatics

Example hierarchical community

Page 104: CS 6293 Advanced Topics: Translational Bioinformatics

Microarray data
• Data organized into a matrix
  – Rows are genes
  – Columns are samples representing different time points, conditions, tissues, etc.
• Analysis techniques
  – Differential expression analysis
  – Classification and clustering
  – Regulatory network construction
  – Enrichment analysis
• Characteristics of microarray data
  – High dimensionality and noise
  – Underlying topology unknown, often irregular shape

[Figure: expression heat map, genes (rows) by samples (columns). Red: high activity; green: low activity]

Page 105: CS 6293 Advanced Topics: Translational Bioinformatics

Microarray data clustering

• Many clustering algorithms available
  – K-means
  – Hierarchical
  – Self-organizing maps
  – Parameters hard to tune
  – Do not consider network topology
• Analyze genes in each cluster
  – Common functions?
  – Common regulation?
  – Predict functions for unknown genes?

[Figure: expression heat map, genes by samples. Red: high activity; green: low activity]

Page 106: CS 6293 Advanced Topics: Translational Bioinformatics

Network-based data analysis

• Genes i and j connected if their expression patterns are “sufficiently similar”
  – Similarity > threshold
    • Long list of references
  – K nearest neighbors
• Recently became popular
• Many interesting applications beyond clustering
• Focus here is clustering

[Figure: expression matrix (genes by samples) used to construct a co-expression network; an edge joins genes i and j]

Page 107: CS 6293 Advanced Topics: Translational Bioinformatics

Motivation

• Can we use the idea of community finding for clustering microarray data?

• Advantages:
  – Parameter free
  – Network topology considered
  – Constructed network may have other uses

Page 108: CS 6293 Advanced Topics: Translational Bioinformatics

Network-based microarray data analysis

• How to get the networks?
  – Threshold-based
  – Nearest neighbors
• Can we use a complete weight matrix?
  – Complete graph, with weighted edges
  – In general, no, since Q is ill-defined on weighted networks

How to determine the right cutoff?

[Figure: expression matrix (genes by samples) used to construct a co-expression network; an edge joins genes i and j]

Page 109: CS 6293 Advanced Topics: Translational Bioinformatics

Network-based microarray data analysis

• There is an implicit network structure

• Motivation: true network should be naturally modular
  – Can be measured by modularity (Q)
  – If constructed right, should have the highest Q

[Figure: clustering of the gene-by-condition expression matrix]
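To make "measured by modularity (Q)" concrete, here is a minimal sketch that assumes the standard Newman-Girvan definition for an unweighted, undirected network: Q is the sum over communities of (fraction of edges inside the community) minus (fraction of edge endpoints in the community) squared. The function name and the toy graph are illustrative, not from the lecture.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition.

    edges: list of undirected edges (u, v), each listed once.
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_one = modularity(edges, {n: "a" for n in range(6)})
```

Splitting the toy graph into its two triangles gives Q = 5/14, while putting every node in one community gives Q = 0.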

Page 110: CS 6293 Advanced Topics: Translational Bioinformatics

Method overview

[Figure: pipeline from microarray data to a similarity matrix to a network series (Net_1, most dense, through Net_m, most sparse), with Qcut run on each network]

Page 111: CS 6293 Advanced Topics: Translational Bioinformatics

Method overview (cont’d)

[Figure: modularity versus network density for the true network and for a random network, with the difference between the two curves marked]

• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure

• We actually run HQcut, a variant of Qcut, to avoid the resolution limit (Ruan & Zhang, Phys Rev E 2008)
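The selection step can be sketched as follows. This assumes the modularity of the real network and of a randomized counterpart has already been computed for each candidate parameter (for example, each k); the function name and the numbers are made up for illustration and are not results from the lecture.

```python
def select_by_delta_q(q_real, q_random):
    """Return (best parameter, delta Q) maximizing Q_real - Q_random.

    q_real, q_random: dicts mapping a network parameter (e.g. k) to the
    modularity found on the real and on the randomized network.
    """
    delta = {k: q_real[k] - q_random[k] for k in q_real}
    best = max(delta, key=delta.get)
    return best, delta[best]


# Hypothetical values: the sparsest network (k = 5) has the highest raw Q,
# but its random counterpart is also highly modular.
q_real = {5: 0.80, 10: 0.75, 20: 0.60}
q_random = {5: 0.60, 10: 0.40, 20: 0.30}
best_k, best_dq = select_by_delta_q(q_real, q_random)
```

In this toy case the raw Q would pick k = 5, whereas ∆Q picks k = 10, which is the reason for subtracting the random-network baseline.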

Page 112: CS 6293 Advanced Topics: Translational Bioinformatics

Network construction methods

• Value-based method
  – Remove edges with similarities < ε
  – Fixed ε for all vertices
  – May have problems detecting weakly correlated modules
• Asymmetric k-nearest neighbors (aKNN)
  – Connect each vertex to k other vertices
  – Fixed k for all vertices (k < 10 good enough)
  – Minimum degree = k, max = ?
  – Sensitive to outliers
• Mutual k-nearest neighbors (mKNN)
  – Association confirmed by both ends
  – Maximum degree = k, min = 0 (k larger than in aKNN)
  – Outliers can be detected
  – Ruan, ICDM 2009
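The three construction methods can be sketched in a few lines. This is an illustrative version that takes a symmetric similarity matrix as a list of lists and returns undirected edges as (i, j) pairs with i < j; the function names and the toy matrix are assumptions, not the lecture's code.

```python
def threshold_edges(sim, eps):
    """Value-based: drop pairs with similarity < eps, keep the rest."""
    n = len(sim)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if sim[i][j] >= eps}


def top_k(sim, i, k):
    """Indices of the k vertices most similar to vertex i (excluding i)."""
    others = [j for j in range(len(sim)) if j != i]
    others.sort(key=lambda j: sim[i][j], reverse=True)
    return set(others[:k])


def aknn_edges(sim, k):
    """Asymmetric kNN: edge if either endpoint has the other in its top k."""
    n = len(sim)
    nbrs = [top_k(sim, i, k) for i in range(n)]
    return {(min(i, j), max(i, j)) for i in range(n) for j in nbrs[i]}


def mknn_edges(sim, k):
    """Mutual kNN: edge only if both endpoints have each other in their top k."""
    n = len(sim)
    nbrs = [top_k(sim, i, k) for i in range(n)]
    return {(i, j) for i in range(n) for j in nbrs[i]
            if i < j and i in nbrs[j]}


# Toy similarity matrix: vertices 0-2 are mutually similar, vertex 3 is an outlier.
sim = [
    [1.0, 0.9, 0.8, 0.3],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.3, 0.2, 0.1, 1.0],
]
value_net = threshold_edges(sim, 0.75)
aknn_net = aknn_edges(sim, 2)
mknn_net = mknn_edges(sim, 2)
```

With k = 2, aKNN attaches the outlier (vertex 3) to vertices 0 and 1, while mKNN leaves it isolated, which illustrates why aKNN is sensitive to outliers and mKNN can detect them.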

Page 113: CS 6293 Advanced Topics: Translational Bioinformatics

Results: synthetic data set 1

• High dimensional data generated by synDeca
  – 20 clusters of high dimensional points, plus some scatter points
  – Clusters are of various shapes: ellipse, rectangle, random

[Figure: data matrix heat map; plot of QReal, QRandom, ∆Q (QReal - QRandom), and clustering accuracy versus number of neighbors]

Page 114: CS 6293 Advanced Topics: Translational Bioinformatics

Comparison

[Figure: clustering accuracy versus dimension. Legend: This work; kmeans; optimal knn HQcut. Annotated curves: mKNN-HQcut with the optimum k; mKNN-HQcut with automatically determined k]

Page 115: CS 6293 Advanced Topics: Translational Bioinformatics

Results: synthetic data set 2

• Gene expression data
  – Thalamuthu et al., 2006
  – 600 data sets
  – ~600 genes, 50 conditions, 15 clusters
  – 0 or 1x outliers

[Figure: panels "Without outliers" and "With outliers"; curves: mKNN-HQcut with optimal k, mKNN-HQcut with auto k]

Page 116: CS 6293 Advanced Topics: Translational Bioinformatics

Comparison with other methods

Page 117: CS 6293 Advanced Topics: Translational Bioinformatics

Results on yeast stress response data

• 3000 genes, 173 samples
• Best k = 140, resulting in 75 clusters

Page 118: CS 6293 Advanced Topics: Translational Bioinformatics

Results on yeast stress response data

• Enrichment of common functions
  – Cumulative hypergeometric test

[Figure: genes versus GO function terms, with enriched clusters annotated:]
  – Protein biosynthesis (p < 10^-96)
  – Nuclear transport (p < 10^-50)
  – mt ribosome (p < 10^-63)
  – DNA repair (p < 10^-66)
  – RNA splicing (p < 10^-105)
  – Nitrogen compound metabolism (p < 10^-37)
  – Peroxisome (p < 10^-13)
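A cumulative hypergeometric test asks how likely it is to see at least the observed number of annotated genes in a cluster by chance. The sketch below is a plain standard-library version; the function and parameter names are illustrative, and the lecture does not specify its exact implementation or any multiple-testing correction.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric.

    population:   total number of genes considered
    annotated:    genes in the population carrying the GO term
    cluster_size: number of genes in the cluster
    overlap:      genes in the cluster carrying the GO term
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, x) * comb(population - annotated, cluster_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes, 3 annotated, a cluster of 4 containing 2 annotated genes.
p = enrichment_p_value(10, 3, 4, 2)
p_all = enrichment_p_value(10, 3, 4, 0)
```

For the toy numbers the tail probability is 70/210 = 1/3; asking for an overlap of at least 0 gives 1, as it should.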

Page 119: CS 6293 Advanced Topics: Translational Bioinformatics

Comparison with k-means

[Figure: overall function coherence for K-means and for mKNN-HQcut using automatically determined k = 140]

Page 120: CS 6293 Advanced Topics: Translational Bioinformatics

Application to Arabidopsis data

• ~22000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size >= 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing studies on the same data set

[Figure: top 40 most significant modules]

Page 121: CS 6293 Advanced Topics: Translational Bioinformatics
Page 122: CS 6293 Advanced Topics: Translational Bioinformatics

Cis-regulatory network of Arabidopsis

[Figure legend: Motif, Module]

Page 123: CS 6293 Advanced Topics: Translational Bioinformatics

Beyond gene clusters (1)

• Gene-specific studies
  – Collaborator is interested in Gibberellins
  – A hormone important for the growth and development of plants
  – Commercially important
  – Biosynthesis and signaling well studied
  – Transcriptional regulation of biosynthesis and signaling not yet clear
  – 3 important gene families, GA20ox, GA3ox and GA2ox, for biosynthesis
  – Receptor gene family: GID1A,B,C
  – Analyze the co-expression network around these genes

Page 124: CS 6293 Advanced Topics: Translational Bioinformatics

[Figure: labeled genes GID1A, GID1B, GID1C; 20ox1, 20ox2, 20ox3, 20ox4, 20ox5; 3ox1, 3ox2, 3ox3, 3ox4; 2ox1, 2ox2, 2ox3, 2ox4, 2ox6, 2ox7, 2ox8; GA3. Group labels: 20ox, 3ox, 2ox]

Page 125: CS 6293 Advanced Topics: Translational Bioinformatics
Page 126: CS 6293 Advanced Topics: Translational Bioinformatics

Beyond gene clusters (2)

• Cancer classification
  – Sample: tumor/normal cells
  – Alizadeh et al., Nature, 2000

[Figure: expression matrix (gene by sample) and a sample-by-sample panel, analyzed with Qcut]

Page 127: CS 6293 Advanced Topics: Translational Bioinformatics

Network of cell samples (black: normal cells; blue: tumor cells)

[Figure: labeled groups: Activated Blood B; Chronic lymphocytic leukemia (CLL); Follicular lymphoma (FL); Blood T; Transformed cell lines; Diffuse large B-cell lymphoma (DLBCL); Resting Blood B; DLBCL; DLBCL]

Page 128: CS 6293 Advanced Topics: Translational Bioinformatics

Survival rate after chemotherapy

[Figure: groups DLBCL-1, DLBCL-2, DLBCL-3, annotated as follows]

• Survival rate: 73%; median survival time: 71.3 months
• Survival rate: 40%; median survival time: 22.3 months
• Survival rate: 20%; median survival time: 12.5 months

Page 129: CS 6293 Advanced Topics: Translational Bioinformatics

Beyond gene clustering (3)

• Topology vs function

[Figure: % of essential proteins versus number of connections (Jeong et al., Nature 2001)]

Page 130: CS 6293 Advanced Topics: Translational Bioinformatics

Community participation vs. essentiality

• Key: how to systematically search for such relationships?

[Figure: % essential versus community participation, and % essential versus number of connections; groups: non-hub, hub, participation < 0.2, participation >= 0.2]
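The slide does not define "participation". One common choice, assumed here, is the participation coefficient of Guimerà and Amaral: P_i = 1 - sum over communities s of (k_is / k_i)^2, where k_is is the number of neighbors of node i in community s and k_i is its degree. The sketch below uses that definition; the function name and toy network are illustrative.

```python
from collections import Counter


def participation(neighbors, community_of):
    """Participation coefficient of every node.

    neighbors:    dict mapping node -> list of adjacent nodes
    community_of: dict mapping node -> community label
    ASSUMPTION: Guimera-Amaral definition; the lecture does not state which
    measure it uses.
    """
    result = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k == 0:
            result[node] = 0.0  # isolated node: no participation
            continue
        counts = Counter(community_of[n] for n in nbrs)
        result[node] = 1.0 - sum((c / k) ** 2 for c in counts.values())
    return result


# Toy network: "hub" links two communities, "a1" stays inside its own.
neighbors = {
    "hub": ["a1", "a2", "b1", "b2"],
    "a1": ["hub", "a2"],
    "a2": ["hub", "a1"],
    "b1": ["hub", "b2"],
    "b2": ["hub", "b1"],
}
community_of = {"hub": "A", "a1": "A", "a2": "A", "b1": "B", "b2": "B"}
p_coef = participation(neighbors, community_of)
```

A value of 0 means all of a node's links stay within one community; values near 1 mean its links are spread evenly over many communities, so a cutoff such as 0.2 would separate the two kinds of node.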