Computational Systems Biology Lectures 6 and 7: Reconstruction and structural analysis ... ·...
Transcript of Computational Systems Biology Lectures 6 and 7: Reconstruction and structural analysis ... ·...
Computational Systems Biology
1
Computational Systems Biology
Lectures 6 and 7: Reconstruction and structural analysis of
biological networks
Hongwu MaComputational Systems Biology Group
10 February 2006
Computational Systems Biology
2
Simple models
1210321 XSSX EEE ⎯→⎯⎯→⎯⎯→⎯
An enzyme reaction Enzyme kinetics
A linear pathway
BA E⎯→⎯
Metabolic control analysis
Computational Systems Biology
3
The reality
Glycolysis pathway The whole network
Computational Systems Biology
4
Systems Biology comes of age
0
100
200
300
400
1997 1999 2001 2003 2005
SB institute in seattle
1st ICSB in Japan
Review paper by L hood
Special issues in Nature and Science
1996, Systems biology unit, Glaxo WellcomeMedicines Research Centre
No. of Pubm
edpapers
Computational Systems Biology
5
The background
• Fully sequenced genomes for several hundreds organism
• High-throughput methods for measurement at whole system level (more data with less experiments)
• Various internet available databases make it easy to collect and integrate information at system level.
Computational Systems Biology
6
But how to study SB
• Methods for constructing the system (network) from available information
• Methods for theoretical analysis of such complex systems
• How “system analysis” contribute to biology
Computational Systems Biology
7
Reconstruction of metabolic networks
• Organism specific MN from its genome• High throughput reconstruction from
databases• High quality network: Human curation
Computational Systems Biology
8
Metabolic network reconstruction
Genes
b0001b0002b0003……..
Enzymes
EC 1.1.1.3EC 1.1.1.6EC 1.1.1.17
………...
Reactions
D-Glucose + ATP = D-G6P + ADPD-G6P = D-F6P
………………………
Computational Systems Biology
9
Reconstruction from genome
Database
Gene annotation
Network
CTGAGGTCGACTCTAGAGGATCCCCCTTCCAGATGTGTAAGTTA
TTTGAGTGAAATAGTGCCAGTTTAACCATAGTTCTAGTAAGCT
TTTGAGTGAAATAGTGCCAGTTTAACCATAGTTCTAGTAAGCTG
Genome sequence
ORF
Gene recognition
Sequence similarity search, Blast
ID Name start end functionEG11277 thrL 190 255 Regulatory leader peptide for thrABC operoEG10998 thrA 337 2799 Aspartokinase I-homoserine dehydrogenasEG10999 thrB 2801 3733 Homoserine kinase [EC 2.7.1.39]EG11000 thrC 3734 5020 Threonine synthase [EC 4.2.3.1]
Computational Systems Biology
10
High throughput reconstruction from enzyme databases
http://www.genome.ad.jp/dbget-bin/www_bget?enzyme+1.1.1.81ENTRY EC 1.1.1.81 NAME Hydroxypyruvate reductase CLASS Oxidoreductases Acting on the CH-OH group of donors With NAD+ or NADP+ as acceptor SYSNAME D-Glycerate:NADP+ 2-oxidoreductase REACTION D-Glycerate + NAD+ or NADP+ = Hydroxypyruvate + NADH or NADPH PATHWAY PATH: MAP00260 Glycine, serine and threonine metabolism PATH: MAP00630 Glyoxylate and dicarboxylate metabolism GENES YPE: YPO2536 PAE: PA1499 RSO: RS03094(ttuD1) RS05749(ttuD2) MLO: mlr5146 SME: SMa1406(ttuD3) SMb20678(ttuD2) SMc04389(ttuD1) ATU: Atu3232 Atu5334
Computational Systems Biology
11
Useful databases• KEGG: http://www.genome.ad.jp/kegg/• Biocyc: http://biocyc.org/• Expasy: http://www.expasy.ch• Brenda: http://www.brenda.uni-koeln.de/
Ability required: writing program to extract information from websites or downloadable text files.
Computational Systems Biology
12
From Enzyme to reaction
For EC 5.3.1.9:
4 reactions in KEGG LIGAND database
2 reactions in WIT database
1 reaction in IUBMB and Expasy databases
1 reaction in BRENDA database
1 reaction in EcoCyc
An enzyme may have wide substrate catalysis activity. E.g. 1.1.1.2: an alcohol + NADP+ = an aldehyde + NADPH + H+
Computational Systems Biology
13
KEGG LIGAND reaction databaseMore than 6000 reactions
………
ENTRY R02739
NAME alpha-D-Glucose 6-phosphate ketol-isomerase
DEFINITION alpha-D-Glucose 6-phosphate <=> beta-D-Glucose 6-phosphate
EQUATION C00668 <=> C01172
ENZYME 5.3.1.9 5.1.3.15
………
irreversibility
Computational Systems Biology
14
E. Coli Metabolic network from KEGG
Reaction
Metabolite
638 Enzymes1081 reactions
Ma & Zeng (2003) Bioinformatics. 19, 270
Computational Systems Biology
15
Problems in the high throughput methods
•Nonenzyme catalyzed reactions link
•Uncharacted enzymes like 1.-.-.-
•Unsequensed enzymes (gene for a known enzyme not identified, more than 40 in E. coli) link
•Wrongly annotated enzyme genes (the annotations in KEGG, Ecocyc and Expasy for E. coli are very different)
Computational Systems Biology
16
Enzyme databases
Annotated genome
Enzymes
EC 1.1.1.3EC 1.1.1.6EC 1.1.1.17
………...
Reactions
R0004R0007R0100
………...Metabolic network
High quality network reconstruction
KEGGExpasyBrenda
Human curation
Computational Systems Biology
17
Available high quality networks
E. coli: Ecocyc, Palsson’s group
Yeast: Palsson and Nielsen’s group
Helicobacter pylori
Haemophilus influenzae
Human?
link
Computational Systems Biology
18
Structural analysis of metabolic networks
• Graph representation• Structure properties• Network decomposition• Software for network analysis
Computational Systems Biology
19
Mathematical representation of MN
⎟⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜⎜
⎝
⎛
0010000110110100110100010
132313
6
PGPTPT
FDPPF
Glyceraldehyde 3-P
Fructose 1,6-P
Fructose 6-PH2O, Pi
1,3-P Glycerate
NAD, Pi
NADH
ATP
ADP
Glycerone P
r1 r2
r3
r4r5
T3P2
F6P
FDP
T3P1
⎟⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜⎜
⎝
⎛
−−−−
−−
−−
0111001010000000001100000000011101100000001100001100011
5
4
3
2
1
rrrrr
13PG
ATP
ADP
NAD
H
NAD
Pi
H
2O
Stoichiometric matrix Connectivity matrix
⎟⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜⎜
⎝
⎛
0110010100110100000100110
5
4
3
2
1
rrrrr
Computational Systems Biology
20
Connectivity matrix to graph
r1 r5
r4
r2
r3
T3P2
F6P
13PG
FDP
T3P1
Metabolite graph
Reaction graph
⎟⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜⎜
⎝
⎛
0010000110110100110100010
132313
6
PGPTPT
FDPPF
Connectivity matrix
⎟⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜⎜
⎝
⎛
0110010100110100000100110
5
4
3
2
1
rrrrr
Computational Systems Biology
21
Currency metabolites in graph analysis
Otherwise path length will be 2 instead of 9 (Jeong et al. 2000 Nature 407:651)
G l u c o s e
G l u c o s e 6 - P
G l y c e r a l d e h y d e 3 - P
F r u c t o s e 1 , 6 - P
F r u c t o s e 6 - P
2 - P G l y c e r a t e
P h o s p h o e n o l p y r u v a t e
A T P
A D P
H 2 O , P i
3 - P G l y c e r a t e
1 , 3 - P G l y c e r a t e
H 2 O
N A D , P i
N A D H
P y r u v a t e
A D P
A T P
A T PA D P
G l y c e r o n e P
A D PA T P
From glucose to pyruvate, ADP can not be used as a link.
Currency metabolites
Computational Systems Biology
22
With or without currency metabolites
Metabolic network of S. pneumonia (616 reactions)
Computational Systems Biology
23
Structural properties of large scale network
• Connection degree distribution• Path length (efficiency)• Node centrality (the most important nodes)• Connectivity of the network• Modules in the network• Network motifs
Computational Systems Biology
24
Connection degree distribution
0. 001
0. 01
0. 1
1
1 10 100K
P(K)
out puti nput
γ−= akkP )(
P(k): Percentage of nodes with a connection degree k or not less than k (Cumulative distribution).
Power law degree distribution indicates a scale free network
The few highly connected nodes are called hubs in the network
Computational Systems Biology
25
Hub metabolites
Glycerate-3-phosphate, D-Ribose-5-phosphate, Acetyl-CoA, Pyruvate, D-Xylulose 5-phosphateD-Fructose 6-phosphate, 5-Phospho-D-ribose 1-diphosphate, L-Glutamate, D-Glyceraldehyde 3-phosphate, L-Aspartate, Propanoyl-CoA, Malonyl-ACP, Succinate, Acetate, Isocitrate, Fumarate
Computational Systems Biology
26
Average path lengthPath length: number of the steps in the shortest paths from one node to anotherAPL: average of the path lengths for all connected pairs of nodes
Breadth first searching method to find the shortest paths from one node to all other nodes
Computational Systems Biology
27
Average path length in MNJeong et al. (2000) Nature: constant AL (about 3) for all the 43 organisms studied.
02468
1012
0 200 400 600 800 1000
Node number
Aver
age
path
leng
th
EukaryoteBacteriaArchaea
Structural differences among MNs of the three domains of organism
Ma & Zeng (2003) Bioinformatics. 19, 270
Computational Systems Biology
28
Node Centrality
dyxdnxC
xyUy
1),(
1)( ,
=−
=∑ ≠∈
Closeness centrality of node x:
d(x,y) the path length between node x and node yU the set of all nodes
average path length between x and the other nodesd
The central nodes have short path lengths to other nodes in the network
Computational Systems Biology
29
The most central metabolites in the metabolic network of E. coli
5.06G3P
5.03Acetaldehyde
4.98Acetate
4.932KD6PG
4.89Malate
4.76Actyl-CoA
4.44Pyruvate
Mean distanceMetabolite
nodes connectedhighly nodes centralmost ≠
Computational Systems Biology
30
Network Centrality
)2)(1())(()32( *
−−
−−= ∑ ∈
nnxCCn
C Ux
n node number in the networkC(x) the closeness centrality for node xC* the highest value of node closeness centrality
circle network star network
C = 0 C = 1
Network closeness centralization index
Distribution of the node centrality in the network
Computational Systems Biology
31
Average path length vs. network closeness centralization index
0
2
4
6
8
10
12
0 0.05 0.1 0.15 0.2 0.25
OCCI
ALG
EukaryoteBacteriaArchaea
C
Computational Systems Biology
32
Network Connectivity
Degree distribution tell nothing about global connectivity
The right network can have short average path length though not connected at all
Computational Systems Biology
33
Strongly or weakly connected components
Fully connected subsets depending on if direction is considered
Computational Systems Biology
34
SC in metabolic network
72
15
5 3 1 2 10
10
20
30
40
50
60
70
80
2 3 4 5 6 7 8 9 10 11 274
Size of components
Num
ber
of c
ompo
nent
s
Num
ber
GSC: Giant strong component
One big SC and many small SCs
Computational Systems Biology
35
Connectivity structure of MN
Giant strong component (GSC)metabolites fully converted and
convertible to each other
Substrate subset (93)converted to metabolite in GSC
Product subset (161)produced from metabolites in GSC
274 out of total 811 metabolites
Isolated subset (283)
Computational Systems Biology
36
Bow-tie: a general structure of biological and physical networks
Giant Strong Component
disconnected components
OUTsubset
INsubset
• Metabolic network• Signal transduction• Web page• Material processing and other tech. systems
Computational Systems Biology
37
The average path length of the GSC determines that of the whole network
0
4
8
12
0 2 4 6 8 10 12
ALW
ALG
Path length of whole network
Path
leng
th o
f GSC
Computational Systems Biology
38
Other structure parameters
• Cluster coefficient• Betweeness centrality• Reachability• Clique and core
Computational Systems Biology
39
Pajek, the right software• Written in C, very fast, many functions, 1M• Free and updated with new function frequently• good manual, theory introduction with how-to-do in
the software• Book available: Exploratory Social Network
Analysis with Pajek• http://vlado.fmf.uni-lj.si/pub/networks/pajek/ h
Computational Systems Biology
40
Other tools
• Cytoscape http://www.cytoscape.org/• Ucinet
http://www.analytictech.com/ucinet.htm• Bioconductor and R• Matlab
Computational Systems Biology
41
Network decomposition
• Identifying a somehow independent subnetwork (modules) from such complex networks for biological function analysis or developing more detailed kinetic model
Computational Systems Biology
42
WHYUniversity of Edinburgh
CSE CMVM CHSS
S1 S2 S3 S1 S2 S3 S1 S2 S3
Computational Systems Biology
43
A Method• Define a distance between two nodes: the shortest path
between node A and node B• Distance matrix a hierarchical classification tree
Computational Systems Biology
44
Hierarchical decomposition of GSC
6
4
3
2
1
5
1
4
3
2
5
5
Computational Systems Biology
45
Biological function of these modules
1. Aspartate metabolism, Thr, Lys synthesis2. Glutathione and THF metabolism 3. Glutamate metabolism, Arg, Pro synthesis4. galacterate, glycerate metabolism, Ser synthesis5. PP pathway, upper part of glycolysis pathway,
sugar, aminosugar and glycerol metabolism6. TCA cycle and glyoxylate cycle, pyruvate
metabolism, AcCoA, AcAcCoA metabolism
Modules identified by the decomposition method have true biological meaning
Computational Systems Biology
46
Transcriptional regulation: biological basis
• Transcription unit: operon• Sigma factor• Transcription factor (TF)• Promotor• TF binding site• More complex in Eukaryotes
Transcription unit
RNA polymerase
Sigma factor
Promotor
TFs
Binding site Genes
Interactions between proteins and genes
Computational Systems Biology
47
Reconstruction of MN and TRN
E. coli
P. aeruginosa
A1 B1
A2 B2
1.2.1.3
mA
mB
TF
+
+?
Are the orthologous genes in different organisms regulated in similar ways?
Computational Systems Biology
48
Transcriptional regulatory networks
• Can not be reconstructed directly from gene annotation information
• data is mainly from database which collect information from literatures (for E. coli and B. subtilis)
• ChIP on chip technology for high throughput reconstruction (mainly yeast network)
Computational Systems Biology
49
Available databases
• RegulonDB(http://www.cifn.unam.mx/Computational_Genomics/regulondb/)
• Ecocyc (ecocyc.org)• DBTBS (B. subtilis) (dbtbs.hgc.jp)• Prodoric Net (Prokaryote) (prodoric.tu-bs.de)• TRANSFAC (Eukaryotes)
(www.biobase.de/pages/products/transfac.html)
Computational Systems Biology
50
E. coli regulatory network1278 genes 2724 interactions157 genes for TFs382 metabolic enzyme genes (Blue)
Ma et al. (2004) Nucleic Acid Research 32, 6643
Computational Systems Biology
51
Connectivity analysisA giant weakly connected component with 1166 nodes
No strongly connected component exists, indicating no multiple gene regulatory loop exist, impling a somehow acyclic structure.
Computational Systems Biology
52
Method to find regulatory hierarchy
Computational Systems Biology
53
The nine-layer hierarchical structure
Average path length: 1.85
rpoE
fis
soxR
cspA
rpoS
phoB
crp
ihf
argP
cspE
rpoNdnaA
hnscytR
Like the organization of the university?
Computational Systems Biology
54
Network motifSubgraphs that appear in the network at frequencies much higher than those found in randomized networks (Shen-Oorret al., Nature Genetics, 31:64).
Regarded as elementary building blocks of network
Milo et al. Network motifs: simple building blocks of complex networks.Science. 2002, 298(5594):824-7.
Computational Systems Biology
55
Network motifs in regulatory network
Feed forward loops
712 in the network
Bi-fan motif
11935
Computational Systems Biology
56
Motifs in the network
Blue: FFLs
Yellow: Bi-fan motifs
Green: Both
Computational Systems Biology
57
Two kinds of FFLs
+
+
++
−
+
Coherent FFLs Incoherent FFL
Direct effect of the upper regulator A on the target gene C is consistent with its indirect effect through B.
Direct effect is inconsistent with the indirect effect.
C
BB
AA
C
Mangan and Alon, (2003) PNAS, 100:11980
Computational Systems Biology
58
Motif distributionCoherent FFLs: 330
+
−
++
−
−−
−
−−
+−
+
++
+
+−
−
++type
number 265 28 7 30 119 9 8 16
examples
flhC-fliA-
fliGH
cpxR-csgD-csgA
fis-hns-cysG
fnr-narL-dcuB
crp-nagC-manX
YZ
arcA-betI-
betAB
ihf-flhD-nrfA
fnr-narL-
moeAB
−
−
+
Incoherent FFLs: 152
Computational Systems Biology
59
Motifs often interact with each other
hns
gadW
gadA
gadX
crprpoS
gadA: glutamate decarboxylase (in aminobutarate pathway, a by path of TCA cycle), important in oxidative stress and acid stress response
Computational Systems Biology
60
references
H.W. Ma and A.P. Zeng. (2003) Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics. 19(2)270-277. H.W. Ma and A.P. Zeng. (2003) The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics. 19(11)1423-1430. H.W. Ma, et al (2004) An extended transcriptional regulatory network of Escherichia coli and its structural analysis, Nucleic Acid Research, 32:6643Barabási and Oltvai (2004) Network Biology: Understandingthe cell‘s functional organization. Nature Rev. Genet. 5, 101-114.