Cytoscape: An open-source software platform for the exploration of molecular interaction networks
Benno Schwikowski
Systems Biology Group – UP Biologie Systémique, Institut Pasteur, Paris
Overview
1. Molecular interaction networks
2. The Cytoscape platform
3. Active modules
Biology poses difficulties for approaches from physics and engineering
Courtesy L. Hood
• What are the effects of mutations on the system?
• How do multiple mutations modify interactions with the environment?
• How are molecular processes controlled?
Biological and engineering models
Lazebnik, Cancer Cell, Sep. 2002
Molecular interaction networks
Cartoons may not be enough
… and drawn cartoons
“Taken together, these data suggest that EBF & E2A may be less important than Pax-5 for regulating low-level mb-1 transcription at later stages of development.”
Mol. Cell Biol. Dec’02, p.8850
These stories might be
• too qualitative
• too hard to integrate
Cartoons are hard to combine
• The stories are designed around specific questions
• They are not written against a single conceptual scaffold
• We may not be able to integrate cartoons into coherent models
Courtesy H. Bolouri
Uses for the study of biological networks and systems
• Better characterize the function of single genes
• Help structure, represent and interpret experimental data on interactions and states
  – Integrate different types of experimental data
  – Relate mechanisms to states
• Build a detailed understanding of cellular processes
  – Allow prediction of cellular state observables at different levels of detail
  – Allow intervention with predictable & measurable outcomes
• Guide experiments by providing testable hypotheses
• Compare processes within and across organisms
The source of the network parts list
DNA, mRNA, Proteins, Pathways, Networks, Cells, Tissues, Organs, Individuals, Populations, Ecosystems
DNA sequences
DNA sequencer
Sources of network state information
DNA, mRNA, Proteins, Pathways, Networks, Cells, Tissues, Organs, Individuals, Populations, Ecosystems
DNA microarray
Sources of network state information
DNA, mRNA, Proteins, Pathways, Networks, Cells, Tissues, Organs, Individuals, Populations, Ecosystems
Mass spectrum
Mass spectrometer
Sources of network interaction information: The Two-Hybrid System
[Figure labels: Reporter Gene; Bait Protein with Binding Domain; Prey Protein with Activation Domain]
• Two hybrid proteins are generated with transcription factor domains
• Both fusions are expressed in a yeast cell that carries a reporter gene whose expression is under the control of binding sites for the DNA-binding domain
Sources of network interaction information: The Two-Hybrid System
[Figure labels: Reporter Gene; Bait Protein with Binding Domain; Prey Protein with Activation Domain]
• Interaction of bait and prey proteins localizes the activation domain to the reporter gene, thus activating transcription.
• Since the reporter gene typically codes for a survival factor, yeast colonies will grow only when an interaction occurs.
Sources of network information: ChIP-chip
From Richard Young’s website: http://web.wi.mit.edu/young/location/
Chromatin ImmunoPrecipitation-Chip (ChIP-chip) analysis (Ren et al., Science, 2000)
Metabolic networks are fairly detailed
• Stoichiometric matrix: network topology with stoichiometry of biochemical reactions

Constraint      Condition           Solution space
Mass balance    $S \cdot v = 0$     Subspace of $\mathbb{R}^n$
Thermodynamic   $v_i > 0$           Convex cone
Capacity        $v_i < v_{\max}$    Bounded convex cone

Example reaction (Glucokinase): Glucose + ATP → Glucose-6-Phosphate + ADP

Glucokinase column of the stoichiometric matrix:
Glucose   −1
ATP       −1
G-6-P     +1
ADP       +1
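The three constraints above can be checked numerically for a candidate flux vector. The following is a minimal sketch, not taken from the slides: the glucokinase column comes from the slide, while the four exchange reactions (`glucose_in`, `atp_in`, `g6p_out`, `adp_out`), the flux values, and the bound `vmax` are hypothetical additions needed so that a non-zero steady state exists.

```python
# Rows: metabolites; columns: reactions.
METABOLITES = ["Glucose", "ATP", "G-6-P", "ADP"]
REACTIONS = ["glucokinase", "glucose_in", "atp_in", "g6p_out", "adp_out"]

# Column 0 is the glucokinase column from the slide; columns 1-4 are
# hypothetical exchange reactions that supply substrates and drain products.
S = [
    [-1, +1, 0, 0, 0],   # Glucose
    [-1, 0, +1, 0, 0],   # ATP
    [+1, 0, 0, -1, 0],   # G-6-P
    [+1, 0, 0, 0, -1],   # ADP
]

def mass_balance_residual(S, v):
    """Return S·v, one entry per metabolite (all zeros at steady state)."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def is_feasible(S, v, vmax, tol=1e-9):
    """Check mass balance (S·v = 0), thermodynamics (v_i > 0), capacity (v_i < vmax)."""
    balanced = all(abs(r) <= tol for r in mass_balance_residual(S, v))
    return balanced and all(0 < v_i < vmax for v_i in v)

steady = [2.0, 2.0, 2.0, 2.0, 2.0]      # every reaction carries the same flux
unbalanced = [2.0, 1.0, 2.0, 2.0, 2.0]  # glucose is consumed faster than supplied

print(is_feasible(S, steady, vmax=10.0))
print(is_feasible(S, unbalanced, vmax=10.0))
```

Each constraint removes part of the flux space, which is why the slide moves from a subspace to a cone to a bounded cone.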
“Comparative assessment of large-scale data sets of protein-protein interactions.” Von Mering, C. Nature 2002
“Among the [protein-protein] interactions proposed by high-throughput methods will be many false positives. In fact, we estimate that more than half of all current high-throughput data are spurious.”
Gene regulation can be complex
Yuh, Bolouri, Davidson, Science, 1998
High- and low-level modeling may be combined
Ideker and Lauffenburger, Trends in Biotechnology 2003
High- and low-level modeling
Ideker and Lauffenburger, Trends in Biotechnology 2003
Cytoscape: Network Visualization and Analysis
Courtesy M. Smoot
http://cytoscape.org
Cytoscape Overview
Rich network visualizations
Powerful data mapping
Handles large networks
Supports many standards
Large community
Free (open-source)!
Network Data Import
SIF (Simple Interaction Format)
GML (Graph Markup Language)
XGMML (eXtensible Graph Markup and Modeling Language)
BioPAX (Biological Pathway Exchange)
PSI-MI 1 & 2.5 (Proteomics Standards Initiative, Molecular Interactions)
SBML Level 2 (Systems Biology Markup Language)
Formatted Text and Excel Files
Network Attribute Management
Data Integration
1. Network Data
YDR382W pp YDL130W
YDR382W pp YFL039C
YFL039C pp YCL040W
YFL039C pp YHR179W

2. Attribute Data
ExpressionValue
YCL040W = 0.542
YDL130W = -0.123
YDR382W = -0.058
YFL039C = 0.192
YHR179W = 0.078

VizMapper
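The two inputs on this slide can be joined by node name. The sketch below is a standalone illustration of that idea, not Cytoscape's own importer; the function names `parse_sif` and `parse_node_attributes` are hypothetical, and the data is copied from the slide.

```python
SIF_TEXT = """\
YDR382W pp YDL130W
YDR382W pp YFL039C
YFL039C pp YCL040W
YFL039C pp YHR179W
"""

ATTR_TEXT = """\
ExpressionValue
YCL040W = 0.542
YDL130W = -0.123
YDR382W = -0.058
YFL039C = 0.192
YHR179W = 0.078
"""

def parse_sif(text):
    """Parse 'source interaction target [target ...]' lines into edge triples."""
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            source, interaction, *targets = parts
            edges.extend((source, interaction, target) for target in targets)
    return edges

def parse_node_attributes(text):
    """Parse an attribute name line followed by 'node = value' lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    name = lines[0].strip()
    values = {}
    for line in lines[1:]:
        node, _, raw = line.partition("=")
        values[node.strip()] = float(raw)
    return name, values

edges = parse_sif(SIF_TEXT)
attr_name, expression = parse_node_attributes(ATTR_TEXT)

# Integrate: annotate every edge with the attribute values of both endpoints.
for source, interaction, target in edges:
    print(source, expression[source], interaction, target, expression[target])
```

Keeping network data and attribute data in separate files means the same network can be re-annotated with new measurements without being re-imported.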
VizMapper
Map network state data onto visual attributes.
Attributes for nodes and edges.
Very Flexible.
Expression Data Node Color
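Mapping expression data to node color amounts to a continuous function from a number to a color. The sketch below illustrates that idea only; it does not use the VizMapper API. The blue/white/red palette, the value range of −1 to 1, and the name `expression_to_color` are assumptions chosen for the example.

```python
LOW, MID, HIGH = (0, 0, 255), (255, 255, 255), (255, 0, 0)  # assumed palette

def _blend(c0, c1, s):
    """Linearly interpolate between two RGB colors, with s in [0, 1]."""
    return tuple(round(a + (b - a) * s) for a, b in zip(c0, c1))

def expression_to_color(value, vmin=-1.0, vmax=1.0):
    """Map a value to a hex color: LOW at vmin, MID at the midpoint, HIGH at vmax."""
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))            # clamp out-of-range values
    if t < 0.5:
        rgb = _blend(LOW, MID, t / 0.5)
    else:
        rgb = _blend(MID, HIGH, (t - 0.5) / 0.5)
    return "#{:02x}{:02x}{:02x}".format(*rgb)

for gene, value in [("YCL040W", 0.542), ("YDL130W", -0.123), ("YHR179W", 0.078)]:
    print(gene, value, expression_to_color(value))
```

A diverging palette with a neutral midpoint keeps unchanged genes visually quiet, so the eye is drawn to strongly changed nodes.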
Layout Algorithms
Network Editor
Filters
Linkout
Nodes and Edges act as hyperlinks to external databases.
User configurable URLs.
Large Networks
19,462 Nodes
31,130 Edges
Only half of what's possible!
Other Features
Manual layout manipulation tools
− align, scale, rotate
Manually override visual styles
Undo
− Can undo most modifications to graphs
Publication-quality graphics
− Export PDF, SVG, PS
Cytoscape is Extensible
Cytoscape is open-source.
We provide a plug-in interface that allows anyone to write and distribute their own extensions to Cytoscape.
Plug-ins represent the primary analysis mechanism in Cytoscape.
Plug-ins are distributed from a central database and can be installed while Cytoscape is running.
Plugin Examples
BiNGO (Analysis of GO categories found in network)
GenePro (Protein-Protein interaction cluster visualization)
jActiveModules (Search for significant networks)
NetworkAnalyzer (Statistical analysis of networks)
Agilent Literature Search (Network creation)
CyGoose (Gaggle communication)
See http://cytoscape.org for many more
Running Cytoscape
Cytoscape is licensed under the LGPL and is therefore freely available to everyone.
Cytoscape is written in Java and therefore runs on Windows, Mac, and Linux.
Cytoscape can be run locally or using Webstart.
Cytoscape applications
Cytoscape facilitates:
− Network Visualization
− Network Analysis
− Data Integration
− A framework for new types of analysis
Cytoscape Consortium
UC San Diego (Trey Ideker)
Institute for Systems Biology (Leroy Hood/Ilya Shmulevich)
Memorial Sloan-Kettering Cancer Center (Chris Sander)
University of Toronto (Gary Bader)
Agilent Technologies (Annette Adler)
Unilever (Guy Warner)
UC San Francisco (Bruce Conklin)
Institut Pasteur (Benno Schwikowski)
NIGMS/NIH GM070743-01
Getting started with Cytoscape
Tutorials on Cytoscape.org
Nature Protocols paper
Systemsbiology.fr
Active Modules
Protein interaction networks
Protein-protein interactions in yeast
Questions
• Is there any correlation between protein interactions and other attributes of proteins?
• Is that correlation significant, i.e., would it not easily occur in random data?
Functionally related proteins occur as clusters of interacting proteins
Protein interactions contain information about cellular roles
Simple prediction algorithm for the cellular role of a protein
1) Rank known cellular roles among the interactors from most frequent to least frequent.
2) Take the first three (or fewer) roles as predictions.
Accuracy on 1,393 out of 2,039 proteins: 72% (6 out of 8)
…on 100 scrambled networks: 12% (1 out of 8).
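The two-step prediction rule above can be sketched in a few lines. This is an illustrative reading of the slide, not the original analysis code: the protein names, role annotations, and the function name `predict_roles` are hypothetical, and ties between equally frequent roles are broken by first appearance, which the slide does not specify.

```python
from collections import Counter

def predict_roles(protein, interactions, known_roles, top=3):
    """Rank roles of the protein's interactors by frequency; return the top few."""
    counts = Counter()
    for a, b in interactions:
        if protein not in (a, b) or a == b:
            continue                      # unrelated edge or self-interaction
        partner = b if a == protein else a
        counts.update(known_roles.get(partner, ()))
    return [role for role, _ in counts.most_common(top)]

# Hypothetical toy data: protein "X" has four annotated interaction partners.
interactions = [("X", "A"), ("X", "B"), ("C", "X"), ("X", "D"), ("A", "B")]
known_roles = {
    "A": ["RNA splicing"],
    "B": ["RNA splicing", "transport"],
    "C": ["RNA splicing"],
    "D": ["transport"],
}

print(predict_roles("X", interactions, known_roles))
```

The scrambled-network comparison on the slide serves as the null model: the same rule applied to randomized interactions shows how often a prediction would be right by chance.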
Protein interactions provide context information
RNA splicing
Mayer & Hieter, Nature Biotechnology 2000
Modular structure of cellular networks
Hartwell et al., Nature 1999
The cell as an information processor
Hartwell et al., Nature 1999
Advantage of modules
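• Theoretical: There are $2^{2^n}$ different Boolean functions on n variables (a truth table on n variables has $2^n$ rows)
• Practical implication: There are fewer components and fewer experiments to perform

x1 x2 x3 | f
 0  0  0 | 0
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 1
 1  0  0 | 0
 1  0  1 | 1
 1  1  0 | 0
 1  1  1 | 1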
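A short sketch makes the counting argument concrete and checks one modular reading of the truth table. The decomposition f = x3 AND (x1 OR x2) is an observation about the table's f column (00010101), not something stated on the slide, and the function names are hypothetical.

```python
from itertools import product

def count_boolean_functions(n):
    """A truth table on n inputs has 2**n rows, each freely set to 0 or 1."""
    return 2 ** (2 ** n)

F_COLUMN = "00010101"  # f column from the slide, rows ordered 000 .. 111

rows = list(product((0, 1), repeat=3))            # (x1, x2, x3) in table order
truth_table = {row: int(bit) for row, bit in zip(rows, F_COLUMN)}

def inner_module(x1, x2):
    """Hypothetical sub-module: combines x1 and x2 into one signal."""
    return x1 or x2

def modular_f(x1, x2, x3):
    """f expressed through the sub-module: x3 AND (x1 OR x2)."""
    return int(x3 and inner_module(x1, x2))

print(count_boolean_functions(3))
print(all(modular_f(*row) == out for row, out in truth_table.items()))
```

If the sub-module is characterized on its own (4 input combinations) and then treated as a single input to the outer function (4 combinations), the combined behavior is covered by 8 smaller, separable experiments, and the saving grows quickly with n.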
Molecular Interaction Network
The “system” notion
Approach
1. Use interaction data: The system components have to interact with each other
2. Use state data: System components have to change synchronously
Conditions ->  gal1D  gal2D  gal3D  gal4D  gal5D  gal6D  gal7D  ga…
COX6           0.034  0.052  0.152  0.111  0.198  0.097  0.171
NDT80          0.09   0      0.041  0.007  0.157  0.035  0.037
PRS1           0.167  0.063  0.23   0.233  0.003  0.234  0.25
UPF3           0.245  0.415  0.253  0.471  0.115  0.111  0.061
OPI1           0.174  0.045  0.046  0.015  0.098  0.001  0.029
YGR145W        0.387  0      0.036  0.577  0.151  0.255  0.101
YGL041C        0.285  0.232  0.126  0.086  0.096  0.002  0.21
CRM1           0.018  0.009  0.07   0.001  0.052  0.028  0.017
HIS3           0.432  0.568  0.339  0.71   0.188  0.07   0.619
CIT2           0.085  0.272  0.038  0.392  0.168  0.077  0.416
KHS1           0.159  0.168  0.149  0.139  0.293  0.023  0.043
YBR026C        0.276  0.072  0.324  0.189  0.014  0.142  0.243
YMR244W        0.078  0      0.077  0.239  0.077  0.254  0.126
YMR317W        0.181  0.324  0.065  0.086  0.288  0.122  0.233
YAR047C        0.234  0.121  0.019  0.109  0.107  0.05   0.156
DAL7           0.289  0.168  0.09   0.161  0.017  0.041  0.091
YDL177C        0.002  0.295  0.041  0.367  0.183  0.205  0.085
YLR338W        0.216  0.091  0.051  0.096  0.07   0.044  0.082
YGR073C        0.125  0.394  0.056  0.126  0.218  0.088  0.122
YGR146C        0.189  0.308  0.345  0.067  0.432  0.014  0.116
Approach – Summary
1. Interaction network between genes/proteins
2. Differential gene/protein abundances/activities
[Figure: expression matrix with genes as rows and experiments (conditions gal1D, gal2D, …) as columns; the values repeat the table on the previous slide.]
Comparison to clustering
1. Connectivity by scaffold of protein interactions
   Direct causal explanations and testable hypotheses
2. Significant change observed under certain experimental conditions
   Module need not be active under all experimental conditions
Galactose induction pathway
Ideker et al. Science 292: 929 (2001)
What are the underlying regulatory interactions responsible for the observed changes in gene expression?
• Protein–protein interactions: BIND, ~6,300 proteins, 55,785 interactions in yeast
• RNA expression data: 20 perturbations of the galactose utilization pathway
• Protein→DNA interactions: Transfac/ChIP data, ~10,000 interactions for yeast
• Protein expression data: abundances, modifications, translation states
• Small-molecule interactions: metabolites, drugs, and hormones (KEGG, enzymes, etc.)
• Metabolic profiles: abundances may soon be available on a global scale
INTEGRATED MOLECULAR INTERACTION NETWORK
This technique is extensible to a variety of data types.
The galactose pathway in our network representation
[Figure legend: edge types protein→DNA and protein–protein; node color shows expression change (log10) on a scale from −3 through 0 to +3]
Ideker et al. Science 292: 929 (2001)
We consider only the significance of change, not its direction.
Module – A mathematical definition
A scoring system for regulatory “activity”
• Assign significance to each gene expression change† and express it as a z-score
• The z-score of an entire subnetwork is the normalized sum of the scores of its nodes
† Ideker, Thorsson, Siegel, and Hood, J. Comp. Bio. 7: 805 (2000)
[Figure: matrix of z-scores with genes A, B, C, D as columns and perturbations/conditions 1 to 4 as rows. Entry magnitudes as extracted, row by row: 0 3 1 2 / 2 3 0 3 / 2 0 1 1 / 1 2 2 1; the positions of most minus signs are not recoverable.]
Conversion between p-value and z-score: p(1.0) = 0.159, z(0.159) = 1.0
Combining z-scores under one condition
Genes A, B, C, D with z-scores 1, 2, −2, 1:
$z = \frac{1 + 2 - 2 + 1}{\sqrt{4}} = 1$
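The worked example and the p-value/z-score conversion quoted on the scoring slide can be reproduced as follows. This is a minimal sketch assuming that p denotes the one-sided upper tail of a standard normal distribution, which matches p(1.0) = 0.159; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def aggregate_z(z_scores):
    """Normalized sum: the sum of k z-scores divided by sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def p_from_z(z):
    """One-sided upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p(p):
    """Inverse of p_from_z: the z-score whose upper tail has probability p."""
    return NormalDist().inv_cdf(1.0 - p)

subnetwork = [1, 2, -2, 1]            # genes A, B, C, D under one condition
print(aggregate_z(subnetwork))        # (1 + 2 - 2 + 1) / sqrt(4)
print(round(p_from_z(1.0), 3))
print(round(z_from_p(0.159), 2))
```

Dividing by the square root of k keeps the aggregate on the same scale as a single z-score when the individual scores are independent standard normal values, so subnetworks of different sizes can be compared.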
Scoring over multiple perturbations/conditions
[Figure: the subnetwork’s z-scores across perturbations/conditions, ranked as A(1), A(2), A(3), A(4)]
Scoring over multiple conditions: Rank adjustment
• What is the probability that, out of m z-scores, the first j ones are larger than A(j)?
• Idea: Compute the probability that j or more z-scores are larger than A(j).
[Equation and its definitions not preserved in this transcript]
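The probability that j or more of m z-scores exceed a threshold is a binomial tail. The sketch below is one reading of the rank-adjustment idea, not the formula from the slide, which is not preserved here. It assumes the m conditions are independent, that each z-score is standard normal under the null, and that A(j) is the j-th largest score; the function names are hypothetical.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def rank_adjusted_p(z_scores, j):
    """P(at least j of m independent null z-scores exceed the j-th largest score)."""
    m = len(z_scores)
    ranked = sorted(z_scores, reverse=True)   # A(1) >= A(2) >= ... >= A(m)
    p = upper_tail(ranked[j - 1])             # chance that one score exceeds A(j)
    return sum(math.comb(m, h) * p**h * (1.0 - p) ** (m - h)
               for h in range(j, m + 1))

# Hypothetical subnetwork scores under four conditions.
scores = [3.1, 2.4, 0.3, -0.5]
for j in range(1, len(scores) + 1):
    print(j, rank_adjusted_p(scores, j))
```

Evaluating every j and keeping the most significant value lets a subnetwork score well when it is active in only some conditions, which is the property the later slides rely on.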
Scoring over multiple perturbations/conditions
[Figure: scores across perturbations/conditions combined into a final score]
Different overlapping condition sets
Each subnetwork is active for a subset of conditions
Running the algorithm again on the high-scoring, 340-gene subnetwork reveals further structure
Each condition may appear several times, or not at all, depending on (a) how significant it is and (b) how well it is explained by the interaction network.
Pathways in Rosetta’s compendium (300 conditions)
THANK YOU FOR YOUR ATTENTION
Finding good modules
Finding good modules in a large network is hard
• Once specified, we can easily score a particular pathway.
• But how to identify the highest-scoring pathways in a full molecular interaction network of thousands of nodes and interactions?
• This problem is NP-hard.
• We use a customized version of a general-purpose algorithm, based on simulated annealing, to detect high-scoring pathways from the data.
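A toy version of a simulated-annealing search shows the general shape of such a heuristic. This is a sketch under stated assumptions, not the customized algorithm used in the talk: the state is a set of "active" nodes, a move toggles one node, and the score is that of the highest-scoring connected component (sum of z-scores divided by the square root of its size). The network, z-scores, cooling schedule, and function names are all hypothetical.

```python
import math
import random

def best_component(active, adjacency, z):
    """Return (score, nodes) of the best connected component among active nodes."""
    best, seen = (0.0, set()), set()
    for start in active:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:                               # depth-first traversal
            node = stack.pop()
            comp.add(node)
            for nb in adjacency.get(node, ()):
                if nb in active and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        score = sum(z[n] for n in comp) / math.sqrt(len(comp))
        if score > best[0]:
            best = (score, comp)
    return best

def anneal(adjacency, z, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Toggle one node per step; accept worse states with probability exp(delta/T)."""
    rng = random.Random(seed)
    nodes = sorted(z)
    active = {n for n in nodes if rng.random() < 0.5}
    score, comp = best_component(active, adjacency, z)
    best = (score, set(comp))
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        node = rng.choice(nodes)
        active ^= {node}                                      # toggle membership
        new_score, comp = best_component(active, adjacency, z)
        if new_score >= score or rng.random() < math.exp((new_score - score) / temp):
            score = new_score
            if score > best[0]:
                best = (score, set(comp))
        else:
            active ^= {node}                                  # reject: undo toggle
    return best

# Hypothetical path network a-b-c-d-e with per-gene z-scores.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
Z = {"a": 3.0, "b": 2.5, "c": -3.0, "d": 0.1, "e": 2.0}
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

score, module = anneal(ADJ, Z)
print(sorted(module), round(score, 3))
```

Accepting some score-lowering moves early on lets the search escape local optima; as the temperature falls, it behaves more and more like greedy hill climbing.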
Computational complexity
• 6,000 genes form up to $2^{6{,}000}$ possible gene sets
• 300 conditions have $2^{300}$ subsets ⇒ $2^{180{,}000} > 10^{50{,}000}$ combinations to search
• Finding the highest-scoring gene set is NP-hard, even for a single condition
NP-hardness
• NP-hardness is a property of computational problems
• It implies that the problem is at least as hard as thousands of other well-known problems: an efficient algorithm for it would yield efficient algorithms for all of them
• Efficient algorithms for NP-hard problems are unknown (and probably don’t exist)
• Thus, we need to look for approximation or heuristic algorithms
20 GAL conditions vs. the entire interaction network
Several subnetworks emerge
Detail of subnetwork 1b: Galactose metabolism
Our method is only concerned with the significance of change, not its direction.
Gal4 doesn’t show dramatic expression change, but it is included because it connects and explains the other genes’ differential expression.
Galactose induction pathway
Ideker et al. Science 292: 929 (2001)
SUMMARY
• Method for explaining gene expression profiles with molecular interactions found in the public databases.
• Results in testable hypotheses for the signaling and regulatory pathways behind observed gene expression changes.
Features of this approach
• Tries to define clusters of genes that show similar concerted reactions to perturbations
• Incorporates many data types
• Robust against noise and false-positive interactions
• Identifies many experiment-specific networks
• Interpretive framework offers testable hypotheses
The “system” notion