Rich Probabilistic Models for Gene Expression
Eran Segal (Stanford)
Ben Taskar (Stanford)
Audrey Gasch (Berkeley)
Nir Friedman (Hebrew University)
Daphne Koller (Stanford)
Our Goals
Find patterns in gene expression data
[Figure: expression data matrix with axes Genes and Experiments]
Data Organization
A_ij: mRNA level of gene i in experiment j
[Figure: expression matrix with axes Genes (i) and Experiments (j); entries marked Induced or Repressed]
Standard Clustering Organization
Bi-Clustering Organization
[Figure: expression matrix with axes Genes and Experiments; label: Undetected Similarity]
Note: rows and columns no longer correspond to genes and experiments
Desired Organization
Detect similarities over subsets of genes and experiments
[Figure: additional data sources: clinical information; experimental details; annotations (GO, MIPS, YPD); DNA sequence (ACGCCTA)]
Incorporate Heterogeneous Data
Find correlations directly
Focus on novel discoveries
Data sources: clinical information; experimental details; annotations (GO, MIPS, YPD); DNA sequence (ACGCCTA)
Our Approach
[Diagram: a LEARNER produces a model relating Level to Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, and Exp. type; the model yields hypotheses]
Probabilistic Relational Models (Koller & Pfeffer 98; Friedman, Getoor, Koller & Pfeffer 99)
[Schema: class Gene with attribute Gene Cluster; class Experiment with attribute Exp. cluster; class Expression with attribute Level]
Resulting Bayesian Network
[Diagram: the schema (Gene: Gene Cluster; Experiment: Exp. cluster; Expression: Level) unrolled into a Bayesian network over Gene Cluster1 to Gene Cluster3, Exp. Cluster1 and Exp. Cluster2, and one node Level i,j for each gene i and experiment j]
CPD for Level:
GCluster  ECluster  Parameters
1         1         0.8   1.2
1         2         -0.7  0.6
…
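The unrolling step above can be sketched in a few lines of Python: every (gene, experiment) pair gets its own Level variable, and all of them share one CPD row selected by the pair's cluster assignments. This is a minimal sketch, not the authors' software. The cluster assignments, the reading of the two CPD numbers as a Gaussian mean and standard deviation, and every name below are assumptions for illustration; the slide gives only the two table rows.

```python
import random

# Shared CPD: (gene cluster, experiment cluster) -> parameters of Level.
# The two rows come from the slide; treating them as a Gaussian
# (mean, standard deviation) is an assumption.
CPD = {(1, 1): (0.8, 1.2), (1, 2): (-0.7, 0.6)}

# Hypothetical cluster assignments for a tiny data set.
gene_cluster = {"g1": 1, "g2": 1, "g3": 1}
exp_cluster = {"e1": 1, "e2": 2}


def unroll(gene_cluster, exp_cluster, cpd):
    """Create one Level variable per (gene, experiment) pair.

    Each variable refers to the shared CPD row chosen by its two parents.
    """
    network = {}
    for gene, gc in gene_cluster.items():
        for exp, ec in exp_cluster.items():
            network[(gene, exp)] = cpd[(gc, ec)]
    return network


def sample_levels(network, rng):
    """Draw one expression level for every Level variable."""
    return {pair: rng.gauss(mean, std) for pair, (mean, std) in network.items()}


network = unroll(gene_cluster, exp_cluster, CPD)
levels = sample_levels(network, random.Random(0))
print(len(network), network[("g2", "e2")])
```

The point of the sketch is parameter sharing: three genes and two experiments produce six Level variables, but only two CPD rows are ever stored.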
Probabilistic Relational Models
[Schema: Gene (Gene Cluster); Experiment (Exp. cluster); Expression (Level)]
[Plots: P(Level) densities over Level, one centered at 0.8 and one centered at -0.7]
Adding Heterogeneous Data
[Schema: Gene (Gene Cluster; annotations: Lipid, Endoplasmatic; binding sites: HSF, GCN4); Experiment (Exp. cluster; experimental details: Exp. type); Expression (Level)]
Resulting Bayesian Network
[Diagram: the schema combined with the data (experimental details; annotations from GO, MIPS, YPD; sequence such as ACGCCTA) and unrolled into a Bayesian network. Each gene i = 1, 2, 3 has Gene Cluster i, Lipid i, HSF i, Endoplasmatic i, and GCN4 i; each experiment j = 1, 2 has Exp. cluster j and Exp. type j; each pair has a node Level i,j]
Problem: Exponential Blowup
[Schema: Gene (Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4); Experiment (Exp. cluster, Exp. type); Expression (Level)]
CPD for Level:
GC  LP  END  HSF  EC  TYP  Parameters
1   No  No   No   1   1    0.8  1.2
1   No  No   No   1   2    0.7  0.6
…
6 parents: 2^6 cases
k parents: 2^k cases!
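The count is easy to check by enumerating the rows of a full-table CPD. The sketch below treats all six parents as binary, which is what the 2^k figure assumes; in the model itself the cluster variables may take more than two values, which would only make the table larger. The helper name `table_rows` is hypothetical.

```python
from itertools import product


def table_rows(parent_values):
    """Enumerate every joint parent assignment of a full-table CPD."""
    names = list(parent_values)
    return [dict(zip(names, combo)) for combo in product(*parent_values.values())]


# Treat all six parents as binary, as the slide's 2^k count does.
parents = {name: (False, True) for name in ["GC", "LP", "END", "HSF", "EC", "TYP"]}
rows = table_rows(parents)
print(len(rows))  # 64 == 2**6
```

Each row needs its own distribution parameters, so every added parent doubles the number of parameters to estimate from the same data.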
Solution: Context Specificity
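[Schema: Gene (DNA repair); Experiment (UV Light); Expression (Level)]
[Figure: one P(Level) distribution for each parent combination: UV = No, Repair = Yes; UV = No, Repair = No; UV = Yes, Repair = Yes; UV = Yes, Repair = No]
Ultra Violet Light -> DNA Damage -> DNA repair genes transcribed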
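Solution: Context Specificity
[Schema: Gene (DNA repair); Experiment (UV Light); Expression (Level)]
[Figure: P(Level) distributions grouped by UV = No and UV = Yes; cases with the same distribution are merged]
Ultra Violet Light -> DNA Damage -> DNA repair genes transcribed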
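Solution: Context Specificity
[Schema: Gene (DNA repair); Experiment (UV Light); Expression (Level)]
[Tree CPD for Level: test UV = Yes; on the true branch, test Repair = Yes, with one leaf for true and one for false; on the false branch, a single leaf]
Ultra Violet Light -> DNA Damage -> DNA repair genes transcribed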
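A tree-structured CPD of this kind can be sketched as follows. The tree shape follows the slide (split on UV, then on Repair only when UV is true). The class names, the leaf parameters, and the choice of a Gaussian leaf are assumptions for illustration, not values from the talk.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    mean: float  # hypothetical Gaussian mean of Level at this leaf
    std: float   # hypothetical Gaussian standard deviation


@dataclass
class Split:
    attribute: str      # parent variable tested at this node
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Split]

# UV = Yes? If so, Repair = Yes? Otherwise a single shared leaf.
tree = Split(
    "UV",
    if_true=Split("Repair", if_true=Leaf(2.0, 0.5), if_false=Leaf(0.0, 0.5)),
    if_false=Leaf(0.0, 0.5),
)


def lookup(node, context):
    """Follow the tests from the root to the leaf that applies to `context`."""
    while isinstance(node, Split):
        node = node.if_true if context[node.attribute] else node.if_false
    return node


def count_leaves(node):
    """Number of distinct distributions the tree stores."""
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.if_true) + count_leaves(node.if_false)


print(count_leaves(tree))
```

With two binary parents the full table has four rows, while this tree stores three leaves: when UV is false, Repair is never consulted, so both UV = No cases share one distribution.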
Modeling Context Specificity
[Schema: Gene (Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4); Experiment (Exp. cluster, Exp. type); Expression (Level)]
Grouping = a leaf in the tree
[Tree CPD for Level: root test Exp. Cluster = 2; further tests HSF = Yes, Lipid = Yes, and GCN4 = Yes on the true/false branches; the leaves shown are P(Level) distributions centered at -3, 2, 3, and 0]
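To make "grouping = a leaf in the tree" concrete, the sketch below routes every (gene, experiment) pair down a tree and collects the pairs that land on the same leaf. The tree layout, leaf names, and attribute values are hypothetical; the slide shows which tests appear in the tree but not its full shape.

```python
from collections import defaultdict

# Tree as nested tuples: (test, subtree_if_true, subtree_if_false).
# A plain string is a leaf. The layout is hypothetical.
TREE = (
    "exp_cluster_is_2",
    ("HSF", ("GCN4", "leaf_A", "leaf_B"), "leaf_C"),
    ("Lipid", "leaf_D", "leaf_E"),
)


def leaf_for(tree, context):
    """Return the leaf reached by answering each test from `context`."""
    while not isinstance(tree, str):
        test, if_true, if_false = tree
        tree = if_true if context[test] else if_false
    return tree


def groupings(tree, genes, experiments):
    """Group all (gene, experiment) pairs by the leaf they reach."""
    groups = defaultdict(list)
    for gene, gene_attrs in genes.items():
        for exp, exp_attrs in experiments.items():
            context = {**gene_attrs, **exp_attrs}
            groups[leaf_for(tree, context)].append((gene, exp))
    return dict(groups)


# Hypothetical attribute values for three genes and two experiments.
genes = {
    "g1": {"HSF": True, "GCN4": True, "Lipid": False},
    "g2": {"HSF": True, "GCN4": False, "Lipid": True},
    "g3": {"HSF": False, "GCN4": False, "Lipid": True},
}
experiments = {"e1": {"exp_cluster_is_2": True}, "e2": {"exp_cluster_is_2": False}}

result = groupings(TREE, genes, experiments)
print(result["leaf_D"])
```

A grouping therefore covers a subset of genes in a subset of experiments, and the same gene can belong to different groupings in different experiments.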
How do I learn these models?
Learning the Models
[Diagram: the LEARNER takes the expression data together with experimental details, annotations (GO, MIPS, YPD), and sequence data (ACGCCTA), and outputs the schema (Gene: Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4; Experiment: Exp. cluster, Exp. type; Expression: Level), a tree CPD with tests such as Exp. Cluster = 2, HSF = Yes, Lipid = Yes, and GCN4 = Yes, and the cluster parameters below]
GC  EC  Parameters
1   1   0.8   1.2
1   2   -0.7  0.6
2   1   0.8   1.2
2   2   -0.7  0.6
…   …
Learning Algorithm
Automatic induction
Structure learning (dependency structure, tree structure): Bayesian score, heuristic search
Missing data (gene cluster and experiment cluster are never observed): Expectation Maximization (EM)
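The role of EM for the never-observed cluster variables can be illustrated with a deliberately small stand-in: a one-dimensional mixture with two hidden clusters. This is not the talk's learning algorithm, which also searches over structure using a Bayesian score. The fixed shared standard deviation, equal cluster priors, starting means, and data values are all assumptions chosen to keep the sketch short.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_two_clusters(values, means, std=0.3, iterations=20):
    """Soft-assign values to two hidden clusters and re-estimate the means.

    Simplifications (assumptions): fixed shared std, equal cluster priors.
    """
    means = list(means)
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability that each value belongs to cluster 0.
        resp = []
        for x in values:
            p0 = gaussian_pdf(x, means[0], std)
            p1 = gaussian_pdf(x, means[1], std)
            resp.append(p0 / (p0 + p1))
        # M-step: responsibility-weighted mean for each cluster.
        w0 = sum(resp)
        w1 = len(values) - w0
        means[0] = sum(r * x for r, x in zip(resp, values)) / w0
        means[1] = sum((1.0 - r) * x for r, x in zip(resp, values)) / w1
    return means, resp


# Hypothetical expression levels: three induced, three repressed.
levels = [0.9, 1.1, 0.7, -0.8, -0.6, -0.9]
means, resp = em_two_clusters(levels, means=[0.5, -0.5])
print([round(m, 2) for m in means])
```

The E-step fills in the missing cluster labels with probabilities, and the M-step re-estimates parameters as if those soft labels were observed; the full model alternates the same two steps over gene clusters and experiment clusters.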
Learning Process
[Schema: Gene (Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4); Experiment (Exp. cluster, Exp. type); Expression (Level)]
Learning Process
Experiment similarity
[Tree for Level: split on Exp. Cluster = 2]
Learning Process
Gene similarity
[Tree for Level: split on Exp. Cluster = 2, then on Gene Cluster = Yes]
Learning Process
Separability by binding site
[Tree for Level: splits on Exp. Cluster = 2, Gene Cluster = Yes, and HSF = Yes]
Learning Process
Attribute dependencies: induce cluster changes
[Schema: HSF is now drawn apart from the other gene attributes (Lipid, Endoplasmatic, GCN4)]
[Tree for Level: splits on Exp. Cluster = 2, Gene Cluster = Yes, and HSF = Yes]
Learning Process
Achieved desired clustering
[Tree for Level: splits on Exp. Cluster = 2, Gene Cluster = Yes, HSF = Yes, and GCN4 = Yes (tested on two branches)]
Yeast Stress Data (Gasch et al 2001)
Measured response to stress conditions
92 arrays
We selected ~900 genes
Added data: TRANSFAC, MIPS
Results:
15 significant TFs
7 significant function categories
793 Groupings
Context Specific Groupings
Metabolism of amino acids
Transporter genes
Down in nitrogen depletion
Context Specific Groupings
Metabolism of nitrogen
Transporter genes
Up in Starvation, Nitrogen depletion & DTT
Example Biological Finding
Discovered grouping of 17 genes
All induced in diauxic shift
All have 2 binding sites for MIG1 transcription factor
Many not known to be regulated by MIG1
Context-sensitive groupings were key to finding cluster
Compendium Data (Hughes et al 2000)
300 samples of yeast deletion mutants
[Schema: Expression (Level); Gene (GCluster, Lipid, HSF, Endoplasmatic, GCN4); Array/Mutated Gene (ACluster, Lipid of mutated gene, GCluster of mutated gene)]
Resulting Bayesian Network
[Diagram: genes 1 to 4, each with Gene Cluster i and HSF i; arrays for the Gene 1 mutant and the Gene 3 mutant, each with an array cluster (Array cluster1, Array cluster3) and the Lipid attribute of its mutated gene (Lipid1, Lipid3); Level nodes connect genes and arrays]
Experimental Setup
Goal: predict the effect of mutating specific genes without performing the experiment (!)
Example: predicting the effect of mutating gene 4
Available information:
Attributes of gene 4 (Lipid4, HSF4)
Gene Cluster of gene 4 as a gene (Gene Cluster4)
[Diagram: Gene 4 mutant array with its Array cluster and other unknown values marked "?"]
Experimental Setup
[Diagram: the Bayesian network for the Gene 1 and Gene 3 mutants extended with a Gene 4 mutant array; Lipid4, HSF4, and Gene Cluster4 are known, while the Array cluster of the Gene 4 mutant is unknown (?)]
Results
Training set: 180 mutants
Test set: 20 mutants
44 arrays predicted at 99% confidence and 95% accuracy
Relational model is key to prediction
[Schema: Level with parents Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, Exp. type]
[Bar chart: Accuracy (%), axis from 0 to 100; PRMs: 95% accuracy]
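A figure of the form "N arrays predicted at 99% confidence and 95% accuracy" can be computed as below: keep only the predictions whose confidence clears the threshold, then measure accuracy on that subset. The function name and the prediction data are hypothetical; the slides do not say how confidence was defined, so a per-prediction probability is assumed.

```python
def accuracy_at_confidence(predictions, threshold):
    """Return (number of confident predictions, accuracy among them).

    `predictions` is a list of (confidence, predicted_label, true_label).
    Accuracy is None when no prediction reaches the threshold.
    """
    confident = [(pred, true) for conf, pred, true in predictions if conf >= threshold]
    if not confident:
        return 0, None
    correct = sum(1 for pred, true in confident if pred == true)
    return len(confident), correct / len(confident)


# Hypothetical predictions of an array cluster for five held-out arrays.
predictions = [
    (0.995, 1, 1),
    (0.999, 2, 2),
    (0.992, 1, 3),
    (0.80, 2, 2),
    (0.60, 3, 1),
]
n_confident, accuracy = accuracy_at_confidence(predictions, 0.99)
print(n_confident, accuracy)
```

Reporting both numbers matters: a high accuracy on very few confident predictions says less than the same accuracy on many.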
Conclusions
Presented a unified probabilistic framework:
Models complex biological domains
Expressive data organization
Incorporates heterogeneous data
Future directions:
Incorporate DNA and protein sequence data
Discover regulatory networks
Paper: http://www.cs.stanford.edu/~eran
Software (soon): http://dags.stanford.edu/bio
Contact: [email protected]
Thank You!