Rich Probabilistic Models for Gene Expression Eran Segal (Stanford) Ben Taskar (Stanford) Audrey Gasch (Berkeley) Nir Friedman (Hebrew University) Daphne Koller (Stanford)

Page 1:

Rich Probabilistic Models for Gene Expression

Eran Segal (Stanford)

Ben Taskar (Stanford)

Audrey Gasch (Berkeley)

Nir Friedman (Hebrew University)

Daphne Koller (Stanford)

Page 2:

Our Goals

Find patterns in gene expression data

Page 3:

Data Organization

[Figure: expression matrix with genes as rows (index i) and experiments as columns (index j); color legend: Induced / Repressed]

A_ij: mRNA level of gene i in experiment j

Page 4:

Standard Clustering Organization

[Figure: expression matrix, axes Genes and Experiments]

Page 5:

Bi-Clustering Organization

[Figure: expression matrix, axes Genes and Experiments, with a region labeled "Undetected Similarity"]

Page 6:

Desired Organization

Detect similarities over subsets of genes and experiments

Note: rows and columns no longer correspond to genes and experiments

Page 7:

Incorporate Heterogeneous Data

Data sources: Clinical information; Experimental Details; Annotations (GO, MIPS, YPD); ACGCCTA

Find correlations directly

Focus on novel discoveries

Page 8:

Our Approach

Data sources: Clinical information; Experimental Details; Annotations (GO, MIPS, YPD); ACGCCTA

LEARNER

[Figure: model over Level, Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, Exp. type]

hypotheses

Page 9:

Probabilistic Relational Models (Koller & Pfeffer 98; Friedman, Getoor, Koller & Pfeffer 99)

[Figure: PRM schema with classes Gene, Experiment and Expression, and attributes Gene Cluster, Exp. cluster and Level]

Page 10:

Resulting Bayesian Network

[Figure: PRM schema (classes Gene, Experiment, Expression; attributes Gene Cluster, Exp. cluster, Level) + data yields a Bayesian network over the variables below]

Gene Cluster_1, Gene Cluster_2, Gene Cluster_3

Exp. Cluster_1, Exp. Cluster_2

Level_{1,1}, Level_{1,2}, Level_{2,1}, Level_{2,2}, Level_{3,1}, Level_{3,2}
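The unrolling on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `ground_network` and the variable naming scheme are hypothetical, and the parent sets assume (as the CPD over GCluster and ECluster suggests) that each Level depends on the cluster of its gene and the cluster of its experiment.

```python
def ground_network(genes, experiments):
    """Map each ground variable to the list of its parents."""
    parents = {}
    for g in genes:
        parents[f"GeneCluster_{g}"] = []  # one hidden cluster variable per gene
    for e in experiments:
        parents[f"ExpCluster_{e}"] = []   # one hidden cluster variable per experiment
    for g in genes:
        for e in experiments:
            # Assumed dependency: Level depends on both cluster variables.
            parents[f"Level_{g},{e}"] = [f"GeneCluster_{g}", f"ExpCluster_{e}"]
    return parents


# Three genes and two experiments, as in the slide's figure.
network = ground_network(genes=[1, 2, 3], experiments=[1, 2])
for variable, variable_parents in network.items():
    print(variable, "<-", variable_parents)
```

The schema is written once; the number of ground variables grows with the number of genes and experiments in the data.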

Page 11:

Probabilistic Relational Models

[Figure: PRM schema with classes Gene, Experiment and Expression, and attributes Gene Cluster, Exp. cluster and Level]

CPD

GCluster  ECluster
1         1          0.8   1.2
1         2         -0.7   0.6

[Figure: two plots of P(Level) against Level, one centered at 0.8 and one centered at -0.7]
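A minimal Python sketch of how such a table CPD could be evaluated. The two table rows are taken from the slide; treating the first number as a Gaussian mean matches the plots, but reading the second number as a standard deviation is an assumption (the slide does not label it), and the name `level_density` is hypothetical.

```python
import math

# Table CPD: (gene cluster, experiment cluster) -> (mean, spread).
# Assumption: the spread is a standard deviation; the slide leaves it unlabeled.
CPD = {
    (1, 1): (0.8, 1.2),
    (1, 2): (-0.7, 0.6),
}


def level_density(level, gcluster, ecluster):
    """Gaussian density of an expression level given both cluster assignments."""
    mean, std = CPD[(gcluster, ecluster)]
    z = (level - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# The density peaks at the mean of the selected row.
peak = level_density(-0.7, gcluster=1, ecluster=2)
off_peak = level_density(0.8, gcluster=1, ecluster=2)
print(peak > off_peak)
```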

Page 12:

Adding Heterogeneous Data

[Figure: PRM schema (classes Gene, Experiment, Expression; attributes Gene Cluster, Exp. cluster, Level) extended with the attribute groups below]

Annotations: Lipid, Endoplasmatic

Binding sites: HSF, GCN4

Experimental details: Exp. type

Page 13:

Resulting Bayesian Network

[Figure: PRM schema (Level, Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, Exp. type) + data (Experimental Details; Annotations (GO, MIPS, YPD); ACGCCTA) yields a Bayesian network over the variables below]

For each gene i = 1, 2, 3: Gene Cluster_i, Lipid_i, HSF_i, Endoplasmatic_i, GCN4_i

For each experiment j = 1, 2: Exp. cluster_j, Exp. type_j

For each gene/experiment pair: Level_{1,1}, Level_{1,2}, Level_{2,1}, Level_{2,2}, Level_{3,1}, Level_{3,2}

Page 14:

Problem: Exponential Blowup

[Figure: PRM schema with Level depending on Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster and Exp. type]

GC  LP  END  HSF  EC  TYP
1   No  No   No   1   1    0.8   1.2
1   No  No   No   1   2    0.7   0.6

6 parents: 2^6 cases

k parents: 2^k cases!
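The count on this slide is simple arithmetic, shown here as a short Python sketch. The helper `table_cpd_rows` is a hypothetical name; it assumes every parent takes the same number of values (two, as on the slide), whereas cluster attributes may take more.

```python
def table_cpd_rows(num_parents, values_per_parent=2):
    """Number of parent combinations a full table CPD must enumerate."""
    return values_per_parent ** num_parents


# Each added binary parent doubles the number of cases.
for k in (2, 6, 10, 20):
    print(k, "parents:", table_cpd_rows(k), "cases")
```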

Page 15:

Solution: Context Specificity

Ultra Violet Light -> DNA Damage -> DNA repair genes transcribed

[Figure: schema with Gene (DNA repair), Experiment (UV Light) and Expression (Level)]

[Figure: four plots of Level, one for each combination of UV = No / UV = Yes and Repair = Yes / Repair = No]

Page 16:

Solution: Context Specificity

Ultra Violet Light -> DNA Damage -> DNA repair genes transcribed

[Figure: schema with Gene (DNA repair), Experiment (UV Light) and Expression (Level)]

[Figure: plots of Level grouped under UV = No and UV = Yes]

Page 17:

Solution: Context Specificity

Ultra Violet Light -> DNA Damage -> DNA repair genes transcribed

[Figure: schema with Gene (DNA repair), Experiment (UV Light) and Expression (Level)]

[Figure: tree over Level with the test UV = Yes (true / false) and, below it, the test Repair = Yes (true / false)]

Page 18:

Modeling Context Specificity

[Figure: PRM schema with Level depending on Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster and Exp. type]

[Figure: tree CPD for Level. Internal nodes test Exp. Cluster = 2, HSF = Yes, Lipid = Yes and GCN4 = Yes (twice), each with true / false branches; some subtrees are elided (. . .). The leaves shown hold plots of P(Level) centered at -3, 2, 3 and 0]

Grouping = a leaf in the tree
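A tree CPD of this kind can be sketched in Python as follows. This is an illustration under stated assumptions: the class and function names are hypothetical, the arrangement of tests in `tree` is invented (the slide's exact branching is not recoverable from the transcript), and the leaf spreads are placeholders because the slide shows only where each P(Level) is centered.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    mean: float  # center of P(Level) for this grouping
    std: float   # placeholder spread; not given on the slide


@dataclass
class Split:
    attribute: str   # parent attribute tested at this node
    value: object    # the test is "attribute == value"
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Split]


def find_leaf(node, context):
    """Follow the tests from the root down to the leaf (grouping) for one context."""
    while isinstance(node, Split):
        node = node.if_true if context[node.attribute] == node.value else node.if_false
    return node


def count_leaves(node):
    """Number of groupings, i.e. distinct distributions the tree stores."""
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.if_true) + count_leaves(node.if_false)


# Hypothetical arrangement of the tests named on the slide.
tree = Split(
    "exp_cluster", 2,
    Split("hsf", "yes",
          Leaf(-3.0, 1.0),
          Split("gcn4", "yes", Leaf(2.0, 1.0), Leaf(0.0, 1.0))),
    Split("lipid", "yes", Leaf(3.0, 1.0), Leaf(0.0, 1.0)),
)

context = {"exp_cluster": 2, "hsf": "yes", "gcn4": "no", "lipid": "no"}
print(find_leaf(tree, context).mean, count_leaves(tree))
```

In this sketch four attributes are tested but only five leaves are stored, because a parent is consulted only in the contexts where it matters.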

Page 19:

How do I learn these models?

Page 20:

Learning the Models

Data sources: Experimental Details; Annotations (GO, MIPS, YPD); ACGCCTA

LEARNER

[Figure: PRM schema with attributes Level, Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster and Exp. type (classes Gene, Experiment, Expression)]

[Figure: tree CPD with tests Exp. Cluster = 2, HSF = Yes, Lipid = Yes and GCN4 = Yes (twice); subtrees elided (. . .)]

GC  EC
…   …
1   1    0.8   1.2
1   2   -0.7   0.6
2   1    0.8   1.2
2   2   -0.7   0.6

Page 21:

Automatic Induction

Structure Learning: dependency structure, tree structure

Handled by: Bayesian score, heuristic search

Missing Data: gene cluster & experiment cluster never observed

Handled by: Expectation Maximization (EM)

Combined into the Learning Algorithm
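To illustrate only the missing-data half of this slide, here is a minimal EM sketch in Python for a hidden gene cluster. It is not the learning algorithm of the talk: it fits a plain mixture of unit-variance Gaussians over each gene's expression vector, with no experiment clusters, no tree CPDs and no structure search. The function name `em_gene_clusters`, the fixed unit variance, the deterministic initialization and the toy data are all assumptions made for the example.

```python
import math


def em_gene_clusters(levels, k, iterations=20):
    """Soft EM for k gene clusters; levels[i][j] is the level of gene i in experiment j."""
    n, m = len(levels), len(levels[0])
    # Arbitrary deterministic initialization: k evenly spaced genes serve as cluster means.
    means = [list(levels[(c * n) // k]) for c in range(k)]
    priors = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: posterior over the hidden cluster of each gene (unit variance assumed).
        resp = []
        for row in levels:
            log_p = [
                math.log(priors[c]) - 0.5 * sum((x - mu) ** 2 for x, mu in zip(row, means[c]))
                for c in range(k)
            ]
            top = max(log_p)
            weights = [math.exp(v - top) for v in log_p]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate cluster priors and means from the soft assignments.
        for c in range(k):
            weight = sum(r[c] for r in resp)
            priors[c] = max(weight / n, 1e-12)  # floor avoids log(0) for empty clusters
            if weight > 0.0:
                means[c] = [
                    sum(r[c] * row[j] for r, row in zip(resp, levels)) / weight
                    for j in range(m)
                ]
    return means, resp


# Toy data (made up): three genes induced, then repressed, and three with the reverse pattern.
levels = [
    [2.0, 2.1, -1.9], [1.9, 2.0, -2.0], [2.1, 1.8, -2.1],
    [-2.0, -2.1, 2.0], [-1.9, -2.0, 2.1], [-2.1, -1.8, 1.9],
]
means, resp = em_gene_clusters(levels, k=2)
hard = [max(range(2), key=lambda c: r[c]) for r in resp]
print(hard)
```

The E-step fills in the never-observed cluster variable with a posterior, and the M-step re-estimates parameters as if those soft assignments were data.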

Page 22:

Learning Process

[Figure: PRM schema with attributes Level, Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster and Exp. type (classes Gene, Experiment, Expression)]

Page 23:

Learning Process

Experiment Similarity

[Figure: PRM schema (Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, Exp. type) with a tree CPD for Level containing the test Exp. Cluster = 2]

Page 24:

Learning Process

Gene Similarity

[Figure: PRM schema (Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, Exp. type) with a tree CPD for Level containing the tests Exp. Cluster = 2 and Gene Cluster = Yes]

Page 25:

Learning Process

Separability by binding site

[Figure: PRM schema (Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster, Exp. type) with a tree CPD for Level containing the tests Exp. Cluster = 2, HSF = Yes and Gene Cluster = Yes; subtrees elided (. . .)]

Page 26:

Learning Process

Attribute dependencies: induce cluster changes

[Figure: PRM schema (Gene Cluster, Lipid, Endoplasmatic, GCN4, Exp. cluster, Exp. type), with HSF drawn separately, and a tree CPD for Level containing the tests Exp. Cluster = 2, HSF = Yes and Gene Cluster = Yes; subtrees elided (. . .)]

Page 27:

Learning Process

Achieved desired clustering

[Figure: PRM schema (Gene Cluster, Lipid, Endoplasmatic, GCN4, Exp. cluster, Exp. type), with HSF drawn separately, and a tree CPD for Level containing the tests Exp. Cluster = 2, HSF = Yes, GCN4 = Yes (twice) and Gene Cluster = Yes; subtrees elided (. . .)]

Page 28:

Yeast Stress Data (Gasch et al 2001)

Measured response to stress conditions

92 arrays

We selected ~900 genes

Added data: TRANSFAC, MIPS

Results:

15 significant TFs

7 significant function categories

793 Groupings

Page 29:

Context Specific Groupings

Metabolism of amino acids

Transporter genes

Down in nitrogen depletion

Page 30:

Context Specific Groupings

Metabolism of nitrogen

Transporter genes

Up in Starvation, Nitrogen depletion & DTT

Page 31:

Example Biological Finding

Discovered grouping of 17 genes

All induced in diauxic shift

All have 2 binding sites for MIG1 transcription factor

Many not known to be regulated by MIG1

Context-sensitive groupings were key to finding cluster

Page 32:

Compendium Data (Hughes et al 2000)

300 samples of yeast deletion mutants

[Figure: PRM schema. Gene: GCluster, Lipid, HSF, Endoplasmatic, GCN4. Array/Mutated Gene: ACluster, Lipid (of mutated gene), GCluster (of mutated gene). Expression: Level]

Page 33:

Resulting Bayesian Network

[Figure: Bayesian network for Gene 1 through Gene 4 and the Gene 1 mutant and Gene 3 mutant arrays. Nodes: Gene Cluster_1 to Gene Cluster_4, HSF_1 to HSF_4, Lipid_1, Lipid_3, Array cluster_1, Array cluster_3, and a Level node for each gene/array pair]

Page 34:

Experimental Setup

Goal: predict the effect of mutating specific genes without performing the experiment (!)

Example: predicting the effect of mutating gene 4

Available information:

Attributes of gene 4 (Lipid_4, HSF_4)

Gene Cluster of gene 4 as a gene (Gene Cluster_4)

[Figure: Gene 4 mutant array whose Array cluster and Level values are unknown (?)]

Page 35:

Experimental Setup

[Figure: Bayesian network over Gene Cluster_1 to Gene Cluster_4, HSF_1 to HSF_4, Lipid_1, Lipid_3, Array cluster_1, Array cluster_3 and the Level nodes of the Gene 1 mutant and Gene 3 mutant arrays, extended with a Gene 4 mutant array. Lipid_4 is shown for the new array; its Array cluster and Level values are unknown (?)]

Page 36:

Results

Training set: 180 mutants

Test set: 20 mutants

44 arrays predicted at 99% confidence and 95% accuracy

Relational model is key to prediction

[Figure: model over Level, Gene Cluster, Lipid, HSF, Endoplasmatic, GCN4, Exp. cluster and Exp. type]

[Figure: bar chart of Accuracy (%) on an axis from 0 to 100; the PRMs bar is labeled 95% accuracy]

Page 37:

Conclusions

Presented a unified probabilistic framework:

Models complex biological domains

Expressive data organization

Incorporates heterogeneous data

Future directions:

Incorporate DNA and protein sequence data

Discover regulatory networks

Paper: http://www.cs.stanford.edu/~eran

Software (soon): http://dags.stanford.edu/bio

Contact: [email protected]

Thank You!