Advanced Algorithms and Models for Computational Biology (epxing/Class/10810-06/lecture24.pdf)
Advanced Algorithms and Models for Computational Biology: a machine learning approach

Systems Biology: Inferring gene regulatory networks using graphical models

Eric Xing, Lecture 24, April 17, 2006
Gene regulatory networks
Regulation of gene expression is crucial: genes control cell behavior by controlling which proteins are made by a cell, and when and where they are delivered.
Regulation occurs at many stages:
pre-transcriptional (chromatin structure)
transcription initiation
RNA editing (splicing) and transport
translation initiation
post-translational modification
RNA & protein degradation
Understanding regulatory processes is a central problem of biological research.
Inferring gene regulatory networks
Expression network
has gotten the most attention so far; many algorithms exist, but it remains algorithmically challenging
Protein-DNA interaction network
actively pursued recently; some interesting methods are available, with lots of room for algorithmic advance
Inferring gene regulatory networks
Network of cis-regulatory pathways
Success stories in sea urchin, fruit fly, etc., from decades of experimental research; statistical modeling and automated learning have just started
1: Expression networks
Early work: clustering of expression data
Groups together genes with similar expression patterns
Disadvantage: does not reveal structural relations between genes
Boolean networks
Deterministic models of the logical interactions between genes
Disadvantage: deterministic, static
Deterministic linear models
Disadvantage: under-constrained, capture only linear interactions
The challenge:
Extract biologically meaningful information from the expression data
Discover genetic interactions based on statistical associations among the data
Currently dominant methodology: probabilistic networks
Probabilistic Network Approach
Characterize stochastic (non-deterministic!) relationships between expression patterns of different genes
Beyond pair-wise interactions => structure!
Many interactions are explained by intermediate factors
Regulation involves combined effects of several gene products
Flexible in terms of the types of interactions (not necessarily linear or Boolean!)
Probabilistic graphical models
• Given a graph G = (V, E), where each node v ∈ V is associated with a random variable Xv
• The joint distribution on (X1, X2, ..., XN) factors according to the "parent-of" relations defined by the edges E. For the example DAG with edges X1→X2, X2→X3, X1→X4, X4→X5, and {X2, X5}→X6:
p(X1, X2, X3, X4, X5, X6) = p(X1) p(X2|X1) p(X3|X2) p(X4|X1) p(X5|X4) p(X6|X2, X5)
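The parent-of factorization above can be sketched in a few lines of code. This is a minimal illustration with made-up binary CPDs (the probability numbers are assumptions, not from the lecture); the structure of the six-node example DAG is taken from the slide.

```python
from itertools import product

# Each CPD maps a tuple of parent values -> probability that the child is "on" (1).
# All probability values below are illustrative, not from any real dataset.
cpds = {
    "X1": {(): 0.6},
    "X2": {(0,): 0.2, (1,): 0.7},   # parent: X1
    "X3": {(0,): 0.1, (1,): 0.8},   # parent: X2
    "X4": {(0,): 0.3, (1,): 0.9},   # parent: X1
    "X5": {(0,): 0.4, (1,): 0.6},   # parent: X4
    "X6": {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.95},  # parents: X2, X5
}
parents = {"X1": [], "X2": ["X1"], "X3": ["X2"],
           "X4": ["X1"], "X5": ["X4"], "X6": ["X2", "X5"]}

def joint_prob(assignment):
    """Joint probability = product of local conditionals, one factor per node."""
    p = 1.0
    for node, pa in parents.items():
        p_on = cpds[node][tuple(assignment[q] for q in pa)]
        p *= p_on if assignment[node] == 1 else 1.0 - p_on
    return p

# Sanity check: the factored joint must sum to 1 over all 2^6 assignments.
total = sum(joint_prob(dict(zip(parents, bits)))
            for bits in product([0, 1], repeat=6))
```

The point of the factorization is economy: six full-joint tables would need 2^6 - 1 numbers, while the CPDs above need only 13.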
Probabilistic Inference
Computing statistical queries regarding the network, e.g.:
Is node X independent of node Y given nodes Z, W?
What is the probability of X = true if (Y = false and Z = true)?
What is the joint distribution of (X, Y) if R = false?
What is the likelihood of some full assignment?
What is the most likely assignment of values to all, or a subset of, the nodes of the network?
General-purpose algorithms exist to fully automate such computation
Computational cost depends on the topology of the network
Exact inference: the junction tree algorithm
Approximate inference: loopy belief propagation, variational inference, Monte Carlo sampling
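For small networks, queries like these can be answered by brute-force enumeration over the joint. A minimal sketch on a toy 3-gene cascade A → B → C (all CPD numbers are illustrative assumptions; exact engines such as the junction tree algorithm do the same computation efficiently on large networks):

```python
from itertools import product

# Toy cascade A -> B -> C with made-up CPDs.
p_a1 = 0.5                          # P(A = 1)
p_b1_given = {1: 0.9, 0: 0.2}       # P(B = 1 | A)
p_c1_given = {1: 0.8, 0: 0.1}       # P(C = 1 | B)

def joint(a, b, c):
    """Joint probability of one full assignment, via the chain factorization."""
    pa = p_a1 if a else 1 - p_a1
    pb = p_b1_given[a] if b else 1 - p_b1_given[a]
    pc = p_c1_given[b] if c else 1 - p_c1_given[b]
    return pa * pb * pc

# Query: P(C = 1 | A = 1) = sum_b joint(1, b, 1) / sum_{b,c} joint(1, b, c)
num = sum(joint(1, b, 1) for b in (0, 1))
den = sum(joint(1, b, c) for b, c in product((0, 1), repeat=2))
posterior = num / den   # = 0.9*0.8 + 0.1*0.1 = 0.73 for these toy CPDs
```

Enumeration is exponential in the number of nodes, which is exactly why topology-aware exact and approximate algorithms matter.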
Two types of GMs
Directed edges give causality relationships (Bayesian Network or Directed Graphical Model):
Undirected edges simply give correlations between variables (Markov Random Field or Undirected Graphical model):
[Figure: a directed graph around a node X, whose Markov blanket consists of its parents, its children (Y1, Y2), and the children's co-parents; ancestors and descendants lie outside the blanket]
Bayesian Network
Structure: DAG
Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket
The local conditional distributions (CPDs) and the DAG completely determine the joint distribution.
[Figures: the three local structures over genes A, B, C: common parent (A ← B → C), cascade (A → B → C), and v-structure (A → C ← B)]
Local Structures & Independencies
Common parent
Fixing B decouples A and C
"given the level of gene B, the levels of A and C are independent"
Cascade
Knowing B decouples A and C
"given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C"
V-structure
Knowing C couples A and B, because A can "explain away" B w.r.t. C
"if A correlates to C, then the chance for B to also correlate to C will decrease"
The language is compact, the concepts are rich!
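The "explaining away" behavior of the v-structure can be verified numerically. A minimal sketch with made-up prior and CPD numbers (assumptions for illustration only): A and B are marginally independent causes of C, yet become coupled once C is observed.

```python
from itertools import product

# V-structure A -> C <- B with illustrative parameters.
p_a, p_b = 0.3, 0.3
p_c = {(0, 0): 0.05, (1, 0): 0.8, (0, 1): 0.8, (1, 1): 0.9}  # P(C=1 | A, B)

def joint(a, b, c):
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c[(a, b)] if c else 1 - p_c[(a, b)]
    return pa * pb * pc

def cond_a1(evidence):
    """P(A = 1 | evidence) by enumeration; evidence is a dict over {'b', 'c'}."""
    num = den = 0.0
    for a, b, c in product([0, 1], repeat=3):
        if any((k == 'b' and b != v) or (k == 'c' and c != v)
               for k, v in evidence.items()):
            continue
        den += joint(a, b, c)
        if a == 1:
            num += joint(a, b, c)
    return num / den

p_given_c = cond_a1({'c': 1})                 # A is likely once C is on
p_given_c_and_b = cond_a1({'c': 1, 'b': 1})   # ...less so once B "explains" C
```

With these numbers, observing B = 1 lowers the posterior on A (explaining away), while without any evidence on C the two causes stay independent.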
Why Bayesian Networks?
Sound statistical foundation and intuitive probabilistic semantics
Compact and flexible representation of the (in)dependency structure of multivariate distributions and interactions
Natural for modeling global processes with local interactions => good for biology
Natural for statistical confidence analysis of results and answering of queries
Stochastic in nature: models stochastic processes & deals with ("sums out") noise in measurements
General-purpose learning and inference
Captures causal relationships
Possible Biological Interpretation
Random variables (nodes): measured expression level of each gene
Probabilistic dependencies (edges): gene interactions
Common parent A ← B → C: common cause
Cascade A → B → C: intermediate gene
V-structure A → C ← B: common/combinatorial effects
Dynamic Bayesian Networks (Ong et. al)
Temporal series vs. "static" model: the same genes 1, 2, 3, ..., N are replicated across time slices, with edges running from the genes at one time point to the genes at the next. [Figure: the unrolled dynamic Bayesian network]
More directed probabilistic networks
Module network (Segal et al.): partition variables into modules that share the same parents and the same CPD.
Probabilistic Relational Models (Segal et al.): data fusion, integrating related data from multiple sources.
Bayesian Network – CPDs
[Figure: the classic alarm network: Earthquake and Burglary are parents of Alarm, Earthquake is a parent of Radio, Alarm is a parent of Call; each node carries a CPD table such as P(A | E, B)]
Local Probabilities: CPD - conditional probability distribution P(Xi|Pai)
Discrete variables: Multinomial Distribution (can represent any kind of statistical dependency)
Bayesian Network – CPDs (cont.)
Continuous variables: e.g. linear Gaussian:
P(X | Y1, ..., Yk) ~ N(a0 + Σ_{i=1..k} ai Yi, σ²)
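A linear-Gaussian CPD is easy to simulate and to fit: learning its parameters from data is exactly linear regression. A minimal sketch with assumed coefficients (a0 = 1.0, a = (2.0, -0.5), σ = 0.3 are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear-Gaussian CPD: P(X | Y1, Y2) = N(a0 + a1*Y1 + a2*Y2, sigma^2).
a0, a, sigma = 1.0, np.array([2.0, -0.5]), 0.3           # illustrative values
Y = rng.normal(size=(5000, 2))                            # parent expression levels
X = a0 + Y @ a + rng.normal(scale=sigma, size=5000)       # child given parents

# Fitting the CPD reduces to least-squares regression of X on (1, Y1, Y2).
design = np.column_stack([np.ones(len(Y)), Y])
coef, *_ = np.linalg.lstsq(design, X, rcond=None)         # approximately [a0, a1, a2]
```

With 5000 samples the recovered coefficients land very close to the generating ones, which is the sense in which CPD learning is "easy" once the structure is fixed.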
Learning Bayesian Networks
The goal: given a set of independent samples (assignments of the random variables), find the best (the most likely?) Bayesian network (both DAG and CPDs), e.g.:
(B,E,A,C,R) = (T,F,F,T,F)
(B,E,A,C,R) = (T,F,T,T,F)
...
(B,E,A,C,R) = (F,T,T,T,F)
• Learning the best CPDs given the DAG is easy: collect statistics of the values of each node given specific assignments to its parents
• Learning the graph topology (structure) is NP-hard: heuristic search must be applied, and generally leads to a locally optimal network
• Overfitting: it turns out that richer structures give higher likelihood P(D|G) to the data (adding an edge is always preferable):
P(C | A) ≤ P(C | A, B)
more parameters to fit => more freedom => there always exists a more "optimal" CPD(C)
• We prefer simpler (more explanatory) networks: practical scores regularize the likelihood improvement of complex networks.
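The overfitting point can be made concrete with a BIC-style regularized score (a standard choice; the specific score used by any particular algorithm may differ). In this sketch, synthetic data are generated where gene A drives gene C and gene B is irrelevant: the raw likelihood still never prefers dropping B as a parent, while the penalty term grows with the number of parameters.

```python
import numpy as np
from math import log

rng = np.random.default_rng(1)
n = 2000
# Synthetic binary data (illustrative): A drives C, B is pure noise.
A = rng.integers(0, 2, n)
B = rng.integers(0, 2, n)
C = np.where(rng.random(n) < np.where(A == 1, 0.8, 0.2), 1, 0)

def loglik(child, parent_cols):
    """Max log-likelihood of a multinomial CPD for a binary child, plus the
    number of free parameters (one per observed parent configuration)."""
    ll, k = 0.0, 0
    cfgs = ({tuple(r) for r in np.column_stack(parent_cols)}
            if parent_cols else {()})
    for cfg in cfgs:
        mask = np.ones(n, bool)
        for col, v in zip(parent_cols, cfg):
            mask &= col == v
        m, ones = mask.sum(), child[mask].sum()
        p = ones / m
        if 0 < p < 1:   # degenerate cells contribute exactly 0 to the log-lik
            ll += ones * log(p) + (m - ones) * log(1 - p)
        k += 1
    return ll, k

ll_simple, k_simple = loglik(C, [A])        # C with parent A
ll_rich, k_rich = loglik(C, [A, B])         # C with parents A, B
bic_simple = ll_simple - 0.5 * k_simple * log(n)
bic_rich = ll_rich - 0.5 * k_rich * log(n)
# Likelihood alone always favors the richer structure; BIC's penalty resists it.
```

Refining the parent set can only partition the data more finely, so ll_rich ≥ ll_simple always holds; the penalty is what lets a score prefer the simpler, more explanatory network.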
BN Learning Algorithms
Expression data → learning algorithm
Structural EM (Friedman 1998): the original algorithm
Sparse Candidate algorithm (Friedman et al.):
Discretizing array signals
Hill-climbing search using local operators: add/delete/swap of a single edge
Feature extraction: Markov relations, order relations
Re-assemble high-confidence sub-networks from features
Module network learning (Segal et al.):
Heuristic search of structure in a "module graph"
Module assignment
Parameter sharing
Prior knowledge: possible regulators (TF genes)
Confidence Estimates
Bootstrap approach: resample the data D (with replacement) into datasets D1, D2, ..., Dm, and learn a network Gi from each Di.
Estimate the "confidence level" of a feature f as the fraction of bootstrap networks that contain it:
C(f) = (1/m) Σ_{i=1..m} 1{f ∈ Gi}
11
The initially learned network of ~800 genes
KAR4
AGA1PRM1TEC1
SST2
STE6
KSS1NDJ1
FUS3AGA2
YEL059W
TOM6 FIG1YLR343W
YLR334C MFA1
FUS1
KAR4
AGA1PRM1
KAR4
AGA1PRM1TEC1
SST2
STE6
KSS1NDJ1TEC1
SST2
STE6
KSS1NDJ1
FUS3AGA2
YEL059W
TOM6 FIG1 FUS3AGA2
YEL059W
TOM6 FIG1YLR343W
YLR334C MFA1
FUS1
YLR343W
YLR334C MFA1
FUS1
The “mating response” substructure
Results from SCA + feature extraction (Friedman et al.)
Nature Genetics 34, 166 - 176 (2003)
A Module Network
12
Why?
Sometimes an UNDIRECTEDassociation graph makes more sense and/or is more informative
gene expressions may be influenced by unobserved factor that are post-transcriptionallyregulated
The unavailability of the state of B results in a constrain over A and C
BA C
BA C
BA C
Gaussian Graphical Models
Multivariate Gaussian over all continuous expressions
The precision matrix K=Σ−1 reveals the topology of the (undirected) network
Edge ~ |Kij| > 0
Learning Algorithm: Covariance selectionWant a sparse matrix
Regression for each node with degree constraint (Dobra et al.)Regression for each node with hierarchical Bayesian prior (Li, et al)
∑ )K/K(=)|( -j
jiiijii xxxE
{ })-()-(-exp||)2(
1=]),...,([ 1-
21
1 21
2µµ
πxxxxp T
n n
rrΣ
Σ
Covariance Selection
13
A comparison of BN and GGM:
GG
M
BN
Gene modules from identified using GGM (Li, Yang and Xing)
BA C
BA Cand
2: Protein-DNA Interaction Network
Expression networks are not necessarily causalBNs are indefinable only up to Markov equivalence:
can give the same optimal score, but not further distinguishable under a likelihood score unless further experiment from perturbation is performed
GGM have yields functional modules, but no causal semantics
TF-motif interactions provide direct evidence of casual, regulatory dependencies among genes
stronger evidence than expression correlationsindicating presence of binding sites on target gene -- more easily verifiabledisadvantage: often very noisy, only applies to cell-cultures, restricted to known TFs ...
14
Simon I et al (Young RA) Cell 2001(106):697-708Ren B et al (Young RA) Science 2000 (290):2306-2309
Advantages:- Identifies “all” the sites where a TF
binds “in vivo” under the experimental condition.
Limitations:- Expense: Only 1 TF per experiment- Feasibility: need an antibody for the TF.- Prior knowledge: need to know what TF
to test.
ChIP-chip analysis
Often reduced to a clustering problem
A transcriptional module can be defined as a set of genes that are:1. co-bound (bound by the same set of TFs, up to limits of experimental noise) and2. co-expressed (has the same expression pattern, up to limits of experimental noise).
Genes in the same module are likely to be co-regulated, and hence likely have a common biological function.
[Horak, et al, Genes & Development, 16:3017-3033]
Transcription network
15
Learning Algorithm
Array/motif/chip data
+ ++ +Clustering + Post-processing
SAMBA (Tanay et al.)
GRAM (Genetic RegulAtory Modules) (Bar-Joseph et al.)
Bayesian network
Joint clustering model for mRNA signal, motif, and binding score (Segal, et al.)
Using binding data to define Informative structural prior (Bernard & Hartemink)(the only reported work that directly learn the network)
Inferring transcription network
conditions
gene
s
edge
no edge
G=(U,V,E)Goal : Find high
similarity submatricesGoal : Find dense
subgraphs
The GE (gene-condition) matrix
The SAMBA model
16
Background Random Graph Model
Bicluster Random Subgraph Model Likelihood model
translates to sum of weights over edges
and non edges=
The SAMBA approach
Normalization: translate gene/experiment matrix to a weighted bipartite graph using a statistical model for the data
Bicluster model: Heavy subgraphsCombined hashing and local optimization (Tanay et al.): incrementally finds heavy-weight biclusters that can partially overlap
Other algorithms are also applicable, e.g., spectral graph cut, SDP: but they mostly finds disjoint clusters
From experiments to properties:
StrongInduction
MediumInduction
MediumRepression
StrongRepression
p1 p2 p3 p4
StrongBinding toTF T
MediumBinding toTF T
HighSensitivity
MediumSensitivity
High ConfidenceInteraction
Medium ConfidenceInteraction
p1
Strong complex binding toprotein Pp2
Medium complex binding toProtein P
p1 p2 p1 p2 p1 p2
gene g
Date pre-processing
17
Gene
s
Properties
GO annotationsCPA1 CPA2
A SAMBA module
ProteinBiosynthesis
RNAProcessing
RibosomBiogenesys
StressRap1
Fhl1
STRE
AP-1AGGGG
CGATGAG
AAAATTTT
Post-clustering analysis
Using topological properties to identify genes connecting several processes.
Specialized groups of genes (transport, signaling) are more likely to bridge modules.
Inferring the transcription networkGenes in modules with expression properties should be driven by at least one common TFThere exist gaps between information sources (condition specific TF binding?)Enriched binding motifs help identifying “missing TF profiles”
18
Cluster genes that co-express
Group genes bound by (nearly) the same set of TFs
Statistically significant union
The GRAM model
rr
1011g6
1111g5
0000g4
1111g3
1101g2
1011g1
Gcn4Leu3Arg81Arg80
1011g6
1111g5
0000g4
1111g3
1101g2
1011g1
Gcn4Leu3Arg81Arg80
√√
√
√
√
x
.0001.5.00007.00001g6
.0002.0001.00001.00001g5
.70.04.2.007g4
.0002.15.002.0007g3
.0001.020.0006.00002g2
.0004.33.00003.0004g1
Gcn4Leu3Arg81Arg80
.0001.5.00007.00001g6
.0002.0001.00001.00001g5
.70.04.2.007g4
.0002.15.002.0007g3
.0001.020.0006.00002g2
.0004.33.00003.0004g1
Gcn4Leu3Arg81Arg80
√
The GRAM algorithmExhaustively search all subsets of TFs(starting with the largest sets); find genes bound by subset F with significance p: G(F,p)
Find an r-tight gene cluster with centroid c*: B(c*,r)|, s.t.
c* = argmaxc |G(F,p) ∩ B(c,r)|
Some local massage to further improve cluster significance
19
Regulatory Interactions revealed by GRAM
TF Binding sites
p-value for localization assay
regulation indicator
expression level
experimentalconditions
The Full Bayesian Network Approach
20
The Full BN Approach, cond.
Introduce random variables for each observations, and latent states/events to be modeled
A network with hidden variables: regulation indicator and binding sites not observed
Results in a very complex model that integrates motif search, gene activation/repression function, mRNA level distribution, ...
Advantage: in principle allow heterogeneous information to flow and cross influence to produce a globally optimal and consistent solutionDisadvantage: in practice, a very difficult statistical inference and learning problem. Lots of heuristic tricks are needed, computational solutions are usually not a globally optimum.
E
R
B
A
C
E
R
B
A
C
p(Θ )
• Bayes’ rule: )()|()|( ΘΘ∝Θ ppp AAposterior likelihood prior
• Bayesian estimation: ΘΘΘ=Θ ∫ dpBayes )|( A
A Bayesian Learning ApproachFor model p(A|Q ):
Treat parameters/model Q as unobserved random variable(s) probabilistic statements of Q is conditioned on the values of the observed variables Aobs
21
Informative Structure Prior (Bernard & Hartemink)
If we observe TF i binds Gene j from location assay, then the probability of have an edge between node i and j in a expression network is increased
This leads to a prior probability of an edge being present:
This is like adding a regularizer (a compensating score) to the originally likelihood score of an expression network
The regularized score is now ...
Informative Structure Prior (Bernard & Hartemink)
The regularized score of an expression network under an informative structure prior is:
Score = P(array profile|Chip profile) = logP(array profile|S) + logP(S)
Now we run the usual structural learning algorithm as in expression network, but now optimizing
logP(array profile|S) + logP(S), instead logP(array profile|S)
22
We've found a network…now what?
How can we validate our results?
Literature.Curated databases (e.g., GO/MIPS/TRANSFAC).Other high throughput data sources.“Randomized” versions of data.New experiments.Active learning: proposing the "most informative" new experiments
What is next?
We are still far away from being able to correctly infer cis-regulatory network in silico, even in a simple multi-cellular organism
Open computational problems: Causality inference in networkAlgorithms for identifying TF bindings motifs and motif clusters directly from genomic sequences in higher organismsTechniques for analyzing temporal-spatial patterns of gene expression (i.e., whole embryo in situ hybridization pattern)Probabilistic inference and learning algorithms that handle large networks
Experimental limitation:Binding location assay can not yet be performed inexpensively in vivo in multi-cellular organismsMeasuring the temporal-spatial patterns of gene expression in multi-cellular organisms is not yet high-throughput Perturbation experiment still low-throughput
23
Networks of cis-regulatory pathway
This strategy has been a hot trend, but faces many algorithmic challenges
More "Direct" Approaches
Principle:using "direct" structural/sequence/biochemical level evidences of network connectivity, rather than "indirect" statistical correlations of network products, to recover the network topology.
Algorithmic approachSupervised and de novo motif searchProtein function predictionprotein-protein interactions
Experimental approachIn principle feasible via systematic mutational perturbationsIn practice very tediousThe ultimate ground truth
The Endo16 Model (Eric Davidson, with tens of students/postdocs over years)
A thirty-year network for a single gene
24
Sources of some of the slides:
Mark GersteinMark Gerstein, Yale, YaleNirNir Friedman, Friedman, Hebrew UHebrew URon Ron ShamirShamir, TAU, TAUIrene Irene OngOng, Wisconsin, WisconsinGeorgGeorg Gerber, MITGerber, MIT
Acknowledgments
Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Pe'er. In Proc. Fourth Annual Inter. Conf. on Computational Molecular Biology RECOMB, 2000.
Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression, E. Segal, R. Yelensky, and D. Koller. Bioinformatics, 19(Suppl 1): 273--82, 2003.
A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules Joshua M. Stuart, Eran Segal, Daphne Koller, Stuart K. Kim, Science, 302 (5643):249-55, 2003.
Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Nature Genetics, 34(2): 166-76, 2003.
Learning Module Networks E. Segal, N. FriedmanD. Pe'er, A. Regev, and D. Koller.In Proc. Nineteenth Conf. on Uncertainty in Artificial Intelligence (UAI), 2003
From Promoter Sequence to Expression: A Probabilistic Framework, E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller. In Proc. 6th Inter. Conf. on Research in Computational Molecular Biology (RECOMB), 2002
Modelling Regulatory Pathways in E.coli from Time Series Expression Profiles, Ong, I. M., J. D. Glasner and D. Page, Bioinformatics, 18:241S-248S, 2002.
References
25
Sparse graphical models for exploring gene expression data, Adrian Dobra, Beatrix Jones, Chris Hans, Joseph R. Nevins and Mike West, J. Mult. Analysis, 90, 196-212, 2004.
Experiments in stochastic computation for high-dimensional graphical models, Beatrix Jones, Adrian Dobra, Carlos Carvalho, Chris Hans, Chris Carter and Mike West, Statistical Science, 2005.
Inferring regulatory network using a hierarchical Bayesian graphical gaussian model, Fan Li, Yiming Yang, Eric Xing (working paper) 2005.
Computational discovery of gene modules and regulatory networks. Z. Bar-Joseph*, G. Gerber*, T. Lee*, N. Rinaldi, J. Yoo, F. Robert, B. Gordon, E. Fraenkel, T. Jaakkola, R. Young, and D. Gifford. Nature Biotechnology, 21(11) pp. 1337-42, 2003.
Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data, A. Tanay, R. Sharan, M. Kupiec, R. Shamir Proc. National Academy of Science USA 101 (9) 2981--2986, 2004.
Informative Structure Priors: Joint Learning of Dynamic Regulatory Networks from Multiple Types of Data, Bernard, A. & Hartemink, A. In Pacific Symposium on Biocomputing 2005 (PSB05),
References