CS 5263 Bioinformatics

Reverse-engineering Gene Regulatory Networks

Genes and Proteins

[Figure: Gene (DNA) → mRNA by transcription (also called expression); mRNA → Protein by translation; protein (de)activation. Labeled control points: transcriptional regulation, translational regulation, post-translational regulation, mRNA degradation.]

Gene Regulatory Networks

• Functioning of cell controlled by interactions between genes and proteins

• Genetic regulatory network: genes, proteins, and their mutual regulatory interactions

[Figure: example network among gene 1, gene 2, and gene 3, with edges labeled activator, repressor, and repressor]

Reverse-engineering GRNs

• GRNs are large, complex, and dynamic

• Reconstruct the network from observed gene expression behaviors
– Experimental methods focus on a few genes only
– Computer-assisted analysis: large scale

• Since 1960s
– Theoretical study mostly

• Attracting much attention since the invention of microarray technology

• Emerging advanced large-scale assay techniques are making it even more feasible (ChIP-chip, ChIP-seq, etc.)

Problem Statement

• Assumption: expression value of a gene depends on the expression values of a set of other genes

• Given: a set of gene expression values under different conditions

• Goal: a function for each gene that predicts its expression value from expression of other genes
– Probabilistically: Bayesian network
– Boolean functions: Boolean network
– Linear functions: linear model
– Other possibilities such as decision trees, SVMs

Characteristics

• Gene expression data is often noisy, with missing values

• Only measures mRNA level
– Many genes regulated not only on the transcriptional level

• # genes >> # experiments. Underdetermined problem!

• Correlation ≠ causality

• Good news: Network structure is sparse (scale-free)

Methods for GRN inference

• Directed and undirected graphs
– E.g. KEGG, EcoCyc

• Boolean networks
– Kauffman (1969), Liang et al (1999), Shmulevich et al (2002), Lähdesmäki et al (2003)

• Bayesian networks
– Friedman et al (2000), Murphy and Mian (1999), Hartemink et al (2002)

• Linear/non-linear regression models
– D’Haeseleer et al (1999), Yeung et al (2002)

• Differential equations
– Chen, He & Church (1999)

• Neural networks
– Weaver, Workman and Stormo (1999)

Boolean Networks

• Genes are either on or off (expressed or not expressed)

• State of gene Xi at time t is a Boolean function of the states of some other genes at time t-1

[Figure: genes X, Y, Z at time t-1 wired to X’, Y’, Z’ at time t]

X Y Z | X’ Y’ Z’
0 0 0 | 0  0  0
0 0 1 | 0  0  0
0 1 0 | 1  0  1
0 1 1 | 0  0  1
1 0 0 | 0  1  0
1 0 1 | 0  1  0
1 1 0 | 1  1  1
1 1 1 | 0  1  1

X’ = Y and (not Z)

Y’ = X

Z’ = Y
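The update rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming synchronous updates; the function names `step` and `trajectory` are made up for this example and do not come from the slides.

```python
from itertools import product

def step(state):
    """Synchronously update (X, Y, Z) with the three rules from the slide."""
    x, y, z = state
    return (int(y and not z), x, y)  # X' = Y and (not Z), Y' = X, Z' = Y

# Rebuild the full truth table by enumerating all 2^3 states.
truth_table = {s: step(s) for s in product((0, 1), repeat=3)}

def trajectory(state, n_steps):
    """Return the list of states visited, starting from `state`."""
    states = [state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

for before, after in truth_table.items():
    print(before, "->", after)
```

Enumerating `truth_table` reproduces the eight rows of the table, which is a quick way to check that the rules and the table agree.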

Learning Boolean Networks for Gene Expression

• Assumptions:
– Deterministic (wiring does not change)
– Synchronized update
– All Boolean functions are probable

• Data needed: 2^N for N genes. (In comparison, N needed for linear models)

• General techniques: limit the # of inputs per gene (k). Data required reduced to 2^k log(N).

Learning Boolean Networks

• Consistency Problem
– Given: Examples S: {<In, Out>}, where In ∈ {0,1}^k, Out ∈ {0,1}
– Goal: learn Boolean function f such that for every <In, Out> ∈ S, f(In) = Out.
– Note:
  • Given the same input, the output is unique.
  • For k input variables, there are at most 2^k distinct input configurations.
– Example:
  <001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
  With inputs written as decimal numbers: 1,1 5,1 6,1 2,0 3,0 5,0

Learning Boolean Networks

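<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>

Truth vector over inputs 000..111: ?100?*1?

No clash -> consistency.

Question marks -> undetermined elements

* marks a clash: input 101 appears with both outputs

Time to check one set of k inputs: O(Mk), M is # of experiments

N genes, choose k from N:

N * C(N, k) * O(Mk)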

Best-fit problem: Find a function f with minimum # of errors

Limited error-size problem: Find all functions with error-size within max

Lähdesmäki et al, Machine Learning 2003;52: 147-167.
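The truth-vector check and the best-fit error count can be sketched as follows, using the six examples from the consistency-problem slide. This is an illustrative sketch for a single fixed set of k inputs, not the algorithm of Lähdesmäki et al; the function names are hypothetical.

```python
from collections import Counter

def truth_vector(examples, k):
    """Return a string of length 2^k: '0'/'1' where the output is determined,
    '?' where the input was never observed, '*' where outputs clash."""
    seen = {}
    for bits, out in examples:
        seen.setdefault(int(bits, 2), set()).add(out)
    symbols = []
    for idx in range(2 ** k):
        outs = seen.get(idx)
        if outs is None:
            symbols.append("?")
        elif len(outs) == 1:
            symbols.append(str(next(iter(outs))))
        else:
            symbols.append("*")
    return "".join(symbols)

def best_fit_errors(examples):
    """Minimum number of misclassified examples over all Boolean functions:
    for each observed input, the minority output is counted as error."""
    counts = Counter(examples)
    inputs = {bits for bits, _ in examples}
    return sum(min(counts[(b, 0)], counts[(b, 1)]) for b in inputs)

examples = [("001", 1), ("101", 1), ("110", 1),
            ("010", 0), ("011", 0), ("101", 0)]
vec = truth_vector(examples, k=3)
consistent = "*" not in vec
print(vec, consistent, best_fit_errors(examples))
```

One pass over the M examples fills the vector, which is where the per-input-set cost comes from; the best-fit count only needs the per-input tallies.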

State space and attractor basins

What are some biological interpretations of basins and attractors?
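For a small network the state space can be enumerated directly. The sketch below reuses the three-gene example (X’ = Y and not Z, Y’ = X, Z’ = Y) and groups every state by the attractor it eventually reaches; the helper names are assumptions made for this illustration.

```python
from itertools import product

def step(state):
    """One synchronous update of the three-gene example network."""
    x, y, z = state
    return (int(y and not z), x, y)

def find_attractor(state):
    """Follow the trajectory until a state repeats; return the cycle as a frozenset."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])

# Group all 2^3 states by the attractor they fall into (the attractor's basin).
basins = {}
for s in product((0, 1), repeat=3):
    basins.setdefault(find_attractor(s), []).append(s)

for attractor, basin in basins.items():
    print("attractor", sorted(attractor), "basin size", len(basin))
```

In this example there are two attractors: the fixed point 000 and a cycle of length two between 010 and 101. Exhaustive enumeration only works for small N, since the state space has 2^N states.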

Linear Models

• Expression level of gene at time t depends linearly on the expression levels of some genes at time t-1

[Figure: genes X1, X2, X3 at time t-1 linked to X1, X2, X3 at time t by weighted edges W_ij (for example W11, W21, W31, W32, W33)]

o Basic model: X_i(t) = Σ_j W_ij X_j(t-1)

o X_i’(t) = Σ_j A_ij X_j(t), where X_i(t) can be measured, X_i’(t) can be estimated from X_i(t)

o In matrix form: X’_{N×M} = A_{N×N} · X_{N×M}, where M is the number of time points, N is the number of genes

Linear Models (cont’d)

• X’_{N×M} = A_{N×N} · X_{N×M}

• A_{N×N}: connectivity matrix, A_ij describes the type and strength of the influence of the jth gene on the ith gene.

• To solve for A, need to solve MN linear equations in N^2 unknowns

• In general N^2 >> MN, therefore under-determined => infinite number of solutions
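A small NumPy experiment makes the under-determination concrete. The sketch below uses made-up sizes (6 genes, 3 time points) and random data, and it assumes the pseudoinverse as one convenient way to pick a particular solution; none of the variable names come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 3  # hypothetical sizes: 6 genes, 3 time points (N > M)

A_true = rng.normal(size=(N, N))  # made-up "true" connectivity matrix
X = rng.normal(size=(N, M))       # expression levels, genes x time points
X_dot = A_true @ X                # derivatives implied by X' = A X

# One particular solution: the minimum-norm fit obtained from the pseudoinverse.
A0 = X_dot @ np.linalg.pinv(X)

# Directions orthogonal to the columns of X can be added to any row of A
# without changing A @ X, so the data cannot tell A0 and A_alt apart.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
null_dir = U[:, M:]               # N x (N - M) basis orthogonal to the columns of X
A_alt = A0 + rng.normal(size=(N, N - M)) @ null_dir.T

fits_A0 = np.allclose(A0 @ X, X_dot)
fits_alt = np.allclose(A_alt @ X, X_dot)
print(fits_A0, fits_alt, np.linalg.norm(A0), np.linalg.norm(A_alt))
```

Both matrices reproduce the observations exactly, yet they differ from each other and from `A_true`. `A0` has the smallest sum of squared entries among all exact fits, which is why extra assumptions such as sparsity are needed to choose among them.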

Get Around The Curse of Dimensionality

• Non-linear interpolation to increase # of time points

• Cluster genes to reduce # of genes

• Singular Value Decomposition (SVD)
– A = A0 + C_{N×N} · V^T_{N×N}, where c_ij = 0 if j > M
– Take A0 as a solution, guaranteed smallest sum of squares.

• Robust regression
– Minimize # of edges in the network
– Biological networks are sparse (scale-free)

[Figure: the matrix C_{N×N}, with the region where c_ij = 0 marked]

Robust Regression

• A = A0 + C_{N×N} · V^T_{N×N}

• Minimizing # of non-zero entries in A by selecting C
– Set A = 0, then C · V^T = -A0, solve for C.
– Over-determined (N^2 equations, MN free variables).

• Robust regression
– Fit a hyper-plane to a set of points by passing through as many points as possible

Simulation Experiments

[Figure panels: SVD + Robust Regression; SVD alone]

Yeung et al, PNAS. 2002;99:6163-8.

Simulation Experiments (cont’d)

[Figure panels: Linear System; Nonlinear System close to steady state]

Does not work for nonlinear system not close to steady state

Scale-free property does not hold on small networks

Bayesian Networks

• A DAG G (V, E), where
– Vertex: a random variable
– Edge: conditional distribution for a variable, given its parents in G.

• Markov assumption: ∀i, I(X_i, non-descendants(X_i) | Pa_G(X_i)), e.g. I(X3, X4 | X2), I(X1, X5 | X3)

[Figure: DAG with edges X1 → X3, X2 → X3, X2 → X4, X3 → X5]

Chain rule: P(X1, X2, …, Xn) = Π_i P(Xi | Pa_G(Xi)), i = 1..n

P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)

Learning: argmax_G P(G | D), where P(G | D) = P(D | G) * P(G) / C
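The factorization for the five-variable example can be evaluated directly. In the sketch below every variable is binary and every probability in the tables is a made-up number chosen only for illustration; the names `bern` and `joint` are hypothetical.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are invented).
p_x1 = 0.3                                                   # P(X1 = 1)
p_x2 = 0.6                                                   # P(X2 = 1)
p_x3 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(X3 = 1 | X1, X2)
p_x4 = {0: 0.2, 1: 0.7}                                      # P(X4 = 1 | X2)
p_x5 = {0: 0.05, 1: 0.8}                                     # P(X5 = 1 | X3)

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """P(X1..X5) = P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)."""
    return (bern(p_x1, x1) * bern(p_x2, x2) * bern(p_x3[(x1, x2)], x3)
            * bern(p_x4[x2], x4) * bern(p_x5[x3], x5))

# The factorized joint must sum to 1 over all 2^5 assignments.
total = sum(joint(*xs) for xs in product((0, 1), repeat=5))
print(total, joint(1, 1, 1, 1, 1))
```

With binary variables this factorization needs 1 + 1 + 4 + 2 + 2 = 10 parameters, compared with 31 for an unrestricted joint distribution over five binary variables.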

Bayesian Networks (Cont’d)

• Equivalence classes of Bayesian Networks
– Same topology, different edge directions
– Can not be distinguished from observation

• Causality
– Bayesian network does not directly imply causality
– Can be inferred from observation with certain assumptions:
  • no hidden common cause
  • ……

[Figure: alternative graphs over A, B, and C, including ones encoding I(A, B | C), one with C as a hidden variable, and the corresponding PDAG]

Bayesian Networks for Gene Expression

• Deals with noisy data well, reflects stochastic nature of gene expression

• Indication of causality

• Practical issues:
– Learning is NP-hard
– Over-fitting
– Equivalence classes of graphs

• Solution:
– Heuristic search, sparse candidate
– Model averaging
– Learning partial models

[Figure: example network over Gene A, Gene B, Gene C, Gene D, and Gene E; local distribution P(D | E): multinomial or linear]

Other variables can be added, such as promoter sequences, experiment conditions and time.

Learning Bayesian Nets

• Find G to maximize Score (G | D), where
– Score(G | D) = Σ_i Score (Xi, Pa_G(Xi) | D)

• Hill-climbing
– Edge addition, edge removal, edge reversal

• Divide-and-conquer
– Solve for sub-graphs

• Sparse candidate algorithm
– Limit the number of candidate parents for each variable. (Biological implications – sparse graph)
– Iteratively modifying the candidate set

Partial Models (Features)

• Model Averaging
– Learn many models, common sub-graphs will be more likely to be true
– Confidence measure: # of times a sub-graph appeared
– Method: bootstrap

• Markov relations
– A is in B’s Markov blanket iff there is an edge between A and B, or A and B are both parents of some C
– A and B in some joint biological interaction

• Order relations
– A → … → B: A is a cause of B
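The bootstrap confidence measure can be sketched independently of the learner. In the sketch below, `learn_edges` is a deliberately simple stand-in (a correlation threshold), not Bayesian network learning; the data are synthetic, and the sizes (76 samples, 200 resamples) only echo the yeast experiment described in these slides.

```python
import numpy as np

def learn_edges(data, threshold=0.8):
    """Stand-in learner (NOT Bayesian network learning): report a feature (i, j)
    when the absolute Pearson correlation of columns i and j exceeds a threshold."""
    corr = np.corrcoef(data, rowvar=False)
    n_vars = corr.shape[0]
    return {(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)
            if abs(corr[i, j]) > threshold}

def bootstrap_confidence(data, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each feature is learned."""
    rng = np.random.default_rng(seed)
    n_samples = data.shape[0]
    counts = {}
    for _ in range(n_boot):
        idx = rng.integers(0, n_samples, size=n_samples)  # rows drawn with replacement
        for edge in learn_edges(data[idx]):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

# Synthetic data: B closely tracks A, C is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=76)
b = a + 0.05 * rng.normal(size=76)
c = rng.normal(size=76)
conf = bootstrap_confidence(np.column_stack([a, b, c]))
print(conf)
```

Swapping `learn_edges` for a real structure learner that returns Markov or order relations would leave `bootstrap_confidence` unchanged, which is the point of treating features as the unit of confidence.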

Experimental Results

• Real biological data set: Yeast cell cycle data

• 800 genes, 76 experiments, 200-fold bootstrap

• Test for significance and robustness
– More high-scoring features in real data than in randomized data
– Order relations are more robust than Markov relations with respect to local probability models.

Markov Relations

Friedman et al, J Comput Biol. 2000;7:601-20

Transcriptional regulatory network

• Who regulates whom?
• When?
• Where?
• How?

[Figure: TF binding at the promoter of a gene. Four example promoters with TFs A and B and RNA-Pol, each labeled with its logic:]
– g1: A and not B
– g2: A and B
– g3: A or B
– g4: not (A and B)

PNAS 2003;100(9):5136-41

Data-driven vs. model-driven methods

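[Figure: expression matrix (gene × condition) feeding two pipelines. Data-driven: clustering → MF (motif finding) → post-processing → biological insights (descriptive). Model-driven: learning → model → biological insights (explanatory, predictive).]

Model: “A description of a process that could have generated the observed data”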

Data-driven approaches

• Assumption
– Co-expressed genes are likely co-regulated: not necessarily true

• Limitations:
– Clustering is subjective
– Statistically over-represented but non-functional “junk” motifs
– Hard to find combinatorial motifs

[Figure: expression matrix (genes × experiments) → clustering (hierarchical, K-means, …) → motif finding (MEME, Gibbs, AlignACE, …)]

Model-based approaches

• Intuition: find motifs that are not only statistically over-represented, but are also associated with the expression patterns
– E.g., a motif appears in many up-regulated genes but very few other genes => real motif?

• Model: gene expression = f (TF binding motifs, TF activities)

• Goal: find the function that
– Can explain the observed data and predict future data
– Captures true relationships among motifs, TFs and expression of genes

Transcription modeling

[Figure: promoters of genes g1–g8 scored for motifs (the variables) and paired with expression values (the gene labels); the expression of a new gene is marked “?”]

e = f (m1, m2, m3, m4)

Assume that gene expression levels under a certain condition are a function of some TF binding motifs on their promoters.

Different modeling approaches

• Many different models, each with its own limitations

• Classification models
– Decision tree, support vector machine (SVM), naïve Bayes, …

• Regression models
– Linear regression, regression tree, …

• Probabilistic models
– Bayesian networks, probabilistic Boolean networks, …

Decision tree

[Figure: motif matrix (m1–m4) for genes g1–g8 with expression label e, and a tree with yes/no splits on m1, m2, and m4 ending in leaves A, B, C, D; the leaves hold genes {3, 6}, {4}, {1, 2, 5}, and {7, 8}]

• Tree structure is learned from data
– Only relevant variables (motifs) are used
– Many possible trees, the smallest one is preferred

• Advantages:
– Easy to interpret
– Can represent complex logic relationships

e = f (m1, m2, m3, m4)
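How a tree picks "only relevant" motifs can be illustrated with a small ID3-style learner that splits on the motif with the highest information gain. This is a sketch under assumptions: the eight-gene motif matrix and labels below are invented toy data (the label is 1 only when both m1 and m2 are present), and information gain is one common split criterion, not necessarily the one used in the work cited in these slides.

```python
from math import log2

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(rows, labels, motif):
    """Entropy reduction from splitting the genes on presence of `motif`."""
    yes = [lab for row, lab in zip(rows, labels) if row[motif]]
    no = [lab for row, lab in zip(rows, labels) if not row[motif]]
    n = len(labels)
    return entropy(labels) - (len(yes) / n * entropy(yes) + len(no) / n * entropy(no))

def build_tree(rows, labels, motifs):
    """Return a nested dict {'motif', 'yes', 'no'} or a 0/1 leaf label."""
    majority = int(2 * sum(labels) > len(labels))
    if len(set(labels)) == 1 or not motifs:
        return majority
    best = max(motifs, key=lambda m: information_gain(rows, labels, m))
    if information_gain(rows, labels, best) <= 0:
        return majority  # no motif separates the labels any further
    rest = [m for m in motifs if m != best]
    branches = {}
    for name, keep in (("yes", True), ("no", False)):
        subset = [(r, l) for r, l in zip(rows, labels) if bool(r[best]) == keep]
        branches[name] = build_tree([r for r, _ in subset], [l for _, l in subset], rest)
    return {"motif": best, **branches}

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["motif"]] else tree["no"]
    return tree

# Toy data (invented): motif presence for g1..g8 and an up-regulation label e.
rows = [
    {"m1": 1, "m2": 1, "m3": 0, "m4": 0},
    {"m1": 1, "m2": 1, "m3": 1, "m4": 0},
    {"m1": 1, "m2": 0, "m3": 0, "m4": 0},
    {"m1": 0, "m2": 0, "m3": 1, "m4": 0},
    {"m1": 0, "m2": 1, "m3": 0, "m4": 1},
    {"m1": 0, "m2": 0, "m3": 1, "m4": 1},
    {"m1": 0, "m2": 1, "m3": 1, "m4": 0},
    {"m1": 0, "m2": 0, "m3": 0, "m4": 1},
]
labels = [1, 1, 0, 0, 0, 0, 0, 0]
tree = build_tree(rows, labels, ["m1", "m2", "m3", "m4"])
print(tree)
```

On this toy data the learned tree tests only m1 and then m2, so m3 and m4 never appear, and the AND relationship between the two motifs is read directly off the tree.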

A real example: transcriptional regulation of yeast stress response

• 52 genes up-regulated in heat-shock (positive)
• 156 random non-responsive genes (negative)
• 356 known motifs

Small tree: only used 4 motifs

All 4 motifs are well-known to be stress-related

RRPE-PAC combination well-known

[Figure: decision tree with Yes/No splits on RRPE, PAC, FHL1, and RAP1; leaf counts: 11 (+) / 1 (-); 3 (+) / 4 (-); 23 (+); 10 (+) / 151 (-); 5 (+)]

Model network in Science, 2002;298(5594):799-804

Network by our method (Ruan et al., BMC Genomics, 2009)

Application to yeast cell-cycle genes

http://www.sciencemag.org/content/vol298/issue5594/images/large/se4120971004.jpeg

Regression tree

• Similar to decision tree

• Difference: each terminal node predicts a range of real values instead of a label

[Figure: motif matrix (m1–m4) for genes g1–g8 with real-valued expression e, and a tree with yes/no splits on m1, m2, and m4; each leaf is labeled with a range of e]

e = f (m1, m2, m3, m4)
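The difference from a decision tree shows up in the split criterion and the leaf output. The sketch below chooses the motif whose yes/no split gives the smallest total sum of squared errors and reports the range of expression values in each leaf. The squared-error criterion is an assumption borrowed from common regression-tree practice, and the motif matrix and expression values are invented toy data.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split(rows, expr, motif):
    """Partition expression values by presence (yes) or absence (no) of a motif."""
    yes = [e for r, e in zip(rows, expr) if r[motif]]
    no = [e for r, e in zip(rows, expr) if not r[motif]]
    return yes, no

def best_split(rows, expr):
    """Return (motif, total_sse) for the split that minimizes within-leaf SSE."""
    best = None
    for motif in rows[0]:
        yes, no = split(rows, expr, motif)
        if not yes or not no:
            continue  # a split that leaves one side empty is useless
        total = sse(yes) + sse(no)
        if best is None or total < best[1]:
            best = (motif, total)
    return best

# Toy data (invented): motif presence for g1..g8 and real-valued expression e.
rows = [
    {"m1": 1, "m2": 1, "m3": 0, "m4": 0},
    {"m1": 1, "m2": 1, "m3": 1, "m4": 0},
    {"m1": 1, "m2": 0, "m3": 0, "m4": 0},
    {"m1": 0, "m2": 0, "m3": 1, "m4": 0},
    {"m1": 0, "m2": 1, "m3": 0, "m4": 1},
    {"m1": 0, "m2": 0, "m3": 1, "m4": 1},
    {"m1": 0, "m2": 1, "m3": 1, "m4": 0},
    {"m1": 0, "m2": 0, "m3": 0, "m4": 1},
]
expr = [2.5, 3.0, 2.0, -0.5, 0.0, -1.0, 0.5, -0.5]

motif, total = best_split(rows, expr)
yes, no = split(rows, expr, motif)
leaf_ranges = {"yes": (min(yes), max(yes)), "no": (min(no), max(no))}
print(motif, total, leaf_ranges)
```

Applying `best_split` recursively to each side would grow the full tree; a single level is enough to show that the leaves carry value ranges rather than class labels.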

Multivariate regression tree

• Multivariate labels: use multiple experiments simultaneously
• Use motifs to classify genes into co-expressed groups
• Does not need clustering in advance

[Figure: motif matrix (m1–m4) for genes g1–g8 with expression across experiments e1–e5, and a tree with yes/no splits on m1, m2, and m4; the leaves hold genes {3, 6, 8}, {1, 2, 5}, {4}, and {7}]

Phuong,T., et. al., Bioinformatics, 2004

Modeling with TF activities

• Gene expression = f (binding motifs, TF activities)

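[Figure: expression of tf1–tf4 and target gene g across experiments e1–e5; the matrix is rotated so that experiments become the samples, and a tree splits on the expression of tf1 (> 0 or not) to predict whether g > 0]

g = f (tf1, tf2, tf3, tf4)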

Soinov et al., Genome Biol, 2003

A Decision Tree Model

Segal et al. Nat Genet. 2003,34(2):166-76.

[Figure: a decision tree model of gene expression over a gene × experiment matrix]

Algorithm BDTree

• Gene expression = f (binding motifs, TF activities)

• Ruan & Zhang, Bioinformatics 2006

• Basic idea:
– Iteratively partition an expression matrix by splitting genes or experiments
– Split of genes is according to motif scores
– Split of conditions is according to TF expression levels
– The algorithm decides the best motifs or TFs to use

Transcriptional regulation of yeast stress response

• 173 experiments under ~20 stress conditions

• 1411 differentially expressed genes

• ~1200 putative binding motifs
– Combination of ChIP-chip data, PWMs, and over-represented k-mers (k = 5, 6, 7)

• 466 TFs

[Figure: expression matrix, genes × experiments, partitioned into blocks]

…

Genes with motifs FHL1 but no RRPE are down-regulated when Ppt1 is down-regulated and Yfl052w is up-regulated

Genes with motifs RRPE & PAC are down-regulated when TFs Tpk1 & Kin82 are up-regulated

…

Biological validation

• Most motifs and TFs selected by the tree are well-known to be stress-related
– E.g., motifs RRPE, PAC, FHL1, TFs Tpk1 and Ppt1

• 42 / 50 blocks are significantly enriched with some Gene Ontology (GO) functional terms

• 45 / 50 blocks are significantly enriched with some experimental conditions

RRPE & PAC, ribosome biogenesis (60/94, p < e-65)

FHL1, protein biosynthesis (98/105, p<e-87)

STRE (agggg), carbohydrate metabolism (p < e-20)

RRPE only, ribosome biogenesis (28/99, p < e-18)

Nitrogen metabolism

PAC

Relationship between methods

• A, C: from promoter to expression
– A: single cond
– C: multi conds

• B, D: from expression to expression
– B: single gene
– D: multi genes

[Figure: matrices over genes g1–g8, motifs m1–m4, TFs t1–t4, and conditions c1–c5, with regions labeled A, B, C, D for the four method types]
