CS 5263 Bioinformatics

CS 5263 Bioinformatics Reverse-engineering Gene Regulatory Networks


Page 1: CS 5263 Bioinformatics

CS 5263 Bioinformatics

Reverse-engineering Gene Regulatory Networks

Page 2: CS 5263 Bioinformatics

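Genes and Proteins

[Diagram: Gene (DNA) → mRNA → Protein]

• Gene (DNA) → mRNA: transcription (also called expression), subject to transcriptional regulation

• mRNA → Protein: translation, subject to translational regulation

• mRNA is also removed by mRNA degradation

• Protein: (de)activation, subject to post-translational regulation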

Page 3: CS 5263 Bioinformatics

Gene Regulatory Networks

• Functioning of cell controlled by interactions between genes and proteins

• Genetic regulatory network: genes, proteins, and their mutual regulatory interactions

[Figure: example network of gene 1, gene 2, and gene 3, with one edge labeled activator and two edges labeled repressor]

Page 4: CS 5263 Bioinformatics

Reverse-engineering GRNs

• GRNs are large, complex, and dynamic

• Reconstruct the network from observed gene expression behaviors
– Experimental methods focus on a few genes only
– Computer-assisted analysis: large scale

• Since 1960s
– Theoretical study mostly

• Attracting much attention since the invention of microarray technology

• Emerging advanced large-scale assay techniques are making it even more feasible (ChIP-chip, ChIP-seq, etc.)

Page 5: CS 5263 Bioinformatics

Problem Statement

• Assumption: expression value of a gene depends on the expression values of a set of other genes

• Given: a set of gene expression values under different conditions

• Goal: a function for each gene that predicts its expression value from expression of other genes
– Probabilistically: Bayesian network
– Boolean functions: Boolean network
– Linear functions: linear model
– Other possibilities such as decision trees, SVMs

Page 6: CS 5263 Bioinformatics

Characteristics

• Gene expression data is often noisy, with missing values

• Only measures mRNA level
– Many genes regulated not only on the transcriptional level

• # genes >> # experiments: an underdetermined problem!

• Correlation ≠ causality

• Good news: network structure is sparse (scale-free)

Page 7: CS 5263 Bioinformatics

Methods for GRN inference

• Directed and undirected graphs
– E.g. KEGG, EcoCyc

• Boolean networks
– Kauffman (1969), Liang et al (1999), Shmulevich et al (2002), Lähdesmäki et al (2003)

• Bayesian networks
– Friedman et al (2000), Murphy and Mian (1999), Hartemink et al (2002)

• Linear/non-linear regression models
– D’Haeseleer et al (1999), Yeung et al (2002)

• Differential equations
– Chen, He & Church (1999)

• Neural networks
– Weaver, Workman and Stormo (1999)

Page 8: CS 5263 Bioinformatics

Boolean Networks

• Genes are either on or off (expressed or not expressed)

• State of gene Xi at time t is a Boolean function of the states of some other genes at time t-1

[Figure: wiring diagram from genes X, Y, Z at time t-1 to X’, Y’, Z’ at time t]

X Y Z X’ Y’ Z’

0 0 0 0 0 0

0 0 1 0 0 0

0 1 0 1 0 1

0 1 1 0 0 1

1 0 0 0 1 0

1 0 1 0 1 0

1 1 0 1 1 1

1 1 1 0 1 1

X’ = Y and (not Z)

Y’ = X

Z’ = Y
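
The synchronous update above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `step` and `truth_table` are made up for the example, and the rules are the three given above.

```python
from itertools import product

def step(state):
    """One synchronous update: X' = Y and (not Z), Y' = X, Z' = Y."""
    x, y, z = state
    return (int(y and not z), x, y)

def truth_table():
    """Map every state (X, Y, Z) to its successor (X', Y', Z')."""
    return {s: step(s) for s in product((0, 1), repeat=3)}

# Print the table in the same column order as the slide: X Y Z X' Y' Z'
for s, nxt in truth_table().items():
    print(*s, *nxt)
```

Each printed row should match the corresponding row of the table above.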

Page 9: CS 5263 Bioinformatics

Learning Boolean Networks for Gene Expression

• Assumptions:
– Deterministic (wiring does not change)
– Synchronized update
– All Boolean functions are probable

• Data needed: 2^N for N genes. (In comparison, N needed for linear models)

• General technique: limit the # of inputs per gene (k). Data required is reduced to 2^k log(N).

Page 10: CS 5263 Bioinformatics

Learning Boolean Networks

• Consistency Problem
– Given: examples S = {<In, Out>}, where In ∈ {0,1}^k and Out ∈ {0,1}
– Goal: learn a Boolean function f such that for every <In, Out> ∈ S, f(In) = Out
– Note:
  • Given the same input, the output is unique.
  • For k input variables, there are at most 2^k distinct input configurations.
– Example (binary inputs, then the same pairs with the input written in decimal):

<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>

1,1 5,1 6,1 2,0 3,0 5,0

Page 11: CS 5263 Bioinformatics

Learning Boolean Networks

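<001,1><101,1> <110,1> <010,0> <101,1><101,0>

Truth table over inputs 000..111 (? = undetermined, * = clash): ?100?*1?

• No clash -> consistency

• Question marks -> undetermined elements

• Checking one candidate input set takes O(Mk), where M is the # of experiments

• For N genes, choosing k inputs from N: N * C(N, k) * O(Mk) in total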

Best-fit problem: Find a function f with minimum # of errors

Limited error-size problem: Find all functions with error-size within max

Lähdesmäki et al, Machine Learning 2003;52: 147-167.
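
The consistency check can be sketched as one pass over the examples that fills a 2^k-entry truth table. This is a minimal illustration under the notation used above (`?` for undetermined, `*` for a clash); the function names are hypothetical, and the example data are the six pairs from the Consistency Problem page.

```python
def build_truth_table(examples, k):
    """Fill a 2**k-entry table from <In, Out> pairs.

    '?' marks an input that was never observed; '*' marks a clash,
    i.e. the same input observed with both outputs.
    """
    table = ["?"] * (2 ** k)
    for bits, out in examples:
        idx = int(bits, 2)          # binary input string -> table index
        val = str(out)
        if table[idx] == "?":
            table[idx] = val
        elif table[idx] != val:
            table[idx] = "*"
    return "".join(table)

def is_consistent(examples, k):
    """The examples are consistent iff no table entry is a clash."""
    return "*" not in build_truth_table(examples, k)

examples = [("001", 1), ("101", 1), ("110", 1),
            ("010", 0), ("011", 0), ("101", 0)]
print(build_truth_table(examples, 3))   # ?100?*1?
print(is_consistent(examples, 3))       # False: input 101 has both outputs
```

For the best-fit problem, the same table could store counts of 0s and 1s per input and take the majority output, so that the minority count is the number of errors.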

Page 12: CS 5263 Bioinformatics
Page 13: CS 5263 Bioinformatics

State space and attractor basins
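
As a small illustration, the attractors and basins of the three-gene example network from the Boolean Networks page (X’ = Y and (not Z), Y’ = X, Z’ = Y) can be enumerated by following every state until it repeats. This is a sketch; `find_attractor` and `basins` are names chosen for the example.

```python
from itertools import product

def step(state):
    """Synchronous update of the example network."""
    x, y, z = state
    return (int(y and not z), x, y)

def find_attractor(state):
    """Follow the trajectory until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    cycle = seen[seen.index(state):]
    # Rotate so the same cycle always starts at its smallest state.
    i = cycle.index(min(cycle))
    return tuple(cycle[i:] + cycle[:i])

# Group all 2^3 states by the attractor they flow into (the basins).
basins = {}
for s in product((0, 1), repeat=3):
    basins.setdefault(find_attractor(s), []).append(s)

for attractor, members in basins.items():
    print(attractor, "<-", members)
```

For this network the sketch finds one fixed point (000) with a basin of five states and one two-state cycle (010, 101) with a basin of three states.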

Page 14: CS 5263 Bioinformatics

What are some biological interpretations of basins and attractors?

Page 15: CS 5263 Bioinformatics
Page 16: CS 5263 Bioinformatics

Linear Models

• Expression level of gene at time t depends linearly on the expression levels of some genes at time t-1

[Figure: genes X1, X2, X3 at time t-1 connected to X1, X2, X3 at time t by edges with weights W_ij (W11, W21, W31, W32, W33)]

o Basic model: X_i(t) = Σ_j W_ij X_j(t-1)

o X_i’(t) = Σ_j A_ij X_j(t), where X_i(t) can be measured and X_i’(t) can be estimated from X_i(t)

o In matrix form: X’_{N×M} = A_{N×N} · X_{N×M}, where M is the number of time points and N is the number of genes
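
When there are more observations than genes, the weights of the basic model can be recovered by ordinary least squares. The sketch below assumes noise-free synthetic data and uses independent random states as X(t-1), a simplification of a real time series; `W_true`, `X_prev`, and `X_next` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 10                        # genes, observed transitions (M > N)
W_true = rng.normal(size=(N, N))    # hypothetical ground-truth weights
X_prev = rng.normal(size=(N, M))    # X(t-1), one column per observation
X_next = W_true @ X_prev            # X(t) = W X(t-1), noise-free

# Least squares: solve X_prev^T W^T = X_next^T for W^T.
W_T, *_ = np.linalg.lstsq(X_prev.T, X_next.T, rcond=None)
W_hat = W_T.T

print(np.allclose(W_hat, W_true))
```

With real expression data M is usually much smaller than N, so this direct fit no longer has a unique answer.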

Page 17: CS 5263 Bioinformatics

Linear Models (cont’d)

• X’_{N×M} = A_{N×N} · X_{N×M}

• A_{N×N}: connectivity matrix; A_ij describes the type and strength of the influence of the jth gene on the ith gene.

• To solve for A, need to solve MN linear equations

• In general N^2 >> MN, therefore under-determined => infinite number of solutions

Page 18: CS 5263 Bioinformatics

Get Around The Curse of Dimensionality

• Non-linear interpolation to increase # of time points

• Cluster genes to reduce # of genes

• Singular Value Decomposition (SVD)
– A = A0 + C_{N×N} · V^T_{N×N}, where c_ij = 0 if j > M
– Take A0 as a solution, guaranteed smallest sum of squares.

• Robust regression
– Minimize # of edges in the network
– Biological networks are sparse (scale-free)

[Figure: structure of C_{N×N}, marking the entries with c_ij = 0]
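
The family of solutions A = A0 + C · V^T can be checked numerically. The sketch below is an illustration on random data, not the authors' code. One assumption to note: NumPy orders singular values from largest to smallest, so here the free columns of C are the last N - M ones; the index convention on the slide may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 3                          # more genes than time points
A_true = rng.normal(size=(N, N))     # hypothetical connectivity matrix
X = rng.normal(size=(N, M))
X_dot = A_true @ X                   # X' = A X

# SVD of X^T (M x N): X^T = U diag(s) V^T, with V^T of size N x N.
U, s, Vt = np.linalg.svd(X.T, full_matrices=True)

# Particular solution with the smallest sum of squares.
A0 = X_dot @ U @ np.diag(1.0 / s) @ Vt[:M]

# Any C that is zero in the columns paired with non-zero singular
# values yields another exact solution A0 + C V^T.
C = np.zeros((N, N))
C[:, M:] = rng.normal(size=(N, N - M))
A_alt = A0 + C @ Vt

print(np.allclose(A0 @ X, X_dot), np.allclose(A_alt @ X, X_dot))
print(np.linalg.norm(A0), np.linalg.norm(A_alt))
```

Both matrices reproduce the data exactly, and A0 has the smaller norm, which is why extra criteria such as sparsity are needed to choose among the solutions.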

Page 19: CS 5263 Bioinformatics

Robust Regression

• A = A0 + C_{N×N} · V^T_{N×N}

• Minimize # of non-zero entries in A by selecting C
– Set A = 0, then C · V^T = -A0; solve for C.
– Over-determined (N^2 equations, MN free variables).

• Robust regression
– Fit a hyper-plane to a set of points by passing through as many points as possible

[Figure: scatter plot with a fitted line; both axes range from 0 to 6]
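
One common way to fit a hyper-plane that passes through as many points as possible is least-absolute-deviation (L1) regression. The sketch below uses iteratively reweighted least squares as a stand-in; it is an assumption that this matches the robust regression used in the slides, and the data, `l1_fit`, and its parameters are made up for the example.

```python
import numpy as np

def l1_fit(A, y, n_iter=100, eps=1e-8):
    """Approximate the L1 regression solution by iteratively reweighted least squares."""
    w = np.ones(len(y))
    for _ in range(n_iter):
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        r = np.abs(y - A @ beta)
        w = 1.0 / np.maximum(r, eps)   # small residual -> large weight
    return beta

# Ten points, eight exactly on y = 2x + 1 and two outliers.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[2] += 15.0
y[7] -= 20.0
A = np.column_stack([x, np.ones_like(x)])

ols, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
l1 = l1_fit(A, y)                             # robust fit

print("least squares:", ols)
print("L1 fit:       ", l1)
```

The least-squares line is pulled toward the outliers, while the L1 fit stays on the line through the eight exact points. In the network setting the same idea drives many entries of A to exactly zero.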

Page 20: CS 5263 Bioinformatics

Simulation Experiments

[Figure panels: SVD + Robust Regression; SVD alone]

Yeung et al, PNAS. 2002;99:6163-8.

Page 21: CS 5263 Bioinformatics

Simulation Experiments (cont’d)

[Figure panels: Linear System; Nonlinear System close to steady state]

• Does not work for nonlinear system not close to steady state

• Scale-free property does not hold on small networks

Page 22: CS 5263 Bioinformatics

Bayesian Networks

• A DAG G (V, E), where
– Vertex: a random variable
– Edge: conditional distribution for a variable, given its parents in G.

• Markov assumption: ∀i, I (X_i, non-descendant(X_i) | Pa_G(X_i))
– e.g. I(X3, X4 | X2), I(X1, X5 | X3)

[Figure: DAG with edges X1 → X3, X2 → X3, X2 → X4, X3 → X5]

Chain rule: P(X1, X2, …, Xn) = Π_i P(X_i | Pa_G(X_i)), i = 1..n

P (X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P (X4 | X2) P(X5 | X3)

Learning: argmax_G P (G | D), where P (G | D) = P (D | G) * P (G) / C
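
The chain rule and the Markov assumption can be checked numerically for the five-variable example. The sketch below assumes binary variables and fills the conditional probability tables with random, purely hypothetical values; it builds the joint distribution from the factorization above and verifies I(X3, X4 | X2).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Parents of each node, 0-based: X1..X5 -> indices 0..4.
parents = {0: (), 1: (), 2: (0, 1), 3: (1,), 4: (2,)}

# Hypothetical CPTs: cpt[i][parent values] = P(X_i = 1 | parents).
cpt = {i: {pv: rng.uniform(0.1, 0.9)
           for pv in itertools.product((0, 1), repeat=len(pa))}
       for i, pa in parents.items()}

def joint(x):
    """Chain rule: product of P(X_i | Pa(X_i)) over all nodes."""
    p = 1.0
    for i, pa in parents.items():
        p1 = cpt[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

P = np.zeros((2,) * 5)
for x in itertools.product((0, 1), repeat=5):
    P[x] = joint(x)

# Marginalize out X1 and X5; remaining axes are X2, X3, X4.
P234 = P.sum(axis=(0, 4))

def independent_given_x2(P234):
    """True if P(X3, X4 | X2) factorizes as P(X3 | X2) P(X4 | X2)."""
    for x2 in (0, 1):
        cond = P234[x2] / P234[x2].sum()
        product_form = np.outer(cond.sum(axis=1), cond.sum(axis=0))
        if not np.allclose(cond, product_form):
            return False
    return True

print(P.sum(), independent_given_x2(P234))
```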

Page 23: CS 5263 Bioinformatics

Bayesian Networks (Cont’d)

• Equivalence classes of Bayesian Networks
– Same topology, different edge directions
– Cannot be distinguished from observation

• Causality
– Bayesian network does not directly imply causality
– Can be inferred from observation with certain assumptions:
  • no hidden common cause
  • ……

[Figure: equivalent graphs over A, B, C that encode I (A, B | C); the corresponding PDAG; a graph in which C is a hidden variable]

Page 24: CS 5263 Bioinformatics

Bayesian Networks for Gene Expression

• Deals with noisy data well, reflects stochastic nature of gene expression

• Indication of causality

• Practical issues:
– Learning is NP-hard
– Over-fitting
– Equivalence classes of graphs

• Solution:
– Heuristic search, sparse candidate
– Model averaging
– Learning partial models

[Figure: example network over Gene A, Gene B, Gene C, Gene D, Gene E; local model P(D | E): multinomial or linear]

Other variables can be added, such as promoter sequences, experiment conditions and time.

Page 25: CS 5263 Bioinformatics

Learning Bayesian Nets

• Find G to maximize Score (G | D), where
– Score(G | D) = Σ_i Score (X_i, Pa_G(X_i) | D)

• Hill-climbing
– Edge addition, edge removal, edge reversal

• Divide-and-conquer
– Solve for sub-graphs

• Sparse candidate algorithm
– Limit the number of candidate parents for each variable. (Biological implication: sparse graph)
– Iteratively modify the candidate set
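
Because the score decomposes over variables, a local search only needs to re-score the families it changes. The sketch below uses a BIC-style score on binary data as one possible choice of Score; the slides do not say which score is used, and the data and function names are made up for the example.

```python
import math
import numpy as np

def family_score(data, child, parents):
    """BIC-style score of one binary variable given a candidate parent set."""
    n = data.shape[0]
    counts = {}                      # parent configuration -> [#child=0, #child=1]
    for row in data:
        key = tuple(int(row[j]) for j in parents)
        counts.setdefault(key, [0, 0])[int(row[child])] += 1
    loglik = sum(c * math.log(c / sum(pair))
                 for pair in counts.values() for c in pair if c > 0)
    n_params = 2 ** len(parents)     # one free parameter per parent configuration
    return loglik - 0.5 * math.log(n) * n_params

def score(data, graph):
    """Score(G | D) = sum over variables of the family scores.

    graph maps each variable index to a tuple of parent indices.
    """
    return sum(family_score(data, i, pa) for i, pa in graph.items())

# Synthetic data: X1 copies X0 but is flipped 10% of the time.
rng = np.random.default_rng(0)
n = 500
x0 = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
x1 = np.where(flip, 1 - x0, x0)
data = np.column_stack([x0, x1])

empty = {0: (), 1: ()}
with_edge = {0: (), 1: (0,)}
print(score(data, empty), score(data, with_edge))
```

A hill-climbing step would try each single edge addition, removal, or reversal, keep the graph acyclic, and accept the change with the largest score gain.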

Page 26: CS 5263 Bioinformatics

Partial Models (Features)

• Markov relations
– A is in B’s Markov blanket iff A and B share an edge (A – B) or are both parents of a common child (A → C ← B)
– A and B in some joint biological interaction

• Order relations
– A → … → B: A is a cause of B

• Model averaging
– Learn many models; common sub-graphs will be more likely to be true
– Confidence measure: # of times a sub-graph appeared
– Method: bootstrap
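
The bootstrap confidence measure can be sketched as follows. To keep the example short, the network learner is replaced by a stand-in that draws an undirected edge whenever the absolute Pearson correlation exceeds a threshold; this is not a Bayesian network learner, and the data, threshold, and function names are assumptions made for illustration.

```python
import numpy as np

def learn_edges(data, threshold=0.5):
    """Stand-in learner: undirected edge (i, j) when |Pearson r| > threshold."""
    r = np.corrcoef(data, rowvar=False)
    n_var = data.shape[1]
    return {(i, j) for i in range(n_var) for j in range(i + 1, n_var)
            if abs(r[i, j]) > threshold}

def bootstrap_confidence(data, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each feature (edge) appears."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    counts = {}
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]   # resample rows with replacement
        for edge in learn_edges(sample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

# Synthetic data: variable 1 depends on variable 0; variable 2 is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=76)
data = np.column_stack([a, a + 0.3 * rng.normal(size=76), rng.normal(size=76)])

conf = bootstrap_confidence(data)
print(conf)
```

Features that appear in almost every replicate, such as the edge (0, 1) here, receive confidence close to 1.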

Page 27: CS 5263 Bioinformatics

Experimental Results

• Real biological data set: Yeast cell cycle data

• 800 genes, 76 experiments, 200-fold bootstrap

• Test for significance and robustness
– More high-scoring features in real data than in randomized data
– Order relations are more robust than Markov relations with respect to local probability models.

[Figure: Markov Relations]

Friedman et al, J Comput Biol. 2000;7:601-20

Page 28: CS 5263 Bioinformatics

Transcriptional regulatory network

• Who regulates whom?
• When?
• Where?
• How?

[Figure: TFs A and B bind the promoter of a gene and recruit RNA-Pol; four example genes with different promoter logic]

g1: A and not B

g2: A and B

g3: A or B

g4: Not (A and B)

PNAS 2003;100(9):5136-41

Page 29: CS 5263 Bioinformatics

Data-driven vs. model-driven methods

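[Figure: gene × condition expression matrix analyzed in two ways]

• Data-driven: clustering → MF → post-processing → biological insights. Descriptive.

• Model-driven: learning a model → biological insights. Explanatory, predictive.

Model: “A description of a process that could have generated the observed data”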

Page 30: CS 5263 Bioinformatics

Data-driven approaches

• Assumption
– Co-expressed genes are likely co-regulated: not necessarily true

• Limitations:
– Clustering is subjective
– Statistically over-represented but non-functional “junk” motifs
– Hard to find combinatorial motifs

• Pipeline: clustering (hierarchical, K-means, …), then motif finding (MEME, Gibbs, AlignACE, …)

[Figure: expression matrix, Genes × Experiments]

Page 31: CS 5263 Bioinformatics

Model-based approaches

• Intuition: find motifs that are not only statistically over-represented, but are also associated with the expression patterns
– E.g., a motif appears in many up-regulated genes but very few other genes => real motif?

• Model: gene expression = f (TF binding motifs, TF activities)

• Goal: find the function that
– Can explain the observed data and predict future data
– Captures true relationships among motifs, TFs and expression of genes

Page 32: CS 5263 Bioinformatics

Transcription modeling

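[Figure: table with one row per gene (g1 to g8) and columns Gene, Promoters, Motifs, and Expression; the motifs are the variables and the expression values are the labels to be predicted (?)]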

e = f (m1, m2, m3, m4)

Assume that gene expression levels under a certain condition are a function of some TF binding motifs on their promoters.

Page 33: CS 5263 Bioinformatics

Different modeling approaches

• Many different models, each with its own limitations

• Classification models
– Decision tree, support vector machine (SVM), naïve Bayes, …

• Regression models
– Linear regression, regression tree, …

• Probabilistic models
– Bayesian networks, probabilistic Boolean networks, …

Page 34: CS 5263 Bioinformatics

Decision tree

[Figure: table of genes g1 to g8 with motif columns m1, m2, m3, m4 and expression e; decision tree that tests m1 at the root, then m2 and m4 (yes/no branches), with leaves A, B, C, D holding genes {3, 6}, {4}, {1, 2, 5}, {7, 8}]

e = f (m1, m2, m3, m4)

• Tree structure is learned from data
– Only relevant variables (motifs) are used
– Many possible trees, the smallest one is preferred

• Advantages:
– Easy to interpret
– Can represent complex logic relationships
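
A small decision-tree learner over binary motif indicators can be sketched with an information-gain split rule (ID3-style). The motif table and labels below are invented for illustration and are not the data from the figure; `build_tree` and `predict` are hypothetical names.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """rows: list of dicts motif -> 0/1. Returns a nested dict or a leaf label."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def gain(f):
        remainder = 0.0
        for v in (0, 1):
            sub = [l for r, l in zip(rows, labels) if r[f] == v]
            if sub:
                remainder += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - remainder

    best = max(features, key=gain)
    if gain(best) <= 0:              # no motif separates the labels any further
        return majority
    node = {"motif": best}
    rest = [f for f in features if f != best]
    for v, branch in ((1, "yes"), (0, "no")):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node[branch] = build_tree([rows[i] for i in idx],
                                  [labels[i] for i in idx], rest)
    return node

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["motif"]] else tree["no"]
    return tree

# Hypothetical data: 8 genes, 4 motifs; a gene is "up" iff it has m1 and m2.
motifs = ["m1", "m2", "m3", "m4"]
table = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 0),
         (1, 0, 0, 1), (0, 1, 0, 1), (0, 0, 1, 0), (0, 1, 1, 1)]
rows = [dict(zip(motifs, t)) for t in table]
labels = ["up" if r["m1"] and r["m2"] else "down" for r in rows]

tree = build_tree(rows, labels, motifs)
print(tree)
```

On this toy data the learned tree tests only m1 and m2, which illustrates the point that irrelevant motifs (here m3 and m4) are left out.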

Page 35: CS 5263 Bioinformatics

A real example: transcriptional regulation of yeast stress response

• 52 genes up-regulated in heat-shock (positive)
• 156 random unresponsive genes (negative)
• 356 known motifs

• Small tree: only used 4 motifs

• All 4 motifs are well-known to be stress-related

• RRPE-PAC combination well-known

[Figure: learned decision tree with Yes/No tests on motifs RRPE, PAC, FHL1, and RAP1; leaf counts: 11 (+) / 1 (-); 4 (-) / 3 (+); 23 (+); 151 (-) / 10 (+); 5 (+)]

Page 36: CS 5263 Bioinformatics

Application to yeast cell-cycle genes

[Figure panels: model network in Science, 2002;298(5594):799-804; network by our method (Ruan et al., BMC Genomics, 2009)]

Page 37: CS 5263 Bioinformatics

Regression tree

• Similar to decision tree

• Difference: each terminal node predicts a range of real values instead of a label

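[Figure: table of genes g1 to g8 with motif columns m1, m2, m3, m4 and expression e; regression tree that tests m1, m2, and m4 (yes/no branches), with each leaf labeled by a range of e, e.g. 0 < e < 2]

e = f (m1, m2, m3, m4)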
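
A single split step of a regression tree can be sketched as choosing the motif whose yes/no partition minimizes the summed squared error of the expression values. The motif matrix and expression values below are invented for illustration, and `best_split` is a hypothetical name; a full tree would apply the step recursively to each side.

```python
import numpy as np

def best_split(motif_matrix, e):
    """Return (motif index, SSE, mean of yes group, mean of no group) for the best split."""
    best = None
    for j in range(motif_matrix.shape[1]):
        yes = e[motif_matrix[:, j] == 1]
        no = e[motif_matrix[:, j] == 0]
        if len(yes) == 0 or len(no) == 0:
            continue                  # motif does not split the genes
        sse = ((yes - yes.mean()) ** 2).sum() + ((no - no.mean()) ** 2).sum()
        if best is None or sse < best[1]:
            best = (j, sse, yes.mean(), no.mean())
    return best

# Hypothetical data: rows are genes g1..g8, columns are motifs m1..m4.
motif_matrix = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0],
                         [1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0], [0, 1, 1, 1]])
e = np.array([2.1, 1.9, 2.0, 2.2, 1.8, -1.0, -1.1, -0.9])

j, sse, mean_yes, mean_no = best_split(motif_matrix, e)
print(f"split on m{j + 1}: yes-leaf mean {mean_yes:.2f}, no-leaf mean {mean_no:.2f}")
```

Each leaf then predicts a real value (or a range around it) rather than a class label.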

Page 38: CS 5263 Bioinformatics

Multivariate regression tree

• Multivariate labels: use multiple experiments simultaneously
• Use motifs to classify genes into co-expressed groups
• Does not need clustering in advance

[Figure: genes g1 to g8 with motif columns m1 to m4 and expression in experiments e1 to e5; tree testing m1, m2, and m4 (yes/no branches), with leaves holding genes {3, 6, 8}, {1, 2, 5}, {4}, {7}]

Phuong, T., et al., Bioinformatics, 2004

Page 39: CS 5263 Bioinformatics

Modeling with TF activities

• Gene expression = f (binding motifs, TF activities)

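[Figure: expression of tf1 to tf4 and of target gene g across experiments e1 to e5; the matrix is rotated so that experiments become the rows and the TFs the variables; example tree that splits on whether tf1 > 0, with leaves g ≤ 0 and g > 0]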

g = f (tf1, tf2, tf3, tf4)

Soinov et al., Genome Biol, 2003

Page 40: CS 5263 Bioinformatics

A Decision Tree Model

Segal et al. Nat Genet. 2003,34(2):166-76.

[Figure: a decision tree model of gene expressions over a gene × experiment matrix]

Page 41: CS 5263 Bioinformatics

Algorithm BDTree

• Gene expression = f (binding motifs, TF activities)

• Ruan & Zhang, Bioinformatics 2006

• Basic idea:
– Iteratively partition an expression matrix by splitting genes or experiments
– Split of genes is according to motif scores
– Split of conditions is according to TF expression levels
– The algorithm decides the best motifs or TFs to use

Page 42: CS 5263 Bioinformatics

Transcriptional regulation of yeast stress response

• 173 experiments under ~20 stress conditions

• 1411 differentially expressed genes

• ~1200 putative binding motifs
– Combination of ChIP-chip data, PWMs, and over-represented k-mers (k = 5, 6, 7)

• 466 TFs

Page 43: CS 5263 Bioinformatics

[Figure: expression matrix partitioned into blocks; axes: Genes × Experiments]

Genes with motifs FHL1 but no RRPE are down-regulated when Ppt1 is down-regulated and Yfl052w is up-regulated

Genes with motifs RRPE & PAC are down-regulated when TFs Tpk1 & Kin82 are up-regulated

Page 44: CS 5263 Bioinformatics

Biological validation

• Most motifs and TFs selected by the tree are well-known to be stress-related
– E.g., motifs RRPE, PAC, FHL1, TFs Tpk1 and Ppt1

• 42 / 50 blocks are significantly enriched with some Gene Ontology (GO) functional terms

• 45 / 50 blocks are significantly enriched with some experimental conditions

Page 45: CS 5263 Bioinformatics

RRPE & PAC, ribosome biogenesis (60/94, p < e-65)

FHL1, protein biosynthesis (98/105, p<e-87)

STRE (agggg), carbohydrate metabolism (p < e-20)

RRPE only, ribosome biogenesis (28/99, p < e-18)

Nitrogen metabolism

PAC

Page 46: CS 5263 Bioinformatics

Relationship between methods

• A, C: from promoter to expression
– A: single cond
– C: multi conds

• B, D: from expression to expression
– B: single gene
– D: multi genes

[Figure: matrix of genes g1 to g8 with motifs m1 to m4, TFs t1 to t4, and conditions c1 to c5, with regions labeled A, B, C, D]