CS 5263 Bioinformatics
Reverse-engineering Gene Regulatory Networks
Genes and Proteins
[Figure: Gene (DNA) → mRNA by transcription (also called expression); mRNA → protein by translation; protein (de)activation. Control points: transcriptional regulation, mRNA degradation, translational regulation, post-translational regulation.]
Gene Regulatory Networks
• Functioning of cell controlled by interactions between genes and proteins
• Genetic regulatory network: genes, proteins, and their mutual regulatory interactions
[Figure: example network of gene 1, gene 2, and gene 3, with one activator edge and two repressor edges]
Reverse-engineering GRNs
• GRNs are large, complex, and dynamic
• Reconstruct the network from observed gene expression behaviors
– Experimental methods focus on a few genes only
– Computer-assisted analysis: large scale
• Studied since the 1960s
– Mostly theoretical work
• Attracting much attention since the invention of microarray technology
• Emerging large-scale assay techniques are making it even more feasible (ChIP-chip, ChIP-seq, etc.)
Problem Statement
• Assumption: expression value of a gene depends on the expression values of a set of other genes
• Given: a set of gene expression values under different conditions
• Goal: a function for each gene that predicts its expression value from the expression of other genes
– Probabilistically: Bayesian network
– Boolean functions: Boolean network
– Linear functions: linear model
– Other possibilities such as decision trees, SVMs
Characteristics
• Gene expression data is often noisy, with missing values
• Only measures mRNA level
– Many genes are regulated not only at the transcriptional level
• # genes >> # experiments: an underdetermined problem!
• Correlation ≠ causality
• Good news: network structure is sparse (scale-free)
Methods for GRN inference
• Directed and undirected graphs
– E.g. KEGG, EcoCyc
• Boolean networks
– Kauffman (1969), Liang et al (1999), Shmulevich et al (2002), Lähdesmäki et al (2003)
• Bayesian networks
– Friedman et al (2000), Murphy and Mian (1999), Hartemink et al (2002)
• Linear/non-linear regression models
– D’Haeseleer et al (1999), Yeung et al (2002)
• Differential equations
– Chen, He & Church (1999)
• Neural networks
– Weaver, Workman and Stormo (1999)
Boolean Networks
• Genes are either on or off (expressed or not expressed)
• State of gene Xi at time t is a Boolean function of the states of some other genes at time t-1
[Figure: wiring diagram from X, Y, Z at time t-1 to X’, Y’, Z’ at time t]
X Y Z X’ Y’ Z’
0 0 0 0 0 0
0 0 1 0 0 0
0 1 0 1 0 1
0 1 1 0 0 1
1 0 0 0 1 0
1 0 1 0 1 0
1 1 0 1 1 1
1 1 1 0 1 1
X’ = Y and (not Z)
Y’ = X
Z’ = Y
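The update rules above can be checked with a short simulation. This is a minimal sketch, assuming synchronous updates and states encoded as (X, Y, Z) tuples of 0/1; the names `step` and `trajectory` are illustrative and not from the slides.

```python
from itertools import product

def step(state):
    """One synchronous update: X' = Y and (not Z), Y' = X, Z' = Y."""
    x, y, z = state
    return (int(y and not z), x, y)

# Regenerate the truth table: current state -> next state.
table = {s: step(s) for s in product((0, 1), repeat=3)}
for s, nxt in table.items():
    print(*s, "->", *nxt)

def trajectory(state, n_steps):
    """List of states visited when the network is run for n_steps updates."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states

print(trajectory((1, 1, 0), 4))
```

Running it reproduces the eight rows of the table and shows how a single starting state evolves over time.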
Learning Boolean Networks for Gene Expression
• Assumptions:
– Deterministic (wiring does not change)
– Synchronized update
– All Boolean functions are probable
• Data needed: 2^N for N genes. (In comparison, N needed for linear models)
• General technique: limit the # of inputs per gene (k). Data required is reduced to 2^k log(N).
Learning Boolean Networks
• Consistency Problem
– Given: examples S = {<In, Out>}, where In ∈ {0,1}^k and Out ∈ {0,1}
– Goal: learn a Boolean function f such that f(In) = Out for every <In, Out> ∈ S
– Note:
  • Given the same input, the output is unique.
  • For k input variables, there are at most 2^k distinct input configurations.
– Example (each input is also shown as a decimal index):
<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
  1,1     5,1     6,1     2,0     3,0     5,0
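Learning Boolean Networks
<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
Truth table for inputs 000 through 111: ? 1 0 0 ? * 1 ?
• No clash (*) -> consistent. Here input 101 appears with both outputs, so the examples clash.
• Question marks -> undetermined entries
• Checking one candidate input set: O(Mk), where M is # of experiments
• For N genes, choosing k inputs from N: N * C(N, k) * O(Mk) in total
• Best-fit problem: find a function f with the minimum # of errors
• Limited error-size problem: find all functions with error size within a given maximum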
Lähdesmäki et al, Machine Learning 2003;52: 147-167.
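The consistency and best-fit problems can be sketched in a few lines. This is an illustrative brute-force version, not the algorithm of Lähdesmäki et al.; the function names `fit_function` and `best_inputs` are hypothetical. For each candidate input set it fills the truth table by majority vote, and the number of minority examples is the error count (0 errors means the examples are consistent).

```python
from itertools import combinations, product

def fit_function(examples):
    """examples: list of (input_tuple, output_bit).
    Returns (truth_table, n_errors), where n_errors is the minimum number
    of examples that any Boolean function must misclassify."""
    counts = {}
    for inp, out in examples:
        counts.setdefault(inp, [0, 0])[out] += 1
    table = {inp: int(c[1] > c[0]) for inp, c in counts.items()}
    errors = sum(min(c) for c in counts.values())
    return table, errors

def best_inputs(prev_states, next_states, target, k):
    """Try every k-subset of genes as inputs for the target gene."""
    n = len(prev_states[0])
    best = None
    for subset in combinations(range(n), k):
        examples = [(tuple(s[i] for i in subset), t[target])
                    for s, t in zip(prev_states, next_states)]
        table, errors = fit_function(examples)
        if best is None or errors < best[2]:
            best = (subset, table, errors)
    return best

# The six examples from the slide: input 101 clashes, so one error is unavoidable.
slide_examples = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 0), 1),
                  ((0, 1, 0), 0), ((0, 1, 1), 0), ((1, 0, 1), 0)]
print(fit_function(slide_examples)[1])

# Recover the inputs of X from the three-gene network X' = Y and (not Z).
prev_states = list(product((0, 1), repeat=3))
next_states = [(int(y and not z), x, y) for x, y, z in prev_states]
subset, table, errors = best_inputs(prev_states, next_states, target=0, k=2)
print(subset, errors)
```

The outer loop over genes (the factor N in the complexity above) is left out; calling `best_inputs` once per target gene would add it.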
State space and attractor basins
What are some biological interpretations of basins and attractors?
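For a concrete picture, the attractors and basins of the three-gene example network (X’ = Y and (not Z), Y’ = X, Z’ = Y) can be enumerated directly. A minimal sketch, assuming synchronous updates; `find_attractor` is an illustrative name.

```python
from itertools import product

def step(state):
    x, y, z = state
    return (int(y and not z), x, y)

def find_attractor(state):
    """Follow the trajectory until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])

# Group all 2^3 states by the attractor they eventually fall into.
basins = {}
for s in product((0, 1), repeat=3):
    basins.setdefault(find_attractor(s), []).append(s)

for attractor, basin in basins.items():
    print(sorted(attractor), "<- basin:", basin)
```

This network has two attractors: the fixed point 000 (basin of five states) and the two-state cycle 010 ↔ 101 (basin of three states).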
Linear Models
• Expression level of gene at time t depends linearly on the expression levels of some genes at time t-1
[Figure: genes X1, X2, X3 at time t-1 connected to X1, X2, X3 at time t by weighted edges W_ij (e.g. W11, W21, W31, W32, W33)]
• Basic model: X_i(t) = Σ_j W_ij X_j(t-1)
• X_i’(t) = Σ_j A_ij X_j(t), where X_i(t) can be measured and X_i’(t) can be estimated from X_i(t)
• In matrix form: X’_{N×M} = A_{N×N} · X_{N×M}, where M is the number of time points and N is the number of genes
Linear Models (cont’d)
• X’_{N×M} = A_{N×N} · X_{N×M}
• A_{N×N}: connectivity matrix; A_ij describes the type and strength of the influence of the jth gene on the ith gene.
• To solve for A, we need to solve MN linear equations in N^2 unknowns
• In general N^2 >> MN, so the system is under-determined => infinitely many solutions
Get Around The Curse of Dimensionality
• Non-linear interpolation to increase # of time points
• Cluster genes to reduce # of genes
• Singular Value Decomposition (SVD)
– A = A0 + C_{N×N} · V^T_{N×N}, where c_ij = 0 if j > M
– Take A0 as a solution; it is guaranteed to have the smallest sum of squares.
• Robust regression
– Minimize # of edges in the network
– Biological networks are sparse (scale-free)
[Figure: structure of C_{N×N}, with a block of entries c_ij = 0; axis ticks of an accompanying plot omitted]
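The SVD step can be illustrated numerically. This is a sketch under stated assumptions: the data are synthetic and noise-free, the "true" matrix `A_true` is a made-up sparse example, and NumPy's SVD-based pseudoinverse (`np.linalg.pinv`) stands in for the explicit decomposition. It shows that A0 fits the data exactly and has the smallest sum of squares among all solutions, yet is generally not the sparse matrix that generated the data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 8  # N genes, M time points (M < N, so the problem is under-determined)

# Hypothetical sparse connectivity matrix: 3 nonzero inputs per gene.
A_true = np.zeros((N, N))
for i in range(N):
    cols = rng.choice(N, size=3, replace=False)
    A_true[i, cols] = rng.normal(size=3)

X = rng.normal(size=(N, M))   # expression levels, genes x time points
X_dot = A_true @ X            # X' = A X (no noise in this sketch)

# Particular solution A0 from the pseudoinverse, which NumPy computes via SVD.
A0 = X_dot @ np.linalg.pinv(X)

print("A0 reproduces the data:", np.allclose(A0 @ X, X_dot))
print("sum of squares, A0 vs A_true:", np.sum(A0**2), np.sum(A_true**2))
print("nonzero entries, A0 vs A_true:",
      np.count_nonzero(np.abs(A0) > 1e-8), np.count_nonzero(A_true))
```

A0 is typically dense, which is the motivation for the robust regression step that searches the remaining solutions for a sparse one.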
Robust Regression
• A = A0 + C_{N×N} · V^T_{N×N}
• Minimize # of non-zero entries in A by selecting C
– Set A = 0, then C · V^T = -A0; solve for C.
– Over-determined (N^2 equations, MN free variables).
• Robust regression
– Fit a hyper-plane to a set of points by passing through as many points as possible
Simulation Experiments
[Figure panels: SVD + Robust Regression | SVD alone]
Yeung et al, PNAS. 2002;99:6163-8.
Simulation Experiments (cont’d)
[Figure panels: Linear System | Nonlinear System close to steady state]
• Does not work for a nonlinear system that is not close to steady state
• Scale-free property does not hold on small networks
Bayesian Networks
• A DAG G(V, E), where
– Vertex: a random variable
– Edge: conditional distribution for a variable, given its parents in G
• Markov assumption: ∀i, I(X_i, non-descendants(X_i) | Pa_G(X_i)), e.g. I(X3, X4 | X2), I(X1, X5 | X3)
[Figure: DAG with edges X1 → X3, X2 → X3, X2 → X4, X3 → X5]
• Chain rule: P(X1, X2, …, Xn) = Π_i P(X_i | Pa_G(X_i)), i = 1..n
• P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)
• Learning: argmax_G P(G | D), where P(G | D) = P(D | G) * P(G) / C
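The factorization for the five-variable example can be verified by brute force. In this sketch all variables are binary and the conditional probability values are made up for illustration; only the graph structure (X1, X2 → X3; X2 → X4; X3 → X5) comes from the slide.

```python
from itertools import product

# Hypothetical CPTs: each entry is P(variable = 1 | parent values).
p_x1 = 0.3
p_x2 = 0.6
p_x3 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}  # keyed by (x1, x2)
p_x4 = {0: 0.2, 1: 0.8}    # keyed by x2
p_x5 = {0: 0.4, 1: 0.75}   # keyed by x3

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """Chain rule: P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)."""
    return (bern(p_x1, x1) * bern(p_x2, x2) * bern(p_x3[(x1, x2)], x3)
            * bern(p_x4[x2], x4) * bern(p_x5[x3], x5))

def prob(**fixed):
    """Marginal probability of the given assignments, by summing the joint."""
    names = ("x1", "x2", "x3", "x4", "x5")
    total = 0.0
    for values in product((0, 1), repeat=5):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(*values)
    return total

# Check I(X3, X4 | X2): P(x3, x4 | x2) should equal P(x3 | x2) P(x4 | x2).
lhs = prob(x2=1, x3=1, x4=1) / prob(x2=1)
rhs = (prob(x2=1, x3=1) / prob(x2=1)) * (prob(x2=1, x4=1) / prob(x2=1))
print(prob(), lhs, rhs)
```

The joint sums to 1, and the two sides of the conditional independence statement agree up to floating-point error.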
Bayesian Networks (Cont’d)
• Equivalence classes of Bayesian networks
– Same topology, different edge directions
– Cannot be distinguished from observation
• Causality
– A Bayesian network does not directly imply causality
– Can be inferred from observation with certain assumptions:
  • no hidden common cause
  • ……
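[Figure: networks over A, B, and C with the same topology but different edge directions, encoding I(A, B | C); a network in which C is a hidden variable; the corresponding PDAG]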
Bayesian Networks for Gene Expression
• Deals with noisy data well, reflects stochastic nature of gene expression
• Indication of causality
• Practical issues:
– Learning is NP-hard
– Over-fitting
– Equivalence classes of graphs
• Solution:
– Heuristic search, sparse candidate
– Model averaging
– Learning partial models
[Figure: example network over Gene A, Gene B, Gene C, Gene D, and Gene E]
Other variables can be added, such as promoters sequences, experiment conditions and time.
P(D | E): multinomial or linear
Learning Bayesian Nets
• Find G to maximize Score(G | D), where
– Score(G | D) = Σ_i Score(X_i, Pa_G(X_i) | D)
• Hill-climbing
– Edge addition, edge removal, edge reversal
• Divide-and-conquer
– Solve for sub-graphs
• Sparse candidate algorithm
– Limit the number of candidate parents for each variable (biological implication: sparse graph)
– Iteratively modify the candidate set
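The decomposable score is what makes local search practical: adding, removing, or reversing an edge only changes the terms of the families it touches. The sketch below assumes a BIC-style score for binary variables (the slides do not name a scoring function) and uses a tiny made-up data set; `family_score` and `graph_score` are illustrative names.

```python
import math
from collections import Counter

def family_score(data, child, parents):
    """Score(X_i, Pa(X_i) | D): maximum log-likelihood of the child given its
    parents, minus a BIC penalty (an assumed choice of score)."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    n_params = 2 ** len(parents)  # one free parameter per parent configuration
    return loglik - 0.5 * math.log(n) * n_params

def graph_score(data, parents_of):
    """Score(G | D) = sum over variables of their family scores."""
    return sum(family_score(data, child, pa) for child, pa in parents_of.items())

# Toy data with columns (A, B, C): B copies A, C is independent of both.
data = [(a, a, c) for a in (0, 1) for c in (0, 1)] * 10

with_edge = {0: (), 1: (0,), 2: ()}      # A -> B
empty = {0: (), 1: (), 2: ()}            # no edges
spurious = {0: (), 1: (0,), 2: (0,)}     # A -> B and A -> C

for name, g in [("A->B", with_edge), ("empty", empty), ("A->B, A->C", spurious)]:
    print(name, round(graph_score(data, g), 2))
```

The graph with only A → B scores highest: the real dependency raises the likelihood by more than its penalty, while the spurious edge A → C adds penalty without improving the fit.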
Partial Models (Features)
• Markov relations
– A is in B’s Markov blanket iff A and B share an edge (A – B) or are both parents of a common child C (A → C ← B)
– A and B are in some joint biological interaction
• Order relations
– A → … → B: A is a cause of B
• Model averaging
– Learn many models; common sub-graphs are more likely to be true
– Confidence measure: # of times a sub-graph appeared
– Method: bootstrap
Experimental Results
• Real biological data set: yeast cell cycle data
• 800 genes, 76 experiments, 200-fold bootstrap
• Test for significance and robustness
– More high-scoring features in real data than in randomized data
– Order relations are more robust than Markov relations with respect to local probability models.
Markov Relations
Friedman et al, J Comput Biol. 2000;7:601-20
Transcriptional regulatory network
• Who regulates whom?
• When?
• Where?
• How?
[Figure: transcription factors (TFs) A and B bind a gene’s promoter and, together with RNA-Pol, control genes g1 to g4 by different logic]
– g1: A and not B
– g2: A and B
– g3: A or B
– g4: not (A and B)
PNAS 2003;100(9):5136-41
Data-driven vs. model-driven methods
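[Figure: data-driven pipeline: gene × condition matrix → clustering → MF (motif finding) → post-processing → biological insights (descriptive). Model-driven pipeline: gene × condition matrix → learning → model → biological insights (explanatory, predictive).]
Model: “A description of a process that could have generated the observed data”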
Data-driven approaches
• Assumption
– Co-expressed genes are likely co-regulated: not necessarily true
• Limitations:
– Clustering is subjective
– Statistically over-represented but non-functional “junk” motifs
– Hard to find combinatorial motifs
• Pipeline
– Clustering: hierarchical, K-means, …
– Motif finding: MEME, Gibbs, AlignACE, …
[Figure: expression matrix, genes × experiments]
Model-based approaches
• Intuition: find motifs that are not only statistically over-represented, but are also associated with the expression patterns
– E.g., a motif appears in many up-regulated genes but very few other genes => real motif?
• Model: gene expression = f (TF binding motifs, TF activities)
• Goal: find the function that
– Can explain the observed data and predict future data
– Captures true relationships among motifs, TFs and expression of genes
Transcription modeling
[Figure: for genes g1 to g8, the motifs found on their promoters (variables) are used to predict expression (gene labels, marked “?”)]
e = f (m1, m2, m3, m4)
Assume that gene expression levels under a certain condition are a function of some TF binding motifs on their promoters.
Different modeling approaches
• Many different models, each with its own limitations
• Classification models
– Decision tree, support vector machine (SVM), naïve Bayes, …
• Regression models
– Linear regression, regression tree, …
• Probabilistic models
– Bayesian networks, probabilistic Boolean networks, …
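Decision tree
[Figure: decision tree that tests motifs m1, m2, and m4 with yes/no branches and assigns genes g1 to g8 to leaves A, B, C, D (gene groups as extracted: 3, 6 | 4 | 1, 2, 5 | 7, 8); each leaf gives the predicted expression label e]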
• Tree structure is learned from data
– Only relevant variables (motifs) are used
– Many possible trees; the smallest one is preferred
• Advantages:
– Easy to interpret
– Can represent complex logic relationships
e = f (m1, m2, m3, m4)
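How a tree picks out only the relevant motifs can be seen in a small example. This is a minimal ID3-style sketch with an entropy criterion, which is an assumed choice (the slides do not specify the learning algorithm). The toy data are made up: a gene is labeled "up" exactly when motifs m1 and m2 are both present, while m3 and m4 are irrelevant.

```python
import math
from collections import Counter
from itertools import product

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, motif):
    """Weighted entropy of the labels after splitting on motif presence."""
    yes = [l for r, l in zip(rows, labels) if r[motif]]
    no = [l for r, l in zip(rows, labels) if not r[motif]]
    if not yes or not no:
        return float("inf")  # this motif does not split the genes at all
    return (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)

def build_tree(rows, labels, motifs):
    """Returns a label (leaf) or a tuple (motif, yes_subtree, no_subtree)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not motifs:
        return majority
    best = min(motifs, key=lambda m: split_entropy(rows, labels, m))
    if split_entropy(rows, labels, best) == float("inf"):
        return majority
    rest = [m for m in motifs if m != best]
    yes = [(r, l) for r, l in zip(rows, labels) if r[best]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[best]]
    return (best,
            build_tree([r for r, _ in yes], [l for _, l in yes], rest),
            build_tree([r for r, _ in no], [l for _, l in no], rest))

def predict(tree, row):
    while isinstance(tree, tuple):
        motif, yes, no = tree
        tree = yes if row[motif] else no
    return tree

motifs = ["m1", "m2", "m3", "m4"]
rows = [dict(zip(motifs, bits)) for bits in product((0, 1), repeat=4)]
labels = ["up" if r["m1"] and r["m2"] else "down" for r in rows]

tree = build_tree(rows, labels, motifs)
print(tree)
```

The learned tree tests only m1 and m2, which matches the point above that only relevant motifs end up in the tree.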
A real example: transcriptional regulation of yeast stress response
• 52 genes up-regulated in heat shock (positive)
• 156 random unresponsive genes (negative)
• 356 known motifs
Small tree: only used 4 motifs
All 4 motifs are well-known to be stress-related
RRPE-PAC combination well-known
[Figure: learned tree with yes/no tests on motifs RRPE, PAC, FHL1, and RAP1; leaf counts as extracted: 11 (+) 1 (-); 4 (-) 3 (+); 23 (+); 151 (-) 10 (+); 5 (+)]
Application to yeast cell-cycle genes
[Figure panels: model network from Science, 2002;298(5594):799-804 | network by our method, Ruan et al., BMC Genomics, 2009]
Regression tree
• Similar to decision tree
• Difference: each terminal node predicts a range of real values instead of a label
[Figure: regression tree that tests motifs m1, m2, and m4 with yes/no branches; each leaf predicts a range of expression values e for the genes g1 to g8 assigned to it]
e = f (m1, m2, m3, m4)
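The only change from a decision tree is the split criterion and what a leaf stores. A minimal sketch of one split follows, assuming the common choice of minimizing the within-leaf sum of squared errors and predicting the leaf mean (the slide describes leaves as ranges of values); the eight genes and their expression values are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, expr, motifs):
    """Return (motif, total_sse) for the motif whose presence/absence split
    gives the smallest total within-leaf squared error."""
    best = None
    for m in motifs:
        yes = [e for r, e in zip(rows, expr) if r[m]]
        no = [e for r, e in zip(rows, expr) if not r[m]]
        if not yes or not no:
            continue  # motif present in all or none of the genes
        total = sse(yes) + sse(no)
        if best is None or total < best[1]:
            best = (m, total)
    return best

motifs = ["m1", "m2", "m3", "m4"]
presence = [(1, 0, 1, 0), (1, 1, 0, 0), (1, 0, 0, 1), (1, 1, 1, 1),
            (0, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 0), (0, 0, 0, 1)]
rows = [dict(zip(motifs, p)) for p in presence]      # genes g1..g8
expr = [2.0, 2.4, 1.8, 2.2, -0.5, -0.3, -0.6, -0.2]  # hypothetical log ratios

motif, total = best_split(rows, expr, motifs)
leaf_yes = [e for r, e in zip(rows, expr) if r[motif]]
leaf_no = [e for r, e in zip(rows, expr) if not r[motif]]
print(motif, round(total, 3))
print("predicted e with motif:", sum(leaf_yes) / len(leaf_yes))
print("predicted e without motif:", sum(leaf_no) / len(leaf_no))
```

A full regression tree would apply `best_split` recursively to each leaf until the error reduction becomes too small.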
Multivariate regression tree
• Multivariate labels: use multiple experiments simultaneously
• Use motifs to classify genes into co-expressed groups
• Does not need clustering in advance
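[Figure: multivariate regression tree that tests motifs m1, m2, and m4 and groups genes g1 to g8 by their expression profiles across experiments e1 to e5 (leaf groups as extracted: 3, 6, 8 | 1, 2, 5 | 4 | 7)]
Phuong, T., et al., Bioinformatics, 2004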
Modeling with TF activities
• Gene expression = f (binding motifs, TF activities)
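[Figure: the expression matrix is rotated so that the expression of tf1 to tf4 across experiments e1 to e5 becomes the input for predicting a target gene g; the example tree tests the level of tf1 to predict the level of g]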
g = f (tf1, tf2, tf3, tf4)
Soinov et al., Genome Biol, 2003
A Decision Tree Model
Segal et al. Nat Genet. 2003,34(2):166-76.
[Figure: a decision tree model of gene expression (gene × experiment matrix)]
Algorithm BDTree
• Gene expression = f (binding motifs, TF activities)
• Ruan & Zhang, Bioinformatics 2006
• Basic idea:
– Iteratively partition an expression matrix by splitting genes or experiments
– Split of genes is according to motif scores
– Split of conditions is according to TF expression levels
– The algorithm decides the best motifs or TFs to use
Transcriptional regulation of yeast stress response
• 173 experiments under ~20 stress conditions
• 1411 differentially expressed genes
• ~1200 putative binding motifs
– Combination of ChIP-chip data, PWMs, and over-represented k-mers (k = 5, 6, 7)
• 466 TFs
[Figure: partitioned expression matrix, genes × experiments]
…
Genes with motifs FHL1 but no RRPE are down-regulated when Ppt1 is down-regulated and Yfl052w is up-regulated
Genes with motifs RRPE & PAC are down-regulated when TFs Tpk1 & Kin82 are up-regulated
…
Biological validation
• Most motifs and TFs selected by the tree are well-known to be stress-related
– E.g., motifs RRPE, PAC, FHL1, TFs Tpk1 and Ppt1
• 42 / 50 blocks are significantly enriched with some Gene Ontology (GO) functional terms
• 45 / 50 blocks are significantly enriched with some experimental conditions
RRPE & PAC, ribosome biogenesis (60/94, p < e-65)
FHL1, protein biosynthesis (98/105, p<e-87)
STRE (agggg), carbohydrate metabolism (p < e-20)
RRPE only, ribosome biogenesis (28/99, p < e-18)
Nitrogen metabolism
PAC
Relationship between methods
• A, C: from promoter to expression
– A: single condition
– C: multiple conditions
• B, D: from expression to expression
– B: single gene
– D: multiple genes
[Figure: genes g1 to g8 against motifs m1 to m4, TFs t1 to t4, and conditions c1 to c5, with regions labeled A, B, C, D]