HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting
Regulatory Regions
Takuma [email protected]
Milton Taylor Laboratory
• Using microarrays and bioinformatics technologies to develop better treatments for HCV (Virahep-C project)
– The only known treatment for HCV is treatment with interferon-alpha (IFN-a), or more recently combination treatment with pegylated IFN-a and Ribavirin
– Interferons were discovered as proteins that inhibit virus replication, and are induced in mammalian cells in response to virus infection
PBMC Experiment
• PBMCs were isolated from a group of healthy individuals, and treated with IFN-a alone, or with Ribavirin.
• In the microarray experiment results, expression of a large number of genes was either up-regulated or down-regulated
– It was of interest to analyze the upstream regions of these genes for the presence of motifs (ISRE and GAS)
Goal of My Project
• Build a computer model that effectively searches for ISRE and GAS sequences in human genes
– ISRE and GAS both work as promoters
– ISRE drives the expression of most type I IFN stimulated genes (and some gamma)
– GAS drives the expression of type II IFN stimulated genes
– Genes that contain ISRE / GAS are expressed more with IFN than ones that do not
– Generalize to be able to search for any motif in the future
Type I IFN Signal Transduction
[Figure: type I IFN signal transduction pathway. Labels: IFN, TYK2, JAK1, STAT1, STAT2, P, HETERODIMER, p48, ISGF3 (IRF-9), ISRE, Transcription; compartments: CYTOPLASM, NUCLEUS; steps 1 and 2.]
The Situation
• We have a list of known motifs to refer to
– Numerous ISRE and GAS are known and published
• We have sets of sequences from microarray experiments that are
– likely to contain motifs … S1 (up-regulated genes)
– unlikely to contain motifs … S2 (down-regulated genes, and random genes)
• To detect motifs, build a model M(+) using the list of known motifs
– Occurrences of the model will be detected in both S1 and S2
How to Solve
• Still, it is difficult to accurately predict motifs
– Motifs are short, and also divergent
– So, occurrences in S1 and S2 are difficult to distinguish
• We overcome this problem with a discriminative model refinement approach
– We make two models:
M(+) … from known motifs
M(-) … from false motifs
– Iteratively refine the models, and separate the occurrences in S1 and S2
Methods Used
• HMMER
• Log-likelihood Method
• Both with iterative model refinement approach
HMMER
• Detects ISRE and GAS sequences (up-regulated genes, down-regulated genes and random genes)
1. Build a model (hmm consensus sequence) with hmmbuild from a list of known and functional motifs from journals
2. Parse the promoter region of each gene
3. Look for occurrences of the consensus within the promoter region of the three gene groups with hmmsearch
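The build and search steps above can be sketched as command lines assembled in Python. This is only an illustration: the file names (`isre.aln`, `isre.hmm`, `promoters.fa`) and the cutoff value are hypothetical, and the exact options needed for nucleotide models depend on the HMMER version, so nothing beyond the basic `hmmbuild hmmfile alignfile` and `hmmsearch -E cutoff hmmfile seqfile` forms is assumed.

```python
def hmmbuild_cmd(hmm_path, aln_path):
    """Argument list for building a profile HMM from an alignment file."""
    return ["hmmbuild", hmm_path, aln_path]


def hmmsearch_cmd(hmm_path, seq_path, e_cutoff):
    """Argument list for searching sequences with an E-value cutoff (-E)."""
    return ["hmmsearch", "-E", str(e_cutoff), hmm_path, seq_path]


# Hypothetical file names, for illustration only.
build = hmmbuild_cmd("isre.hmm", "isre.aln")
search = hmmsearch_cmd("isre.hmm", "promoters.fa", 0.1)
print(" ".join(build))
print(" ".join(search))
```

Each list could be passed to `subprocess.run` once HMMER is installed; keeping the commands as lists avoids shell quoting problems with file names.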
Alignment File (.aln)
• List of known motifs – as .aln file
• Example of ISRE:
IP10    AGGTTTCACTTTCCA
ISG15   CAGTTTCGGTTTCCC
Factor  CAGTTTCTGTTTCCT
Tla     TAGTTTCACTTTTTG
GBP     TACTTTCAGTTTCAT
ISG20   ATCTTTGACTTTGTC
           ***   ***
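The asterisk line under the alignment marks columns where every sequence carries the same base. A minimal sketch of how that line can be recomputed from the six ISRE sequences listed on this slide (the function name `conservation_line` is introduced here for illustration):

```python
# The six known ISRE motifs from the example alignment.
ISRE_ALIGNMENT = {
    "IP10": "AGGTTTCACTTTCCA",
    "ISG15": "CAGTTTCGGTTTCCC",
    "Factor": "CAGTTTCTGTTTCCT",
    "Tla": "TAGTTTCACTTTTTG",
    "GBP": "TACTTTCAGTTTCAT",
    "ISG20": "ATCTTTGACTTTGTC",
}


def conservation_line(seqs):
    """Return '*' for columns where all sequences agree, ' ' otherwise."""
    return "".join("*" if len(set(col)) == 1 else " " for col in zip(*seqs))


stars = conservation_line(list(ISRE_ALIGNMENT.values()))
print(stars)
```

The two fully conserved blocks are the two "TTT" triplets at columns 4 to 6 and 10 to 12.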
Result for INDO gene (2 ISREs)
Alignments of top-scoring domains:
INDO: domain 1 of 2, from 4901 to 4915: E = 0.0097
*-> g g g a a a . t g a a a c t a <-*
+ g a a a + t g a a a c + a
INDO 4901 TAGAAA a TGAAACCA 4915
INDO: domain 2 of 2, from 5370 to 5384: E = 0.18
*-> g g g a a a . t g a a a c t a <-*
g ++ a a + g a a a c t a
INDO 5370 TGAGAA a GGAAACTA 5384
negative strand
Iterative Model Refinement
• Start with a model built from sequences S1 … Sm
1. Look for more occurrences
– n sequences were significant (may be functional), which would give S1 … Sm+n
– But that is too many to add
2. Rank the new sequences
3. Add the top k sequences
– Add only the relevant k sequences
• The model built from S1 … Sm+k is the new model for the next iteration
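The loop described on this slide (search, rank, add the top k, rebuild) can be sketched as follows. This is a toy illustration, not the slide's actual implementation: identity to a column-wise majority consensus stands in for the hmmsearch score, and `refine`, `threshold`, and the example sequences are all hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Column-wise majority base of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def identity_score(model_seqs, candidate):
    """Fraction of positions where the candidate matches the model consensus."""
    cons = consensus(model_seqs)
    return sum(a == b for a, b in zip(cons, candidate)) / len(cons)


def refine(seed, candidates, k, threshold, iterations):
    """Each round: find significant new hits, rank them, add only the top k."""
    model = list(seed)
    for _ in range(iterations):
        hits = [c for c in candidates
                if c not in model and identity_score(model, c) >= threshold]
        if not hits:
            break
        hits.sort(key=lambda c: identity_score(model, c), reverse=True)
        model.extend(hits[:k])  # the enlarged model is used in the next round
    return model


# Toy data (assumed): three seed motifs and a pool of candidate occurrences.
seed = ["AGTTTC", "AGTTTG", "TGTTTC"]
pool = ["AGTTTA", "CGTTTA", "GGGCCC", "AGTTTT"]
refined = refine(seed, pool, k=1, threshold=0.8, iterations=2)
print(refined)
```

Capping each round at k additions keeps a single noisy search from flooding the model with marginal hits.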
hmmsearch results (ISRE)
cutoff         iteration   up-regulated   random   down-regulated
e-val < 0.01   1           6              2        0
               2           22             4        1
e-val < 0.1    1           53             11       16
               2           82             25       28
hmmsearch results (GAS)
cutoff         iteration   up-regulated   random   down-regulated
e-val < 0.1    1           0              0        0
               2           23             7        19
e-val < 0.3    1           9              2        7
               2           72             37       52
Problems of hmmsearch
• Number of significant motifs detected
– ISRE >>> GAS (in terms of e-value)
• Cannot tell whether the detected motifs are functional or not
– E-value is the only measure
• GAS overlap between different gene groups
– 25% between up-regulated and random
• As in previous slides, occurrences detected from the different gene groups are hard to distinguish
Log Likelihood Method
• Calculate a score for each detected motif to tell whether it is functional, and to discriminate gene groups
– Score = log (M(+) / M(-))
– M(+) … known motifs, M(-) … false motifs
– 1 pseudo count for each nucleotide per 10 sequences
• If the log-likelihood score for the given motif is
– positive … the motif is functional if it also has a significantly low e-value
– negative … the motif is not functional
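A minimal sketch of the score, under stated assumptions: M(+) and M(-) are treated as per-column base frequencies, the score is the sum over positions of the natural log of their ratio, and "1 pseudo count for each nucleotide per 10 sequences" is read as a pseudocount of (number of sequences / 10) per base. The slides do not spell out these details, and the short sequences below are toy inputs.

```python
import math

BASES = "ACGT"


def column_probs(seqs):
    """Per-column base probabilities with n/10 pseudocounts per base (assumed)."""
    n = len(seqs)
    pseudo = n / 10.0
    total = n + len(BASES) * pseudo
    return [{b: (col.count(b) + pseudo) / total for b in BASES}
            for col in zip(*seqs)]


def log_likelihood_score(motif, pos_seqs, neg_seqs):
    """Sum over positions of log(P(base | M+) / P(base | M-))."""
    pos = column_probs(pos_seqs)
    neg = column_probs(neg_seqs)
    return sum(math.log(p[b] / q[b]) for b, p, q in zip(motif, pos, neg))


# Toy inputs (assumed): two "known" and two "false" six-base motifs.
known = ["CAGTTT", "TAGTTT"]
false = ["TACTTT", "AGGCTT"]
score_known = log_likelihood_score("CAGTTT", known, false)
score_false = log_likelihood_score("AGGCTT", known, false)
print(round(score_known, 2), round(score_false, 2))
```

The pseudocounts keep every probability nonzero, so a base never seen in one model does not produce a division by zero or an infinite score.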
Concept of Models (+/-)
1. Build Model(+) from the list of known & functional motifs
ISRE1 CAGTTT..
ISRE2 TAGTTT..
GAS1 TTTCAA..
2. Search occurrences of M(+) in negative model
3. Build Model(-) from the list of false positive motifs
ISRE1 TACTTT..
ISRE2 AGGCTT..
GAS1 TATGAA..
Base Composition Tweaking
• All known functional ISREs have two “TTT”s
– Without tweaking, a motif with a “TTT” and a “TCC” will receive a high log-likelihood score
• To solve this problem, we look for high-percentage nucleotides, and make them dominant
– Example: base composition of a certain column
base   before   after tweak
A      3%       0.1%
G      14%      0.1%
C      12%      0.1%
T      71%      99.7%
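The tweak in the example can be sketched as a per-column transformation. The slide gives only the before and after values, so the 0.7 trigger threshold here is an assumption, as are the function and parameter names; the 0.1% floor and 99.7% dominant value follow the example.

```python
def tweak_column(freqs, threshold=0.7, floor=0.001):
    """If one base dominates a column, push it to near-certainty.

    threshold is an assumed cutoff; floor matches the 0.1% in the example.
    """
    top = max(freqs, key=freqs.get)
    if freqs[top] < threshold:
        return dict(freqs)  # no dominant base: leave the column unchanged
    tweaked = {base: floor for base in freqs}
    tweaked[top] = 1.0 - floor * (len(freqs) - 1)
    return tweaked


column = {"A": 0.03, "G": 0.14, "C": 0.12, "T": 0.71}
tweaked = tweak_column(column)
print(tweaked)

flat = {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25}
untouched = tweak_column(flat)
```

After the tweak, a motif carrying "C" where the column is dominated by "T" is penalized far more heavily, which is the effect the slide describes for "TCC" versus "TTT".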
Iteration and Model Refinement
[Figure: Model(+), built from S(+)1 … S(+)n, and Model(-), built from S(-)1 … S(-)n, are refined in the first iteration and again in the second iteration.]
Up-regulated vs. Random
[Figure: log-likelihood score (y-axis, -3.5 to 1) versus iterations (x-axis, 1 to 3). Series: ISRE (positive), ISRE (negative), GAS (positive), GAS (negative); annotations: up-regulated genes AVG, random genes AVG.]
Search Result of HISPIG
• Numerous potentially functional ISRE and GAS were detected from the 100 most up-regulated genes (both known and unknown)
– Approximately 80% of the genes had either a functional ISRE or GAS
– Numerous genes contain unknown functional motifs that match other gene expression experiments previously shown in journals
• All motifs included in the model were concluded to be functional
Improvement of log-likelihood
• Re-aligning process of model refinement
– Rank sequences that match criteria by
1. e-value
2. log-likelihood score
3. both (not easy to implement algorithm)
– Convincing if 2. works better than others
• Which model to refine each iteration
– Only positive? Only negative? Both?
Measuring the Reliability of the Program
• Best Way – Do wet lab experiments to see if a detected unknown motif is really functional
• Alternative
1. Remove some known and functional sequences from the initial model
2. See if the program still detects those in the end
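The alternative check can be sketched as a hold-out test: withhold some known motifs, run the detection on the remaining seed, and report whether each withheld motif is recovered. Everything below is hypothetical; in particular `toy_pipeline` (a Hamming-distance match) only stands in for the real search and refinement, and the motif names and sequences are made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def toy_pipeline(seed_seqs, pool, max_dist=1):
    """Stand-in detector: pool sequences within max_dist of any seed motif."""
    return {s for s in pool if any(hamming(s, m) <= max_dist for m in seed_seqs)}


def holdout_check(known, held_out, pool, pipeline=toy_pipeline):
    """Remove held-out motifs from the seed, then test whether they are found."""
    seed = [seq for name, seq in known.items() if name not in held_out]
    detected = pipeline(seed, pool)
    return {name: known[name] in detected for name in held_out}


# Hypothetical motifs; the pool is where the pipeline searches.
known = {"motifA": "AGTTTC", "motifB": "AGTTTG", "motifC": "CCCGGG"}
result = holdout_check(known, held_out=["motifB", "motifC"],
                       pool=list(known.values()))
print(result)
```

A recovered motif corresponds to a YES in a detection table, and a missed one to a NO.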
Reliability Experiment (ISRE)
gene name   detected   e-value   log-likelihood   result
INDO        YES        0.23      4.28             FAIR
INDO        YES        0.097     2.74             GOOD
ISG20       NO                                    BAD
BF          YES        0.057     5.90             GOOD
IFIT2       YES        0.011     5.88             GOOD
G1P3        YES        0.0033    5.06             GOOD
G1P3        YES        0.0039    5.54             GOOD
CXCL10      YES        0.43      4.31             FAIR
OAS1        YES        0.01      4.68             GOOD
Acknowledgements
Sun Kim
Milton Taylor
Stuart Young