HISPIG A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions...

HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions

Takuma [email protected]

Milton Taylor Laboratory

• Using microarrays and bioinformatics technologies to develop better treatments for HCV (Virahep-C project)
– The only known treatment for HCV is interferon-alpha (IFN-a) or, more recently, combination treatment with pegylated IFN-a and Ribavirin
– Interferons were discovered as proteins that inhibit virus replication, and are induced in mammalian cells in response to virus infection

PBMC Experiment

• PBMCs were isolated from a group of healthy individuals and treated with IFN-a alone or with Ribavirin.
• In the microarray results, the expression of a large number of genes was either up-regulated or down-regulated
– It was of interest to analyze the upstream regions of these genes for the presence of motifs (ISRE and GAS)

Goal of My Project

• Build a computer model that effectively searches for ISRE and GAS sequences in human genes
– ISRE and GAS both work as promoter elements
– ISRE drives the expression of most type I IFN-stimulated genes (and some gamma)
– GAS drives the expression of type II IFN-stimulated genes
– Genes that contain ISRE / GAS are expressed more with IFN than ones that do not
– Generalize to be able to search for any motif in the future

Type I IFN Signal Transduction

[Figure: type I IFN signal transduction pathway, steps 1 and 2. Labels: IFN, TYK2, JAK1, STAT1, STAT2, P, heterodimer, p48 (IRF-9), ISGF3, ISRE, transcription. Compartments: cytoplasm, nucleus.]

The Situation

• We have a list of known motifs to refer to
– Numerous ISRE and GAS are known and published
• We have sets of sequences from microarray experiments that are
– likely to contain motifs … S1 (up-regulated genes)
– unlikely to contain motifs … S2 (down-regulated genes and random genes)
• To detect motifs, build a model M(+) using the list of known motifs
– Occurrences of the model will be detected in both S1 and S2

How to Solve

• Still, it is difficult to accurately predict motifs
– Motifs are short and divergent
– So, occurrences in S1 and S2 are difficult to distinguish
• We overcome this problem with a discriminative model refinement approach
– We make two models:
  M(+) … from known motifs
  M(-) … from false motifs
– Iteratively refine the models, and separate the occurrences in S1 and S2

HISPIG

http://beet.bio.indiana.edu/takuma/HISPIG.html

Methods Used

• HMMER

• Log-likelihood Method

• Both with iterative model refinement approach

HMMER

• Detects ISRE and GAS sequences (in up-regulated genes, down-regulated genes and random genes)
1. Build a model with hmmbuild from a list of known and functional motifs from journals → hmm consensus sequence
2. Parse the promoter region of each gene
3. Look for occurrences of the consensus within the promoter regions of the three gene groups with hmmsearch

Alignment File (.aln)

• List of known motifs – as .aln file
• Example of ISRE:

IP10    AGGTTTCACTTTCCA
ISG15   CAGTTTCGGTTTCCC
Factor  CAGTTTCTGTTTCCT
Tla     TAGTTTCACTTTTTG
GBP     TACTTTCAGTTTCAT
ISG20   ATCTTTGACTTTGTC
           ***   ***
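As an illustration of what the asterisks under the alignment mark, here is a minimal Python sketch that counts bases per column of the six ISRE sequences above and reports the columns where every sequence agrees. The names `ISRE_ALIGNMENT`, `column_counts` and `conserved_columns` are hypothetical helpers for this example only; they are not part of HMMER or HISPIG.

```python
from collections import Counter

# Known ISRE motifs from the alignment slide (name -> 15-base sequence).
ISRE_ALIGNMENT = {
    "IP10":   "AGGTTTCACTTTCCA",
    "ISG15":  "CAGTTTCGGTTTCCC",
    "Factor": "CAGTTTCTGTTTCCT",
    "Tla":    "TAGTTTCACTTTTTG",
    "GBP":    "TACTTTCAGTTTCAT",
    "ISG20":  "ATCTTTGACTTTGTC",
}

def column_counts(seqs):
    """Return one Counter of base counts per alignment column."""
    seqs = list(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(s[i] for s in seqs) for i in range(length)]

def conserved_columns(seqs):
    """Return 0-based indices of columns where all sequences share one base."""
    return [i for i, c in enumerate(column_counts(seqs)) if len(c) == 1]

if __name__ == "__main__":
    cols = conserved_columns(ISRE_ALIGNMENT.values())
    # Print an asterisk line like the one under the alignment.
    print("".join("*" if i in cols else " " for i in range(15)))
```

The per-column counts are also the raw material for a position-specific model, which is why the two fully conserved "TTT" blocks dominate any model built from this list.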

Result for INDO gene (2 ISREs)

Alignments of top-scoring domains:

INDO: domain 1 of 2, from 4901 to 4915: E = 0.0097
  *-> g g g a a a . t g a a a c t a <-*
      + g a a a + t g a a a c + a
  INDO 4901 TAGAAA a TGAAACCA 4915

INDO: domain 2 of 2, from 5370 to 5384: E = 0.18
  *-> g g g a a a . t g a a a c t a <-*
      g ++ a a + g a a a c t a
  INDO 5370 TGAGAA a GGAAACTA 5384

(negative strand)
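Because the hits above are annotated as lying on the negative strand, a scan has to cover both strands of the promoter. The following Python sketch shows one way to do that by also searching for the reverse complement of the motif. It uses exact string matching purely as a simplified stand-in for hmmsearch, and `reverse_complement` and `scan_both_strands` are hypothetical names introduced for this example.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_both_strands(promoter, motif):
    """Yield (strand, 1-based start) for exact motif matches on either strand.

    Negative-strand hits are reported in positive-strand coordinates.
    """
    promoter = promoter.upper()
    motif = motif.upper()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = promoter.find(pattern)
        while start != -1:
            yield strand, start + 1
            start = promoter.find(pattern, start + 1)
```

For instance, the model consensus shown in the output, read without its insert column (`GGGAAATGAAACTA`), is the reverse complement of `TAGTTTCATTTCCC`, which resembles the ISRE sequences in the alignment file.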

Iterative Model Refinement

Model (S1 … Sm)
1. look for more occurrences
2. rank the new sequences
3. add top k sequences

• n sequences were significant (may be functional) → Model (S1 … Sm+n)
• But that is too many to add
• Let’s add only the relevant k sequences → Model (S1 … Sm+k)
• This is my new model for the next iteration
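The loop of looking for more occurrences, ranking them, and adding only the top k can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HISPIG implementation: `refine_model` and `mean_identity` are hypothetical names, the score is a toy identity measure standing in for an hmmsearch e-value or log-likelihood score, and the significance threshold is an arbitrary parameter.

```python
def refine_model(model, candidates, score, threshold, k, iterations):
    """Grow a motif model by adding the top-k significant candidates each round.

    model      -- list of sequences currently in the model
    candidates -- pool of sequences found in the gene set
    score      -- callable (model, sequence) -> float, higher is better
    threshold  -- minimum score for a candidate to count as significant
    """
    model = list(model)
    pool = [c for c in candidates if c not in model]
    for _ in range(iterations):
        # 1. look for more occurrences: score every candidate with the current model
        hits = [(score(model, c), c) for c in pool]
        significant = [h for h in hits if h[0] >= threshold]
        if not significant:
            break
        # 2. rank the new sequences
        significant.sort(key=lambda h: h[0], reverse=True)
        # 3. add only the top k of the n significant sequences
        added = [c for _, c in significant[:k]]
        model.extend(added)
        pool = [c for c in pool if c not in added]
    return model

def mean_identity(model, seq):
    """Toy score: average fraction of positions matching each model sequence."""
    return sum(
        sum(a == b for a, b in zip(m, seq)) / len(seq) for m in model
    ) / len(model)
```

Capping each round at k additions keeps weak hits from diluting the model before the next search, which is the point of adding fewer than the n significant sequences.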

hmmsearch results (ISRE)

group         iterations  up-regulated  random  down-regulated
e-val < 0.01  1            6             2       0
              2           22             4       1
e-val < 0.1   1           53            11      16
              2           82            25      28

hmmsearch results (GAS)

group         iterations  up-regulated  random  down-regulated
e-val < 0.1   1            0             0       0
              2           23             7      19
e-val < 0.3   1            9             2       7
              2           72            37      52

Problems of hmmsearch

• Number of significant motifs detected
– ISRE >>> GAS (in terms of e-value)
• Cannot tell whether the detected motifs are functional or not
– E-value is the only measure
• GAS overlap between different gene groups
– 25% between up-regulated and random
• As in previous slides, occurrences detected from the different gene groups are hard to distinguish

Log Likelihood Method

• Calculate a score for each detected motif to tell whether it is functional, and to discriminate gene groups
– Score = log (M(+) / M(-))
– M(+) … known motifs, M(-) … false motifs
– 1 pseudocount for each nucleotide per 10 sequences
• If the log-likelihood score for a given motif is
– positive … the motif is functional if it also has a significantly low e-value
– negative … the motif is not functional
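To make the score concrete, here is a minimal Python sketch of a log-likelihood ratio between a positive and a negative model. Several details are assumptions of this example rather than facts from the slides: columns are treated as independent, the natural logarithm is used, and the pseudocount rule is read as "each base gets one pseudocount per 10 sequences, scaled linearly with the list size". The names `build_model` and `log_likelihood_score` are hypothetical.

```python
import math

BASES = "ACGT"

def build_model(seqs, pseudo_per_10=1.0):
    """Return per-column base probabilities with pseudocounts.

    Assumption: each base gets pseudo_per_10 pseudocounts for every
    10 sequences in the list (scaled linearly for other list sizes).
    """
    pseudo = pseudo_per_10 * len(seqs) / 10.0
    length = len(seqs[0])
    model = []
    for i in range(length):
        counts = {b: pseudo for b in BASES}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in BASES})
    return model

def log_likelihood_score(seq, positive, negative):
    """Score = log(P(seq | M(+)) / P(seq | M(-))), summed over columns."""
    return sum(
        math.log(positive[i][b] / negative[i][b]) for i, b in enumerate(seq)
    )
```

The pseudocounts keep every probability above zero, so a base never seen in one of the lists lowers the score sharply instead of making the ratio undefined.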

Concept of Models(+/-)

List of known & functional motifs
ISRE1 CAGTTT..
ISRE2 TAGTTT..
GAS1  TTTCAA..
1. build model → Model(+)

2. search occurrences of M(+) in negative model

List of false positive motifs
ISRE1 TACTTT..
ISRE2 AGGCTT..
GAS1  TATGAA..
3. build model → Model(-)

Base Composition Tweaking

• All known functional ISREs have two “TTT”s
– Without tweaking, a motif with a “TTT” and a “TCC” will receive a high log-likelihood score
• To solve this problem, we look for high-percentage nucleotides and make them dominant
– Example: base composition of a certain column

         A      G      C      T
before   3%     14%    12%    71%
after    0.1%   0.1%   0.1%   99.7%   (tweak!)
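The tweak in the example column can be sketched in Python as below. The 99.7% target comes from the slide; the cutoff that decides when a base counts as "high percentage" is not given there, so the `threshold` default of 0.7 is purely an assumption, as is the helper name `tweak_column`.

```python
def tweak_column(freqs, threshold=0.7, dominant=0.997):
    """Make a high-percentage base dominant in one column.

    freqs     -- dict base -> frequency (summing to 1)
    threshold -- assumed cutoff for "high percentage"; not given on the slide
    dominant  -- frequency assigned to the dominant base (99.7% on the slide)
    """
    top = max(freqs, key=freqs.get)
    if freqs[top] < threshold:
        return dict(freqs)  # no dominant base: leave the column unchanged
    rest = (1.0 - dominant) / (len(freqs) - 1)
    return {b: (dominant if b == top else rest) for b in freqs}
```

Applied to the column above (A 3%, G 14%, C 12%, T 71%), this returns T at 99.7% and the other three bases at 0.1% each, so a motif with anything but T in that column is penalized heavily.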

Iteration and Model Refinement

[Figure: Model(+), built from S(+)1 … S(+)n, and Model(-), built from S(-)1 … S(-)n, pass through a first iteration (model refinement) and a second iteration (model refinement).]

Up-regulated vs. Random

[Figure: log-likelihood score (y-axis, -3.5 to 1) plotted against iterations 1 to 3 (x-axis). Series: ISRE(positive), ISRE(negative), GAS(positive), GAS(negative). Annotations: up-regulated genes AVG, random genes AVG.]

Search Result of HISPIG

• Numerous potentially functional ISRE and GAS motifs were detected in the 100 most up-regulated genes (both known and unknown)
– Approximately 80% of the genes had either a functional ISRE or GAS
– Numerous genes contain unknown functional motifs that agree with other gene expression experiments previously published in journals
• All motifs included in the model were concluded to be functional

Improvement of log-likelihood

• Re-aligning process of model refinement
– Rank sequences that match criteria by
  1. e-value
  2. log-likelihood score
  3. both (not easy to implement algorithm)
– Convincing if 2. works better than the others
• Which model to refine each iteration
– Only positive? Only negative? Both?

Measuring the Reliability of the Program

• Best Way – Do wet lab experiments to see if a detected unknown motif is really functional
• Alternative
1. Remove some known and functional sequences from the initial model
2. See if the program still detects those in the end
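The alternative check, removing known motifs and seeing whether they are found again, amounts to a hold-out test. The Python sketch below shows the bookkeeping only; `holdout_recovery` is a hypothetical name, and the `detect` callable is a placeholder for the full search-and-refine pipeline, which is not reproduced here.

```python
def holdout_recovery(known, held_out, detect):
    """Check which held-out known motifs a model built without them recovers.

    known    -- dict name -> sequence of known functional motifs
    held_out -- names removed from the initial model
    detect   -- callable (model_seqs, sequence) -> bool; stands in for the
                full search-and-refine pipeline
    """
    # Build the initial model from everything except the held-out motifs.
    model = [s for n, s in known.items() if n not in held_out]
    # Ask the detector whether each held-out motif is still found.
    return {n: detect(model, known[n]) for n in held_out}
```

A usage example would pass the list of known ISRE sequences as `known`, a few gene names as `held_out`, and a function wrapping the detection step as `detect`; the returned dictionary maps each held-out name to whether it was recovered.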

Reliability Experiment (ISRE)

gene name  detected  e-value  log-likelihood  result
INDO       YES       0.23     4.28            FAIR
INDO       YES       0.097    2.74            GOOD
ISG20      NO                                 BAD
BF         YES       0.057    5.90            GOOD
IFIT2      YES       0.011    5.88            GOOD
G1P3       YES       0.0033   5.06            GOOD
G1P3       YES       0.0039   5.54            GOOD
CXCL10     YES       0.43     4.31            FAIR
OAS1       YES       0.01     4.68            GOOD

Acknowledgements

Sun Kim
Milton Taylor
Stuart Young
