Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU...
-
Upload
stephen-brown -
Category
Documents
-
view
218 -
download
1
Transcript of Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU...
Promoter Prediction in E.coli using ANN
A.Krishnamachari
Bioinformatics Centre, JNU
Definition of Bioinformatics
• Systematic development and application of Computing and Computational solution techniques to biological data to investigate biological process and make novel observations
Research in BiologyResearch in Biology
OrganismFunctionsCellChromosomeDNASequences
General approach
Bioinformatics era
TF
TF -> Transcription Factor Sites
TSS
TSS->Transcription Start Sites
RBS
RBS -> Ribosome Binding sites
CDS
CDS - > Coding Sequence (or) Gene
intergenic
-10-35
Statement of the problem
• Given a set of known sequences pertaining to a specific biological feature , develop a computational method to search for new members or sequences
Computational Methods
• Pattern Recognition
• Pattern classification
• Optimisation Methods
Sequence Analysis AND
Prediction Methods• Consensus• Position Weight Matrix (or) Profiles• Machine Learning Methods
– Neural Networks– Markov Models– Support Vector Machines– Decision Tree– Optimization Methods
Statistical Learning
Consensus sequence TA TAA T
49%,54% and 58%
14 sites out of 291 sequences [Lisser and Margalitt]
Mismatches but which one?
Describing features using frequency matrices
Describing features using frequency matrices
• Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences
• Need to describe how often particular bases are found in particular positions in a sequence feature
Describing features using frequency matrices
Describing features using frequency matrices
• Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
Frequency matrices (continued)Frequency matrices (continued)
• Three uses of frequency matrices– Describe a sequence feature– Calculate probability of occurrence of feature
in a random sequence– Calculate degree of match between a new
sequence and a feature
Frequency Matrices, PSSMs, and Profiles
Frequency Matrices, PSSMs, and Profiles
• A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores
• PSSMs also called Position Weight Matrixes (PWMs) or Profiles
Methods for converting frequency matrices to PSSMs
• Using log ratio of observed to expected
where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)
• Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]
score(i) logm( j,i) / f ( j)
Finding occurrences of a sequence feature using a Profile• As with finding occurrences of a
consensus sequence, we consider all positions in the target sequence as candidate matches
• For each position, we calculate a score by “looking up” the value corresponding to the base at that position
Nucleotides
1 2 3 4 5
A x11 x21 x31 x41 x51
T x12 x22 x32 x42 x52
G x13 x23 x33 x43 x53
C x14 x24 x34 x44 x54
Positions (Columns in alignment)
TAGCT AGTGC x12 + x21 + x33 + x44 + x52
if is above a threshold it is a site
V1V1
Building a PSSMBuilding a PSSM
PSSM builder
Set of Aligned Sequence Features
Expected frequencies of each sequence element
PSSM
Searching for sequences related to a family with a PSSM
Searching for sequences related to a family with a PSSM
PSSM search
PSSM
Set of Sequences to search
Sequences that match above threshold
Threshold
Positions and scores of matches
PSSM builder
Set of Aligned Sequence Features
Expected frequencies of each sequence element
Consensus sequences vs.
frequency matrices
Consensus sequences vs.
frequency matrices• consensus sequence or a frequency
matrix which one to use?– If all allowed characters at a given position
are equally "good", use IUB codes to create consensus sequence
• Example: Restriction enzyme recognition sites
– If some allowed characters are "better" than others, use frequency matrix
• Example: Promoter sequences
Consensus sequences vs.
frequency matrices
Consensus sequences vs.
frequency matrices• Advantages of consensus sequences:
smaller description, quicker comparison
• Disadvantage: lose quantitative information on preferences at certain locations
Measure1
Measure2
Linear Classification Problems
Measure 1
Measure 2
Nonlinear Classification Problem
(Artificial) Neural Network
What Is A Neural Network ?
A computational construct based on biological neuron. A neural network can:
•Learn by adapting its synaptic weights to changes in the surrounding environments; •handle imprecise, fuzzy, noisy, and probabilistic information; and •generalize from known tasks or examples to unknown ones.
Artificial neural networks (ANNs) attempt to mimic some,or all of these characteristics.
Neural Network
• Characterised by:
- its pattern of connections between the neurons (Network Architecture)
- its method of determining the weights on the connections (training or learning algorithm)
Why Neural Network:Applications
-Little or no incomplete understanding of the problem to be solved (very little theory)
-Abundant data available
Neural Networks: Applications
• Pattern classification
• Speech synthesis and recognition
• Image compression
• Clustering
• Medical Diagnosis
• Manufacturing
Neural Network:Bioinformatics
• Binding sites prediction
• Protein Secondary Structure prediction
• Protein folds
• Micro array data clustering
• Gene prediction
Neural Networks
• Supervised Learning
• Unsupervised Learning
inputs Output
3
1
2
W1,3
W2,3
Layer 1 (input) Layer 2 (output)
Direction of information flow
Summation Operation
xi * wij=x1*w1+x2*w2+x3w3….+xnWnj
Thresholding functions
Output = 0 if x*w < TOutput = 1 if x*w >T
Threshold=0
1Output
0
Output1 1
T0
Input Output
Logistic Transfer function
Output =
1
1 + e -
W(k+1) = w(k)+ µ[T9k) – w(k)x(k)]x(k) for 0 ≤ k≤ N-1
Weight updates
Learning Concepts
• Generally – the target output is 1 for +ve– The target output is 0 for –ve– But practically (0.9, 0.1) combination
• Stopping criterion
Based on certain epochs or cycles
Based on certain error estimates
12
K-1
K
A T G C
Nucleic Acid
PositionIn a sequenceOf K nucleotides
let the following binary values represent each baseA="0001C="0010G="0100T="1000
thenG = 4A or C = "0011 = 3A,G or T = "1101 = 13etc.
0 0 10
0 1 0 0
0 1 0 0
1 0 0 0
1 A
2 G
K-1 G
K T
A T G C
Nucleic Acid
PositionIn a sequenceOf K nucleotides
Wi,j
P=50
N=50
TP + TN =100
Note: Training and Test sequences are fixed length
N=500
n=10
1 2 …………501 2 …………50
TRAININGC G T A G C T A T A G T G G GT T T A A A C C C A A G A A TT A T G G A A T T T G G A A GT T T A G G A T A G C A C A GG A T A A G G C C T A G A T AT T T A T G C A T G A G A T G
TEST
C C T G A A C T G A G A T G A T A T A T A A G T G A A A T T C C G
PredictionMethod
Output
Input
-1
Input Layer
Hidden layer
Output Layer
weight
A B(local) (global)
Error function
ExampleDisease A diagnosis:
x – gene expression data (vector of numbers)
f(x) – A positive / A negative (boolean 0/1)
<x,f(x)> - set of known values
NN
<x,f(x)>
Learning
NN
x
h(x)
Recognition
Evaluation Mechanism
• Sensitivity = TP
TP + FN
Specificity = TN
TN + FP
(TPxTN – FPxFN)
(TP+FP)x(FP+TN)x(TN+FN)x(FN+TP)
C =
Cross - Validation
• Benchmarking the Network Performance• Step:1
Divide the training set into “N” partitions• Step: 2
Train the “N-1” partitions and
Test the Left out • Step: 3
Evaluate the Performance
1. Nucleic Acids Res. 1992 Aug 25;20(16):4331-8.
An assessment of neural network and statistical approaches for prediction of E. coli promoter sites.
Horton PB, Kanehisa M.
2. Nucleic Acids Res. 1994 Jun 11;22(11):2158-65.
Analysis of E.coli promoter structures using neural networks.
Mahadevan I, Ghosh I.
Training back-propagation neural networks to define and detect DNA-binding sites.Nucleic Acids Res. 1991 Jan 25;19(2):313-8.
Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes.Nucleic Acids Res. 1992 Jul 11;20(13):3471-7.
A general procedure for locating and analyzing protein-binding sequence motifs in nucleic acids.Proc Natl Acad Sci U S A. 1998 Sep 1;95(18):10710-5.
Pedersen AG, Engelbrecht Investigations of Escherichia colipromoter sequences with artificial neural networks: new signals discovered upstream of the transcriptional startpoint.Proc Int Conf Intell Syst Mol Biol. 1995;3:292-9.
O’Neill MC
Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequencesSiu-wai Leung, Chris Mellish and Dave Robertson
R Hershberg, G Bejerano, A Santos-Zavaleta and H Margalit
PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptionalstart sites.
Nucleic Acids Research 2001, 29: 277
E.Coli PROMOTER DATA
Earlier Studies
• Data set corresponding to 17 spacing class
• Very high threshold values
Network
Model
Performance pBR322(CW) pBR322(CCW)
17-1 34/35 -5, 339,477,1584,1657,1970,4130
125,805,1021,1226,4278
17-2 33/35 …. ….
17-3 34/35
17-4 32/35
Poll(4/4) -5, 1584, 1970,4130
125,807,1226,4278
New New
New
New
NEW
Disadvantages of MLP
• i) Slow convergence,
• ii) Training relies heavily on the choice of the number of hidden layers and
• iii) Mode of prediction is generally based on high threshold values.
Improvements in Prediction
• Pre processing of the data based on DNA structure
• Clean Model
Structural Atlas of E.coli
: J Mol Biol. 2000 Jun 16;299(4):907-30.
A DNA structural atlas for Escherichia coli.
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW.
1) The size of the positive data set may be increased by incorporating point mutations in non-sensitive positions
2) Negative data sets are generated by several ways i.e.
a) Shuffling or randomising the positive data set. This does not destroy the the correlations between the bases completely.
b) Using random sequences with a biased composition,
c) Extracting the sequences from gene coding segments.
3) For the learning phase, the number of positive and negative input vectors are not generally proportionate and there is no standard prescription
4) Convergence factor and predictive ability depend on the size and the number of input vectors .