Advanced Tools and Algorithms in Bioinformatics Chittibabu Guda Summer, 2004 UCSD Extension,...

Advanced Tools and Algorithms

in Bioinformatics

Chittibabu Guda

Summer, 2004

UCSD Extension, Department of Biosciences

Today’s Topics

• Hidden Markov Models (HMMs)

• Predicting sub-cellular localization of proteins

• Predicting post-translation modification sites

• Using Standalone tools

• Current Trends in Bioinformatics

Hidden Markov Models

HMMs for biological sequences

• The hidden Markov model (HMM) is a statistical model that was developed mostly for speech recognition.

• The most popular use of HMMs in molecular biology is as a ‘probabilistic profile’ of a protein family, which is called a profile HMM.

• Apart from this, HMMs are also used for multiple sequence alignment, gene prediction (ORF finding), and protein structure prediction.

• Advantages: statistically sound; no sequence ordering or gap penalties are required.

• Limitations: a large number of similar sequences is required to get good models.

Stochastic modeling of biological sequences

For example, a profile is a position-specific scoring matrix.

• Given this model, the probability of CGGSV is: 0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031

• Multiplying many small probabilities is computationally expensive and prone to floating-point underflow, so the calculation is transformed into log space.

• The score is calculated by taking the logs of all amino acid probabilities and adding them up: ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
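The arithmetic on this slide can be checked with a short Python sketch. The five probabilities are the ones quoted above; the function names are illustrative only.

```python
import math

# Per-position probabilities of C, G, G, S, V, as quoted on the slide.
probs = [0.8, 0.4, 0.8, 0.6, 0.2]


def product_probability(ps):
    """Multiply the per-position probabilities directly."""
    result = 1.0
    for p in ps:
        result *= p
    return result


def log_score(ps):
    """Sum natural logs instead of multiplying the probabilities."""
    return sum(math.log(p) for p in ps)


print(round(product_probability(probs), 3))  # 0.031
print(round(log_score(probs), 2))            # -3.48
```

For a sequence of a few hundred residues the direct product would shrink toward zero, while the sum of logs stays in a comfortable numeric range.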

Stochastic modeling of biological sequences

But with this expression it is not possible to distinguish between the highly implausible sequence T G C T - - A G G and the consensus sequence A C A C - - A T C.

The HMM architecture

• S - start; E - end

• m - main state (matches/mismatches)

• i - insert state

• d - delete state

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

Parameters used in HMM building

• Transition probability: Tij (average 0.333)

• Emission probability: Ei (average 0.05)

M N - F L S
M N - F L S
M N K Y L T
M Q - W - T

(Figure: state labels m, i, d, m)

• Since the probabilities are very small numbers, they are converted to log-odds scores, which are added to give the overall score

Markov modeling of biological sequences

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

Markov modeling of biological sequences

Sequence              P(s)*100
A C A - - - A T G     3.3
T C A A C T A T C     0.0075
A C A C - - A G C     1.2
A G A - - - A T C     3.3
A C C G - - A T C     0.59
A C A C - - A T C     4.7

P(ACACATC) = 0.047, obtained by taking the product of the probabilities for the residues in each state and for the transitions.
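The product described here can be sketched in Python. The model figure is not part of this transcript, so every emission and transition value below is an assumption, chosen so that the products reproduce the P(s)*100 values listed for the sequences; the names are illustrative.

```python
# ASSUMED parameters (the slide's model figure is not in the transcript).
# Six main states; one insert state sits between main states 3 and 4.
MAIN_EMISSIONS = [
    {"A": 0.8, "T": 0.2},  # main state 1
    {"C": 0.8, "G": 0.2},  # main state 2
    {"A": 0.8, "C": 0.2},  # main state 3
    {"A": 1.0},            # main state 4
    {"T": 0.8, "G": 0.2},  # main state 5
    {"C": 0.8, "G": 0.2},  # main state 6
]
INSERT_EMISSIONS = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
ENTER_INSERT = 0.6  # main state 3 -> insert state
SKIP_INSERT = 0.4   # main state 3 -> main state 4
STAY_INSERT = 0.4   # insert state -> insert state
LEAVE_INSERT = 0.6  # insert state -> main state 4
# All other main-to-main transitions are taken to have probability 1.


def path_probability(aligned_row):
    """Probability of one aligned row: 3 main, 3 insert, 3 main columns."""
    cols = aligned_row.split()
    main_residues = cols[:3] + cols[6:]
    inserted = [c for c in cols[3:6] if c != "-"]

    p = 1.0
    for emissions, residue in zip(MAIN_EMISSIONS, main_residues):
        p *= emissions.get(residue, 0.0)

    if inserted:
        p *= ENTER_INSERT * LEAVE_INSERT
        p *= STAY_INSERT ** (len(inserted) - 1)
        for residue in inserted:
            p *= INSERT_EMISSIONS[residue]
    else:
        p *= SKIP_INSERT
    return p


for row in ["A C A - - - A T G", "T C A A C T A T C", "A C A C - - A T C"]:
    print(row, 100 * path_probability(row))
```

With these assumed values, the consensus row A C A C - - A T C comes out at about 0.047, in line with the figure quoted above.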

Sequence Alignment and Database Search using HMMER

Multiple Alignment

Build a Profile HMM

Database search

Multiple alignments

Query against Profile HMM database

(PFAM database)

HMMSEARCH Results (on voltage-gated ion channel proteins database)

PFAM http://pfam.wustl.edu

• Protein Family Database created using HMMs

• Pfam-A contains functionally annotated families (~7500)

• Pfam-B contains unannotated families (~107000)

• All protein sequences were clustered into families based on sequence identity

• For each family, non-redundant, full-domain seed members were selected to represent the family

• Seed multiple alignments were built using ClustalW and manual checking

• HMMs were built using hmmbuild (part of the HMMER suite of programs)

• Using these models, more family members were added iteratively: new members are added to the multiple alignment and the HMM is updated, until no new members are found

http://pfam.wustl.edu/

How to build and use Profile HMMs

• Get a family of seed sequences in a multiple alignment

• Build a hidden Markov model from the alignment using hmmbuild

• Use the HMM as a query to find remote homologues in the sequence database using hmmsearch

• Add new sequences to the seed alignment using hmmalign and update the model, iteratively

• Get the consensus sequence of the model using hmmemit

• Search new query sequences against a database of HMMs using hmmpfam to find whether they are related to a model
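The steps above map onto one command line per HMMER program. The sketch below only assembles the argument lists and does not run anything; the file names are hypothetical placeholders, and the argument order and the `-c` (consensus) option of hmmemit are written from memory of the HMMER 2 command-line usage, so check them against the installed version.

```python
def hmmer_workflow(seed_alignment, hmm_file, sequence_db,
                   new_sequences, hmm_database, query_sequences):
    """Return the command lines for the profile HMM workflow, in order."""
    return [
        # Build a profile HMM from the seed alignment.
        ["hmmbuild", hmm_file, seed_alignment],
        # Search a sequence database with the HMM as the query.
        ["hmmsearch", hmm_file, sequence_db],
        # Align newly found sequences to the model.
        ["hmmalign", hmm_file, new_sequences],
        # Emit a consensus sequence from the model (-c is an assumed option).
        ["hmmemit", "-c", hmm_file],
        # Search query sequences against a database of HMMs such as Pfam.
        ["hmmpfam", hmm_database, query_sequences],
    ]


# Hypothetical file names, for illustration only.
commands = hmmer_workflow("seed.aln", "family.hmm", "proteins.fasta",
                          "new_members.fasta", "Pfam_db", "query.fasta")
for argv in commands:
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run` once HMMER is installed. In HMMER 3 the role of hmmpfam is taken by hmmscan.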

SledgeHMMER web server

• Accessible at http://SledgeHMMER.sdsc.edu

• The Pfam database is the largest protein functional-domain database built with hidden Markov models

• This server provides quick access to pre-calculated Pfam results for 1.2 million protein sequences (the entire SP+TrEMBL databases)

• Sequences are compared by their MD5 hexadecimal digests, computed in Perl

• The web server is implemented with a Perl/CGI interface
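The idea of matching sequences by MD5 digest can be sketched as follows. The server itself is described as using Perl; this is a Python illustration with the standard `hashlib` module, and the lookup table, its contents and the normalization step are hypothetical.

```python
import hashlib


def sequence_digest(sequence):
    """MD5 hex digest of a sequence, after removing whitespace and upper-casing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Hypothetical table of pre-calculated results, keyed by sequence digest.
precomputed = {
    sequence_digest("MNFLS"): ["pre-calculated result for MNFLS"],
}


def lookup(sequence):
    """Return pre-calculated results for an identical sequence, or None."""
    return precomputed.get(sequence_digest(sequence))


print(lookup("mnfls\n"))  # same sequence, different formatting
print(lookup("MNKYLT"))   # not in the table
```

Comparing fixed-length digests avoids comparing full sequences against every stored entry; only sequences with no stored digest would need a fresh search.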

http://sledgehmmer.sdsc.edu/

Predicting sub-cellular localization of proteins

Different cellular compartments

(modified from Voet & Voet, Biochemistry; Weinheim, New York, Basel, Wiley-VCH 1992)

Methods to predict sub-cellular location

• Based on amino acid composition

• Based on signal or target peptides

  • PSORT

  • TargetP

• Based on domain occurrence patterns

  • MITOPRED

• Based on lexical analysis

Amino acid compositional differences in different sub-cellular locations

PSORT (http://psort.ims.u-tokyo.ac.jp/)

• The PSORT program is based on comprehensive knowledge of protein sorting

• Different parameters relevant to different groups of species are determined

• Bacterial sequences

  • N-terminal signal sequence (Positive-H region)/cleavage site

  • Transmembrane segments

  • Lipoprotein analysis

  • Amino acid composition

http://psort.ims.u-tokyo.ac.jp/

PSORT continued …

• Eukaryotic sequences (Yeast/Animal/Plant)

  • N-terminal signal sequence (Positive-H region)/cleavage site

  • Transmembrane segments and membrane topology

  • Mitochondrial targeting signals and amino acid composition (AAC) of the N-terminal 20 amino acids

  • Nuclear localization signals (NLS)

  • Peroxisome matrix targeting sequences (PTSs): (S/A/C)(K/R/H)L

  • Chloroplast targeting signals

  • Endoplasmic reticulum signals (KDEL, or HDEL in yeast)

  • Vesicular, lysosomal, vacuolar proteins etc.
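Two of the signals listed for eukaryotic sequences are short enough to write as regular expressions. The sketch below is not PSORT's implementation; it only shows the PTS pattern (S/A/C)(K/R/H)L and the KDEL/HDEL signal as patterns, and anchoring both at the C-terminal end of the sequence is an assumption made here.

```python
import re

# Patterns anchored at the C-terminus (the anchoring is an assumption here).
PTS_PATTERN = re.compile(r"[SAC][KRH]L$")
ER_PATTERN = re.compile(r"[KH]DEL$")


def c_terminal_signals(sequence):
    """Return the names of the C-terminal patterns found in the sequence."""
    seq = sequence.strip().upper()
    found = []
    if PTS_PATTERN.search(seq):
        found.append("PTS (S/A/C)(K/R/H)L")
    if ER_PATTERN.search(seq):
        found.append("ER signal (KDEL/HDEL)")
    return found


# Made-up toy sequences, for illustration only.
print(c_terminal_signals("MAAASKL"))
print(c_terminal_signals("MKTAKDEL"))
print(c_terminal_signals("MSKLAAA"))
```

Patterns this short match many unrelated proteins by chance, which is one reason such signals are combined with other parameters rather than used alone.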

MITOPRED (http://mitopred.sdsc.edu)

• A new method based on Pfam domain occurrence patterns, amino acid composition (AAC) and pI value differences between mitochondrial and non-mitochondrial proteins

• Eukaryotic cells have multiple compartments, and each set of pathways is localized to a specific compartment. Thus, a protein family involved in a specific pathway is expected in a specific compartment

• A knowledge base is developed by studying the occurrence and co-occurrence patterns of different Pfam domains in different cellular compartments

• The method compares the Pfam domains found in the query sequence against the knowledge base and assigns a score, depending on which compartment it belongs to

• Independent scores are calculated from the AAC and pI values of the query sequence by comparing them to the average values in different locations

• Final prediction is based on the combined score from AAC, pI and Pfam scores
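The AAC part of the comparison can be illustrated with a small sketch. The scoring formula used by MITOPRED is not given on these slides, so the distance below (a sum of absolute frequency differences) is a stand-in chosen for illustration, and the reference sequence is a made-up placeholder rather than a real location average.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Relative frequency of each of the 20 amino acids in the sequence."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def composition_distance(aac, reference):
    """Illustrative distance: sum of absolute frequency differences."""
    return sum(abs(aac[aa] - reference[aa]) for aa in AMINO_ACIDS)


# Hypothetical reference profile built from a toy sequence.
reference = amino_acid_composition("MLSRRAALKRGLAT")
query = amino_acid_composition("MLARRSALKKGLST")
print(composition_distance(query, reference))
```

A real knowledge base would hold one average composition per sub-cellular location, and the query would be compared against each of them.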

http://mitopred.sdsc.edu/

Comparison of AA composition across mitochondrial and cytoplasmic sequences

(Figure: relative frequencies, on a scale of 0 to 0.1, of the residues A C D E F G H I K L M N P Q R S T V W Y in M-sol and C-sol sequences; annotations mark residues that are more frequent in cytoplasmic and in mitochondrial sequences.)

pI value differences in different sub-cellular locations

(Figure: pI value, on a scale of 0 to 10, by cellular location: CYT, MIT, NUC, END, EXC, GOL, PLA, POX.)

Flowchart showing MITOPRED procedure

MITOPRED Web Server

• Accessible at http://mitopred.sdsc.edu

• Implemented using PERL/CGI interface

• Pre-calculated predictions are available for all eukaryotic proteins from the Swiss-Prot and TrEMBL databases (~500000)

• Genome-scale predictions can be downloaded for yeast, C. elegans, Drosophila, human, mouse and Arabidopsis

• Provides data for the Mitoproteome database accessible at http://www.mitoproteome.org

http://mitopred.sdsc.edu/

http://www.mitoproteome.org/

Prediction of sub-cellular location by lexical analysis

• Separate Swiss-Prot (SP) proteins into different sub-cellular classes based on annotation

• In each class, extract all unique keywords for each sequence

• The total number of unique keywords across all classes defines the size of the feature space (N)

• Generate a binary vector of length N for each sequence in each class: 1 if the keyword is present and 0 if it is absent.

• For the unknown protein, generate a binary vector in the same way, based on its keywords. From this, generate 2^k - 1 sub-vectors (where k is the number of keywords in the unknown) by flipping 1s to 0s.

• Based on the sub-vectors, retrieve all proteins with matching binary vectors from all classes.

• The unknown belongs to the class that contributes the largest number of sequences to the retrieved group.

• The method works better when more keywords are available and when the family is larger.
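The procedure above can be sketched in a few lines of Python. This is an illustration of the steps as listed, not the published implementation: a binary vector is represented here by the set of keywords whose entries are 1, which is equivalent for exact matching, and the classes and keywords in the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def sub_keyword_sets(keywords):
    """All 2^k - 1 non-empty subsets of the unknown protein's k keywords."""
    unique = sorted(set(keywords))
    return [frozenset(combo)
            for size in range(len(unique), 0, -1)
            for combo in combinations(unique, size)]


def predict_class(unknown_keywords, annotated):
    """Vote among annotated proteins whose keyword set equals a sub-vector."""
    subsets = set(sub_keyword_sets(unknown_keywords))
    votes = Counter(label for label, keywords in annotated
                    if frozenset(keywords) in subsets)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical annotated proteins: (class, keywords).
annotated = [
    ("nuclear", {"DNA-binding", "Transcription"}),
    ("nuclear", {"DNA-binding"}),
    ("mitochondrial", {"Transit peptide"}),
    ("cytoplasmic", {"Kinase"}),
]
print(predict_class(["DNA-binding", "Transcription", "Kinase"], annotated))
```

Because the number of sub-vectors grows as 2^k, enumerating them directly is only practical when the unknown protein has a modest number of keywords.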

Flow diagram of lexical analysis method (From Nair R, Rost B, Bioinformatics 18:S78-S86, 2002)

Predicting Post-translational Modification Sites of Proteins

General Method for PTM site Prediction

• PROSITE provides consensus patterns for many PTM sites; however, in most cases these patterns are very short, and the true modifications depend on the structural or environmental context in the protein fold

• For this reason, methods based on regular expressions or local alignment produce a large number of false positives

• Almost all PTM site prediction methods use artificial neural networks (ANNs).

• General procedure:

  • Prepare a dataset of sequences experimentally known to possess a given type of PTM site

  • Separate the dataset into training and testing data

  • Train a network using the training data and test it with the test set. This process is iterated until the model is well refined

• A sufficient number of training sequences and good-quality data are important for the success of any neural network method

Different Post-translational modifications (PTMs)

• Glycosylation

  • Asn (N)-glycosylation (NetNGlyc)

  • O-glycosylation (NetOGlyc)

• Sulfation (Sulfinator)

• Phosphorylation (NetPhos)

• Myristoylation (NMT)

Prediction of Glycosylation Sites (NetNGlyc, NetOGlyc)

• Glycoproteins are formed by covalent attachment of oligosaccharides to certain proteins at Asn (N-glycosylation) or at Ser or Thr (O-glycosylation) residues.

• These are usually exported to extra-cellular destinations, like mucin in the alimentary tract or glycoprotein hormones in the anterior pituitary gland.

• N-glycosylation

• O-glycosylation

  • No consensus pattern

  • SEA domain is associated with it

http://www.cbs.dtu.dk/services/NetNGlyc

http://www.cbs.dtu.dk/services/NetOGlyc

Prediction of Sulfation Sites

• Protein tyrosine sulfation is an important post-translational modification for proteins that go through the secretory pathway. It regulates several protein-protein interactions and modulates the binding affinity of TM peptide receptors

• Based on the rules described above, HMMs could be trained to build models for predicting protein sequences with patterns that abide by these rules

Sulfinator Algorithm (http://us.expasy.org/tools/sulfinator/)

• Sulfinator employs four different HMMs to recognize sulfated tyrosines located N-terminally (HMM-N), internally (HMM-I), C-terminally (HMM-C) and in Y-clusters (HMM-Y)

http://us.expasy.org/tools/sulfinator/

Prediction of Phosphorylation Sites (NetPhos, http://www.cbs.dtu.dk/services/NetPhos/)

• Protein kinases, a very large family of enzymes, catalyze phosphorylation

• NetPhos produces neural network predictions for serine (S), threonine (T) or tyrosine (Y) phosphorylation sites in eukaryotic proteins that affect a multitude of cellular signaling processes

• Y-kinase Phosphorylation

• S or T-Phosphorylation in Casein Kinase II

• Since these are very short patterns, the amino acids surrounding a phosphorylated residue are significant in determining whether a particular site is phosphorylated or not
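The role of the surrounding residues can be made concrete with a sketch of how sequence windows around candidate S, T and Y residues might be prepared as neural network input. The window size, the padding character and the one-hot encoding are assumptions made for illustration; the slides do not state how NetPhos encodes its input.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed padding for positions beyond the sequence ends


def candidate_windows(sequence, flank=4, targets="STY"):
    """Return (1-based position, window) for every S, T or Y in the sequence."""
    seq = sequence.upper()
    padded = PAD * flank + seq + PAD * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue in targets:
            # seq[i] sits at padded[i + flank], so this slice is centered on it.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def one_hot(window):
    """Encode a window as 20 binary inputs per position (padding is all zeros)."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Made-up toy sequence, for illustration only.
for position, window in candidate_windows("MASKTPEY", flank=2):
    print(position, window, sum(one_hot(window)))
```

Each encoded window, labeled as phosphorylated or not from experimental data, would form one training or test example.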

http://www.cbs.dtu.dk/services/NetPhos/

Standalone Tools

Local Installation of tools and databases

• NCBI-Toolkit

• Formatting and using BLAST

• CD-HIT

• CLUSTALW

• HMMER package

Current Trends in Bioinformatics

Cell

Structure, Function

Genomics, Transcriptomics, Proteomics

Metabolomics

Components Biology

Systems Biology

Reductionistic Approach

Integrative Approach

Highway network system in San Antonio
