Advanced Tools and Algorithms in Bioinformatics Chittibabu Guda Summer, 2004 UCSD Extension, Department of Biosciences

Advanced Tools and Algorithms

in Bioinformatics

Chittibabu Guda

Summer, 2004

UCSD Extension, Department of Biosciences

Today’s Topics

• Hidden Markov Models (HMMs)

• Predicting sub-cellular localization of proteins

• Predicting post-translational modification sites

• Using Standalone tools

• Current Trends in Bioinformatics

Hidden Markov Models

HMMs for biological sequences

• A hidden Markov model (HMM) is a statistical model that was developed largely for speech recognition.

• The most popular use of HMMs in molecular biology is as a ‘probabilistic profile’ of a protein family, which is called a profile HMM.

• HMMs are also used for multiple sequence alignment, gene prediction (ORF finding), and protein structure prediction.

• Advantages: statistically sound; no sequence ordering or gap penalties are required.

• Limitations: a large number of similar sequences is required to build good models.

Stochastic modeling of biological sequences

For example, a profile is a position-specific scoring matrix.

• Given this model, the probability of CGGSV is: 0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031

• Multiplying many small probabilities is prone to floating-point underflow, so the calculation is transformed into log space.

• The score is calculated by taking the logs of all amino acid probabilities and adding them up: ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
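The arithmetic above can be checked with a few lines of Python. This is only a sketch that reuses the five per-position probabilities quoted on the slide; the variable names are illustrative.

```python
import math

# Per-position probabilities of the residues C, G, G, S, V, as quoted above.
position_probs = [0.8, 0.4, 0.8, 0.6, 0.2]

# Direct product of the probabilities.
probability = math.prod(position_probs)

# Same quantity in log space: a sum of natural logs instead of a product.
log_score = sum(math.log(p) for p in position_probs)

print(f"P(CGGSV)  = {probability:.3f}")   # 0.031
print(f"log score = {log_score:.2f}")     # -3.48
```

Both numbers describe the same quantity: exponentiating the log score recovers the probability.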

Stochastic modeling of biological sequences

But with this expression it is not possible to distinguish between the highly implausible sequence TGCT--AGG and the consensus sequence ACAC--ATC.

The HMM architecture

• S-start; E-end

• m- main state (matches/mismatches)

• i - insert state

• d - delete state

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

Parameters used in HMM building

• Transition probability: Tij (average 0.333)

• Emission probability: Ei (average 0.05)

M N - F L S
M N - F L S
M N K Y L T
M Q - W - T

(State labels shown in the figure: m, i, d, m)

• Because these probabilities are very small numbers, they are converted to log-odds scores, which are added up to give the overall score
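A log-odds score compares each model probability with a background (null) probability before taking the log. The sketch below assumes a uniform background of 0.05 per amino acid (1/20, matching the average emission probability quoted above) and reuses the five CGGSV probabilities from the profile example; both choices are illustrative assumptions, and real tools use their own null models.

```python
import math

# Assumed null model: every amino acid equally likely (1/20).
BACKGROUND = 0.05

def log_odds_score(probs, background=BACKGROUND):
    """Sum of ln(p / background) over all positions."""
    return sum(math.log(p / background) for p in probs)

# Probabilities of C, G, G, S, V from the profile example.
score = log_odds_score([0.8, 0.4, 0.8, 0.6, 0.2])
print(f"log-odds score = {score:.2f}")
```

A positive score means the sequence fits the model better than the background; a negative score means the opposite.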

Markov modeling of biological sequences

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

Markov modeling of biological sequences

Sequence             P(s) * 100
A C A - - - A T G    3.3
T C A A C T A T C    0.0075
A C A C - - A G C    1.2
A G A - - - A T C    3.3
A C C G - - A T C    0.59
A C A C - - A T C    4.7

P(ACACATC) = 0.047, obtained by taking the product of the probabilities for the residues in each state and for the transitions.
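The product can be written out explicitly. The model figure from the slide did not survive in this transcript, so the individual emission and transition values below are an assumption: they are the values of the small DNA-motif model in Krogh's introductory HMM tutorial, which uses this same alignment and arrives at the same 0.047.

```python
import math

# Path of ACACATC through the model: (step, probability).
# All values are assumed (taken from Krogh's tutorial example),
# not read from the slide itself.
path = [
    ("emit A in main state 1", 0.8),
    ("transition 1 -> 2", 1.0),
    ("emit C in main state 2", 0.8),
    ("transition 2 -> 3", 1.0),
    ("emit A in main state 3", 0.8),
    ("transition 3 -> insert", 0.6),
    ("emit C in insert state", 0.4),
    ("transition insert -> 4", 0.6),
    ("emit A in main state 4", 1.0),
    ("transition 4 -> 5", 1.0),
    ("emit T in main state 5", 0.8),
    ("transition 5 -> 6", 1.0),
    ("emit C in main state 6", 0.8),
]

p_consensus = math.prod(value for _, value in path)
print(f"P(ACACATC) = {p_consensus:.3f}")
```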

Sequence Alignment and Database Search using HMMER

Multiple Alignment

Build a Profile HMM

Database search

Multiple alignments

Query against Profile HMM database

(PFAM database)

HMMSEARCH results (on a database of voltage-gated ion channel proteins)

PFAM http://pfam.wustl.edu

• Protein Family Database created using HMMs

• Pfam-A contains functionally annotated families (~7500)

• Pfam-B contains unannotated families (~107000)

• All protein sequences were clustered into families based on sequence identity

• For each family, non-redundant, full-domain seed members were selected to represent the family

• Seed multiple alignments were built using ClustalW and manual checking

• HMMs were built using hmmbuild (part of the HMMER suite of programs)

• Using these models, more family members were added iteratively: new members are added to the multiple alignment and the HMM is updated, until no more new members are found

How to build and use Profile HMMs

• Get a family of seed sequences in multiple alignment

• Build a Hidden Markov Model using hmmbuild

• Use HMM as a query to find remote homologues in the sequence database using hmmsearch

• Add new sequences to the seed alignment using hmmalign and update the model, iteratively

• Get the consensus sequence of the model using hmmemit

• Query HMM with new query sequences to find if the sequences are related to the Model using hmmpfam
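The steps above map onto a sequence of HMMER command lines. The sketch below only assembles and prints those commands; it does not run them. All file names are hypothetical, only positional arguments are shown (options are omitted on purpose), and the program names are the HMMER 2 names used on this slide. In HMMER 3, hmmpfam was replaced by hmmscan.

```python
import shlex

def hmmer_workflow(alignment, hmm, seq_db, new_seqs, hmm_db, query):
    """Return the HMMER command lines for the workflow, in order.

    Every argument is a (hypothetical) file path; nothing is executed.
    """
    return [
        ["hmmbuild", hmm, alignment],   # build a profile HMM from the seed alignment
        ["hmmsearch", hmm, seq_db],     # search a sequence database with the HMM
        ["hmmalign", hmm, new_seqs],    # align newly found sequences to the HMM
        ["hmmemit", hmm],               # emit sequences from the model
        ["hmmpfam", hmm_db, query],     # search query sequences against an HMM database
    ]

commands = hmmer_workflow(
    alignment="seed.aln", hmm="family.hmm", seq_db="proteins.fasta",
    new_seqs="hits.fasta", hmm_db="Pfam_ls", query="query.fasta",
)
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run a step, each list could be passed to `subprocess.run`, provided the corresponding HMMER binaries are installed.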

SledgeHMMER web server

• Accessible at http://SledgeHMMER.sdsc.edu

• The Pfam database is the largest protein functional domain database built with hidden Markov models

• This server provides quick access to pre-calculated Pfam results for 1.2 million protein sequences (the entire SP+TrEMBL databases)

• Query sequences are matched to stored sequences by comparing their MD5 hexadecimal digests, computed in Perl

• The web server is implemented with a Perl/CGI interface
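The idea of matching sequences by MD5 digest can be sketched as follows. The server itself is described as using Perl; this Python version is only an illustration, and the stored sequence, the domain identifier and the normalization step (stripping whitespace, upper-casing) are all assumptions.

```python
import hashlib

def sequence_key(sequence: str) -> str:
    """MD5 hex digest of a sequence after removing whitespace and upper-casing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# Hypothetical table of pre-calculated results, keyed by digest.
precalculated = {
    sequence_key("MKTAYIAKQR"): ["PF00001"],  # made-up sequence and domain hit
}

def lookup(sequence: str):
    """Return stored results for an identical sequence, or None if absent."""
    return precalculated.get(sequence_key(sequence))

print(lookup("mktay iakqr"))   # same sequence, different formatting
print(lookup("MKTAYIAKQQ"))    # one residue differs, so no stored result
```

Comparing fixed-length digests avoids comparing full sequences, but it only finds exact matches; a sequence that differs by a single residue needs a real search.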

Predicting sub-cellular localization of proteins

Different cellular compartments

(modified from Voet & Voet, Biochemistry; Weinheim, New York, Basel, Wiley-VCH 1992)

Methods to predict sub-cellular location

• Based on amino acid composition

• Based on signal or target peptides

  • PSORT

  • TargetP

• Based on domain occurrence patterns

  • MITOPRED

• Based on lexical analysis

Amino acid compositional differences in different sub-cellular locations

PSORT (http://psort.ims.u-tokyo.ac.jp/)

• PSORT program works based on a comprehensive knowledge of protein sorting

• Different parameters relevant to different groups of species are determined

• Bacterial sequences

  • N-terminal signal sequence (Positive-H region)/cleavage site

  • Transmembrane segments

  • Lipoprotein analysis

  • Amino acid composition

PSORT continued …

• Eukaryotic sequences (Yeast/Animal/Plant)

  • N-terminal signal sequence (Positive-H region)/cleavage site

  • Transmembrane segments and membrane topology

  • Mitochondrial targeting signals and AAC of the N-terminal 20 amino acids

  • Nuclear localization signals (NLS)

  • Peroxisome matrix targeting sequences (PTSs): (S/A/C)(K/R/H)L

  • Chloroplast targeting signals

  • Endoplasmic reticulum signals (KDEL, or HDEL in yeast)

  • Vesicular, lysosomal, vacuolar proteins, etc.
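Two of the signals listed for PSORT are short C-terminal motifs that are easy to express as regular expressions: the peroxisomal pattern (S/A/C)(K/R/H)L and the ER signals KDEL/HDEL. The sketch below is not PSORT's implementation; it assumes both motifs must sit at the very end of the sequence, and the test sequences are made up.

```python
import re

# Assumed to be anchored at the C-terminus ($).
PTS1 = re.compile(r"[SAC][KRH]L$")        # peroxisome matrix targeting signal
ER_RETENTION = re.compile(r"[KH]DEL$")    # KDEL, or HDEL in yeast

def c_terminal_signals(sequence: str) -> list[str]:
    """Return the names of the C-terminal motifs found in the sequence."""
    seq = sequence.upper()
    found = []
    if PTS1.search(seq):
        found.append("peroxisomal PTS1")
    if ER_RETENTION.search(seq):
        found.append("ER retention")
    return found

print(c_terminal_signals("MAAAASKL"))   # ends in SKL
print(c_terminal_signals("MAAAKDEL"))   # ends in KDEL
print(c_terminal_signals("MKDELAAA"))   # KDEL present, but not at the C-terminus
```

Such short patterns match many unrelated proteins by chance, which is one reason tools combine them with other evidence.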

MITOPRED (http://mitopred.sdsc.edu)

• A new method based on Pfam domain occurrence patterns and on differences in amino acid composition (AAC) and pI value between mitochondrial and non-mitochondrial proteins

• Eukaryotic cells have multiple compartments, and each set of pathways is localized to a specific compartment. A protein family involved in a specific pathway is therefore expected in a specific compartment

• A knowledge base is developed by studying the occurrence and co-occurrence patterns of different Pfam domains in different cellular compartments

• The method compares the Pfam domains found in the query sequence against the knowledge base and assigns a score depending on which compartment they belong to

• Independent scores are calculated from the AAC and pI values of the query sequence by comparing them to the average values in different locations

• The final prediction is based on the combined AAC, pI and Pfam scores
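The amino acid composition part of such a comparison can be sketched in a few lines. The slides do not give MITOPRED's actual scoring scheme, so the distance measure below (sum of absolute differences) is an assumption, and the two sequences in the example are made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence: str) -> dict[str, float]:
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def composition_distance(comp_a, comp_b) -> float:
    """Assumed measure: sum of absolute frequency differences."""
    return sum(abs(comp_a[aa] - comp_b[aa]) for aa in AMINO_ACIDS)

query = aa_composition("MKRLLSAARR")        # made-up query
reference = aa_composition("MKKLLSAARK")    # made-up location average
print(f"distance = {composition_distance(query, reference):.2f}")
```

In a full method, the query would be compared with the average composition of each location and the resulting score combined with the pI and Pfam scores.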

Comparison of AA composition across mitochondrial and cytoplasmic sequences

[Figure: bar chart of relative frequencies (0 to 0.1) of each residue (A C D E F G H I K L M N P Q R S T V W Y) in M-sol and C-sol sequences, with groups of residues marked "More in Cytoplasmic" and "More in Mitochondrial"]

pI value differences in different sub-cellular locations

[Figure: bar chart of pI value (0 to 10) by cellular location: CYT, MIT, NUC, END, EXC, GOL, PLA, POX]

Flowchart showing MITOPRED procedure

MITOPRED Web Server

• Accessible at http://mitopred.sdsc.edu

• Implemented using PERL/CGI interface

• Pre-calculated predictions are available for all eukaryotic proteins from the Swiss-Prot and TrEMBL databases (~500000)

• Genome-scale predictions can be downloaded for yeast, C. elegans, Drosophila, human, mouse and Arabidopsis

• Provides data for the Mitoproteome database accessible at http://www.mitoproteome.org

Prediction of sub-cellular location by lexical analysis

• Separate SP proteins into different sub-cellular classes based on annotation

• In each class, extract all unique keywords for each sequence

• The total number of unique keywords across all classes defines the size of the feature space (N)

• Generate a binary vector of length N for each sequence in each class: 1 if the keyword is present and 0 if it is absent.

• For the unknown protein, generate a binary vector in the same way, based on its keywords. From this, generate 2^k - 1 sub-vectors (where k is the number of keywords in the unknown) by flipping 1s to 0s.

• Based on the sub-vectors, retrieve all proteins with matching binary vectors from all classes.

• The unknown is assigned to the class that contributes the largest number of sequences to the retrieved group.

• This method works better when more keywords are available and the family is larger.
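The sub-vector step can be illustrated with sets of keywords, which are equivalent to binary vectors (a keyword in the set corresponds to a 1). This is only a sketch of the steps listed above, not the published implementation; the keywords, proteins and class labels are made up, and ties between classes are not handled.

```python
from collections import Counter
from itertools import combinations

def keyword_subsets(keywords):
    """All 2^k - 1 non-empty subsets of the unknown's keywords."""
    unique = sorted(set(keywords))
    for size in range(len(unique), 0, -1):
        for combo in combinations(unique, size):
            yield frozenset(combo)

def predict_location(unknown_keywords, annotated):
    """Vote over annotated proteins whose keyword set equals a sub-vector.

    annotated: list of (frozenset_of_keywords, class_label) pairs.
    """
    votes = Counter()
    for subset in keyword_subsets(unknown_keywords):
        for keywords, label in annotated:
            if keywords == subset:
                votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical annotated proteins.
annotated = [
    (frozenset({"kinase", "nuclear"}), "NUC"),
    (frozenset({"nuclear"}), "NUC"),
    (frozenset({"kinase"}), "CYT"),
]
prediction = predict_location(["kinase", "nuclear", "zinc"], annotated)
print(prediction)  # NUC: two matching proteins versus one for CYT
```

Because the number of sub-vectors grows as 2^k, this brute-force enumeration is only practical for proteins with a modest number of keywords.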

Flow diagram of lexical analysis method (from Nair R, Rost B, Bioinformatics 18:S78-S86, 2002)

Predicting Post-translational Modification Sites of Proteins

General Method for PTM site Prediction

• PROSITE provides consensus patterns for many PTM sites; however, in most cases these patterns are very short, and whether a modification actually occurs depends on the structural or environmental context in the protein fold

• For this reason, methods based on regular expressions or local alignment produce a large number of false positives

• Almost all methods used in PTM site prediction use artificial neural networks (ANNs).

• General procedure:

  • Prepare a dataset of sequences experimentally known to possess a given type of PTM site

  • Separate the dataset into training and testing data

  • Train a network using the training data and test it with the test set. This process is iterated until the model is well refined

• A sufficient number of training sequences and good-quality data are important for the success of any neural network method
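The "separate the dataset" step of the general procedure can be sketched as a reproducible random split. The 20% test fraction, the fixed seed and the toy examples are assumptions for illustration; the slides do not say how the published tools split their data.

```python
import random

def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and return (training, testing) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for a repeatable split
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled sites: (sequence window, is_modified).
sites = [(f"window{i}", i % 2 == 0) for i in range(10)]
training, testing = split_dataset(sites)
print(len(training), "training examples,", len(testing), "test examples")
```

Keeping the test examples out of training is what makes the measured accuracy a fair estimate for unseen sequences.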

Different Post-translational modifications (PTMs)

• Glycosylation

  • Asn (N)-glycosylation (NetNGlyc)

  • O-glycosylation (NetOGlyc)

• Sulfation (Sulfinator)

• Phosphorylation (NetPhos)

• Myristoylation (NMT)

Prediction of Glycosylation Sites (NetNGlyc, NetOGlyc)

• Glycoproteins are formed by covalent attachment of oligosaccharides to certain proteins at Asn (N-glycosylation) or at Ser or Thr (O-glycosylation) residues.

• These are usually exported to extra-cellular destinations, like mucin in the alimentary tract or the glycoprotein hormones in the anterior pituitary gland.

• N-glycosylation

• O-glycosylation

  • No consensus pattern

  • SEA domain is associated with it
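For N-glycosylation, a simple pattern scan shows why pattern matching alone over-predicts. The pattern used below is not given on the slide; it is the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (Asn, any residue except Pro, Ser or Thr, any residue except Pro), and the test sequence is made up.

```python
import re

# PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}.
# The lookahead lets overlapping candidate sites all be reported.
SEQUON = re.compile(r"N(?=[^P][ST][^P])")

def candidate_n_glyc_sites(sequence: str) -> list[int]:
    """1-based positions of Asn residues that match the pattern."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]

# N at position 2 (NGSA) and position 10 (NATW) match;
# N at position 6 is followed by Pro and does not.
print(candidate_n_glyc_sites("MNGSANPTQNATW"))  # [2, 10]
```

Every position returned is only a candidate: the pattern says nothing about structural context, which is what the neural network predictors try to capture.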

Prediction of Sulfation Sites

• Protein tyrosine sulfation is an important post-translational modification for proteins that go through the secretory pathway. It regulates several protein-protein interactions and modulates the binding affinity of TM peptide receptors

• Based on the rules described above, HMMs can be trained to build models that predict protein sequences with patterns that abide by these rules

Sulfinator Algorithm (http://us.expasy.org/tools/sulfinator/)

• Sulfinator employs four different HMMs to recognize sulfated tyrosines located N-terminally (HMM-N), internally (HMM-I), C-terminally (HMM-C) and in Y-clusters (HMM-Y)

Prediction of Phosphorylation Sites (NetPhos, http://www.cbs.dtu.dk/services/NetPhos/)

• Protein kinases, a very large family of enzymes, catalyze phosphorylation

• NetPhos produces neural network predictions for serine (S), threonine (T) or tyrosine (Y) phosphorylation sites in eukaryotic proteins that affect a multitude of cellular signaling processes

• Y-kinase phosphorylation

• S or T phosphorylation by casein kinase II

• Since these are very short patterns, the amino acids surrounding a phosphorylated residue are significant in determining whether a particular site is phosphorylated or not
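Because the surrounding residues matter, predictors typically work on a window of sequence centred on each candidate residue. The sketch below extracts such windows for every S, T and Y; the flank size of 4 and the padding character X are assumptions, not NetPhos parameters.

```python
def candidate_windows(sequence: str, residues: str = "STY",
                      flank: int = 4, pad: str = "X"):
    """Return (1-based position, residue, window) for each candidate residue.

    Windows near the termini are padded so that all have the same length.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, aa in enumerate(seq):
        if aa in residues:
            # In the padded string the residue sits at index i + flank,
            # so the window starts at index i.
            windows.append((i + 1, aa, padded[i:i + 2 * flank + 1]))
    return windows

for position, residue, window in candidate_windows("MAGSPEKLYAR"):
    print(position, residue, window)
```

Each window would then be encoded numerically and passed to the trained network, which scores the central residue.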

Standalone Tools

Local Installation of tools and databases

• NCBI-Toolkit

• Formatting and using BLAST

• CD-HIT

• CLUSTALW

• HMMER package

Current Trends in Bioinformatics

[Figure labels: Cell; Structure; Function; Genomics; Transcriptomics; Proteomics; Metabolomics; Components Biology (Reductionistic Approach); Systems Biology (Integrative Approach)]

Highway network system in San Antonio