Tools to analyze protein characteristics
description
Transcript of Tools to analyze protein characteristics
Tools to analyze protein characteristics
Protein sequence
-Family member-Multiple alignments
Identification of conserved regions
Evolutionary relationship (Phylogeny)
3-D fold model
Protein sorting and sub-cellular localization
Anchoring into the membrane
Signal sequence (tags)
Protein modifications
Some nascent proteins contain a specific signal, or targeting sequence that directs them to the correct organelle. (ER, mitochondrial, chloroplast, lysosome, vacuoles, Golgi, or cytosol)
Can we train the computers:To detect signal sequences and predict protein destination?To identify conserved domains (or a pattern) in proteins?To predict the membrane-anchoring type of a protein? (Transmembrane domain, GPI anchor…)To predict the 3D structure of a protein?
Learning algorithms are good for solving problems in pattern recognition because they can be trained on a sample data set.
Classes of learning algorithms:-Artificial neural networks (ANNs)-Hidden Markov Models (HMM)
Questions
Artificial neural networks (ANN)
Machine learning algorithms that mimic the brain. Real brains, however, are orders of magnitude more complex than any ANN.
ANNs, like people, learn by example. ANNs cannot be programmed to perform a specific task.
ANN is composed of a large number of highly interconnected processing elements (neurons) working simultaneously to solvespecific problems.
The first artificial neuron was developed in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pits.
Hidden Markov Models (HMM)
HMM is a probabilistic process over a set of states, in which the states are “hidden”. It is only the outcome that visible to the observer. Hence, the name Hidden Markov Model.
HMM has many uses in genomics:Gene prediction (GENSCAN)SignalPFinding periodic patterns
Used to answer questions like:What is the probability of obtaining a particular outcome?What is the best model from many combinations?
Expasy server (http://au.expasy.org) is dedicated to the analysis of protein sequences and structures.
The ExPASy (Expert Protein Analysis System)
Sequence analysis tools include: DNA -> Protein [Translate] Pattern and profile searches Post-translational modification and topology prediction Primary structure analysis Structure prediction (2D and 3D) Alignment
PredictProtein: A service for sequence analysis, and structure prediction http://www.predictprotein.org/newwebsite/submit.html
TMpred: http://www.ch.embnet.org/software/TMPRED_form.html
TMHMM: Predicts transmembrane helices in proteins (CBS; Denmark) http://www.cbs.dtu.dk/services/TMHMM-2.0/
big-PI : Predicts GPI-anchor site:http://mendel.imp.univie.ac.at/sat/gpi/gpi_server.html
DGPI: Predicts GPI-anchor site: http://129.194.185.165/dgpi/index_en.html
SignalP: Predicts signal peptide: http://www.cbs.dtu.dk/services/SignalP/
PSORT: Predicts sub-cellular localization: http://www.psort.org/
TargetP: Predicts sub-cellular localization: http://www.cbs.dtu.dk/services/TargetP/
NetNGlyc: Predicts N-glycosylation sites:http://www.cbs.dtu.dk/services/NetNGlyc/
PTS1: Predicts peroxisomal targeting sequences http://mendel.imp.univie.ac.at/mendeljsp/sat/pts1/PTS1predictor.jsp
MITOPROT: Predicts of mitochondrial targeting sequences http://ihg.gsf.de/ihg/mitoprot.html
Hydrophobicity: http://www.vivo.colostate.edu/molkit/hydropathy/index.html
http://www.cbs.dtu.dk/services/: prediction server
NetNGlyc: Predicts N-glycosylation sites: http://www.cbs.dtu.dk/services/NetNGlyc/NetPhos: Predicts phosphorylation of residues: http://www.cbs.dtu.dk/services/NetPhos/NetPhosK: Predicts recognition sites for specific kinases: http://www.cbs.dtu.dk/services/NetPhosK/NetAcet: N-terminal acetylation in eukaryotic proteins: http://www.cbs.dtu.dk/services/NetAcet/NetCGlyc: C-mannosylation sites in mammalian proteins
Multiple alignment
Used to do phylogenetic analysis:Same protein from different speciesEvolutionary relationship: history
Used to find conserved regionsLocal multiple alignment reveals conserved regionsConserved regions usually are key functional regionsThese regions are prime targets for drug developmentsProtein domains are often conserved across many species
Algorithm for search of conserved regions: Block maker: http://blocks.fhcrc.org/blocks/make_blocks.html
Multiple alignment tools
Free programs: Phylip and PAUP: http://evolution.genetics.washington.edu/phylip.html Phyml: http://atgc.lirmm.fr/phyml/
The most used websites : http://align.genome.jp/ http://prodes.toulouse.inra.fr/multalin/multalin.html http://www.ch.embnet.org/index.html (T-COFFEE and ClustalW)
ClustalW: Standard popular software It aligns 2 and keep on adding a new sequence to the alignment Problem: It is simply a heuristics.
Motif discovery: use your own motif to search databases: PatternFind: http://myhits.isb-sib.ch/cgi-bin/pattern_search
http://meme.nbcr.net/meme4_6_0/intro.html
Phylogenetic analysis
Phylogenetic treesDescribe evolutionary relationships between sequences
Major modes that drive the evolution: Point mutations modify existing sequences Duplications (re-use existing sequence) Rearrangement
Two most common methodsMaximum parsimonyMaximum likelihood
http://www.megasoftware.net/mega4/m_con_select.html
The most useful software:
Definitions Homologous:Have a common ancestor. Homology cannot be measured.
Orthologous: The same gene in different species . It is the result of speciation (common ancestral)
Paralogous: Related genes (already diverged) in the same species. It is
the result of genomic rearrangements or duplication
Determining protein Structure-Function
Direct measurement of structureX-ray crystallographyNMR spectroscopy
Site-directed mutagenesis
Computer modelingPrediction of structureComparative protein-structure modeling
Comparative protein-structure modeling
Goal:Construct 3-D model of a protein of unknown structure (target), based on similarity of sequence to
proteins of known structure (templates)
Blue: predicted model by PROSPECT
Red: NMR structure
Procedure:Template selectionTemplate–target alignmentModel buildingModel evaluation
The Protein 3-D Database
The Protein DataBase (PDB) contains 3-D structural data for proteins
Founded in 1971 with a dozen structures
As of June 2004, there were 25,760 structures in the database. All structures are reviewed for accuracy and data uniformity.
Structural data from the PDB can be freely accessed at http://www.rcsb.org/pdb/
80% come from X-ray crystallography16% come from NMR2% come from theoretical modeling
High-throughput methods
Most used websites for 3-D structure prediction
Protein Homology/analogY Recognition Engine (Phyre) at http://www.sbg.bio.ic.ac.uk/phyre/html/index.html
PredictProtein at http://www.predictprotein.org/newwebsite/submit.html
UCLA Fold Recognition at http://www.doe-mbi.ucla.edu/Services/FOLD/
Commercial bioinformatics softwares
CLC Genomics Workbench
Genomics:
454, Illumina Genome Analyzer and SOLiD sequencing data; De novo assembly of genomes of any size; Advanced visualization, scrolling, and zooming tools; SNP detection using advanced quality filtering;
Transcriptomics:
RNA-seq including paired data and transcript-level expression; Small RNA analysis; Expression profiling by tags;
Epigenetics:
Chromatin immunoprecipitation sequencing (ChIP-seq) analysis; Peak finding and peak refinement; Graph and table of background distribution; false discovery rate; Peak table and annotations;
VectorNTI:
Sequence analysis and illustration; restriction mapping; recombinant molecule design and cloning; in silico gel electrophoresis; synthetic biology workflows
AlignX:
BioAnnotator:
ContigExpress:
GenomBench
The bioinformatics not covered in this class
Comparative genomics and Genome browser:http://genome.lbl.gov/vista/index.shtmlhttp://www.sanger.ac.uk/resources/software/artemis/
Genome annotation:http://linux1.softberry.com/berry.phtmlhttp:// rast.nmpdr.org/
Metagenomics:http://metagenomics.anl.gov/
System biology tools.