The Bioinformatics of Protein Modification (Part 2). Lecture 4610, Universität Basel, http://www.biozentrum.unibas.ch/lectures.html. Dr. Michael Rebhan, Friedrich Miescher Institute, Basel, January 2006, www.fmi.ch


Page 1

The Bioinformatics of Protein Modification (Part 2)

Lecture 4610, Universität Basel

http://www.biozentrum.unibas.ch/lectures.html

Dr. Michael Rebhan, Friedrich Miescher Institute, Basel, January 2006

www.fmi.ch

Page 2

The Bioinformatics of Protein Modification, Michael Rebhan, FMI, 2006

Part 2

1. Introduction: what role does bioinformatics play?

2. Mining information related to protein modifications
- known modifications
- finding proteins with particular modifications

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function
- mutation effects

5. Online Materials: Exercises, Links

Page 3

Predicting modification sites:

Building Your Own Motif:

1. Building the data set
2. Alignment
3. Analysis of the alignment
4. Motif building & search

Page 4

Predicting modification sites: Building your own motif

Collect all relevant sequences: your own + public

- SRS @ ExPASy: SWISSPROT
- Specialized datasets? → online materials (PubMed, Google)

Keep in mind:
- How reliable is the data? (Is there direct evidence?)
- The sequence environment around the main motif is important (see part 1) → it can reduce the false positive rate

Eisenhaber et al. (2004) Proteomics 4, 1614-1625. Prediction of sequence signals for lipid post-translational modifications: Insights from case studies

Page 5

Predicting modification sites: Building your own motif

Collect all relevant sequences: your own + public

- SRS @ ExPASy: SWISSPROT

Example: “C-linked (man)” in the “feature descriptions” (= C-mannosylation)

→ keep only those with direct experimental evidence! (Is the dataset large and diverse enough?)
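The filtering step above can be sketched in a few lines of Python. This is only an illustration, assuming the hits have already been exported into simple records. The record layout and the entry names are made up, and the list of qualifiers reflects the "By similarity" / "Potential" / "Probable" annotations that Swiss-Prot feature descriptions of that period used for features without direct experimental evidence.

```python
# Qualifiers that Swiss-Prot feature descriptions carried when the
# feature was not backed by direct experimental evidence.
NON_EXPERIMENTAL = ("by similarity", "potential", "probable")

def has_direct_evidence(description):
    """Return True if the description carries no non-experimental qualifier."""
    text = description.lower()
    return not any(qualifier in text for qualifier in NON_EXPERIMENTAL)

def select_features(features, keyword="c-linked (man)"):
    """Keep features that match the keyword and look experimentally supported."""
    return [
        f for f in features
        if keyword in f["description"].lower()
        and has_direct_evidence(f["description"])
    ]

# Made-up records for illustration only (not real Swiss-Prot entries).
features = [
    {"entry": "PROT1_HUMAN", "position": 29, "description": "C-linked (Man)."},
    {"entry": "PROT1_HUMAN", "position": 32, "description": "C-linked (Man)."},
    {"entry": "PROT2_MOUSE", "position": 41, "description": "C-linked (Man) (By similarity)."},
    {"entry": "PROT3_HUMAN", "position": 77, "description": "N-linked (GlcNAc...) (Potential)."},
]

training = select_features(features)
print(len(training), "of", len(features), "features kept")
```

A keyword match alone is not enough: the third record mentions C-mannosylation but is only inferred by similarity, so it is left out of the training set.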

Page 6

Predicting modification sites: Building your own motif

Example: “C-linked (man)” in the “feature descriptions”

The features look OK → the query is OK (no predictions etc.)

Now get more information, including the sequence environment.

Page 7

Predicting modification sites: Building your own motif

Example: “C-linked (man)” in the “feature descriptions”

Back to the query form:

Retrieve the entry instead of the feature, and display key fields in the output.

Page 8

Predicting modification sites: Building your own motif

Example: “C-linked (man)” in the “feature descriptions”

Why only 11 entries, when we had 49 features before?

(Each entry (= protein) can carry a number of features (= modifications).)

Click on the entry link… (if you’d like to include this protein)
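The relation between the feature count and the entry count can be made concrete with a small sketch. The (entry, position) pairs below are invented and much smaller than the 49 features and 11 entries of the real query; only the grouping logic is the point.

```python
from collections import Counter

# Hypothetical (entry, position) pairs standing in for a feature-level hit list.
feature_hits = [
    ("PROT1_HUMAN", 29), ("PROT1_HUMAN", 32), ("PROT1_HUMAN", 35),
    ("PROT2_MOUSE", 41),
    ("PROT3_HUMAN", 77), ("PROT3_HUMAN", 80),
]

# Count how many features each entry (protein) carries.
features_per_entry = Counter(entry for entry, _ in feature_hits)

print(len(feature_hits), "features in", len(features_per_entry), "entries")
for entry, count in features_per_entry.most_common():
    print(entry, count)
```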

Page 9

Predicting modification sites: Building your own motif

Collect all relevant sequences: your own + public

1. Find the features you’d like to include in the data set (“training set”)
2. Click on each feature’s position to get the sequence context
3. Build the alignment in FASTA format (by copy & paste, if it’s a small set)
4. Import it into an alignment viewer (like Jalview, www.jalview.org)
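For larger sets, steps 2 and 3 can be scripted instead of done by copy & paste. The sketch below is one possible way to do it, not the procedure from the lecture: `site_window`, `to_fasta`, the flank size of 5 and the record names are all assumptions. The sequence is the demo sequence used later in these slides, and the two tryptophan positions are chosen only for illustration (the slides do not state which residues are modified).

```python
def site_window(sequence, position, flank=5, pad="-"):
    """Return the residues around a 1-based position, padded at the termini."""
    index = position - 1
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return left.rjust(flank, pad) + sequence[index] + right.ljust(flank, pad)

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

demo = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
sites = [29, 32]  # illustrative positions only

records = [(f"demo_W{pos}", site_window(demo, pos)) for pos in sites]
print(to_fasta(records), end="")
```

Because every window has the same length and the site sits in the middle column, the resulting FASTA file is already an ungapped alignment that an alignment viewer can open directly.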

Page 10

Predicting modification sites: Building your own motif

Analysis of the alignment / data set:
- Any corrections needed, especially gaps?
- Is it large/diverse enough?
- Sorting, try different color views

In Jalview: by conservation:
- Which positions show clear constraints? → motif boundaries

Other constraints:
- Conserved? (“BLAST”)
- Secondary structure, accessibility? (Quick2D, SABLE)

… see part 1

Color: Zappo
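The question "which positions show clear constraints?" can also be asked numerically. The sketch below uses per-column Shannon entropy as a simple stand-in; it is not the conservation score that Jalview computes. The toy alignment and the entropy cutoffs are assumptions, and gap characters would simply be counted as one more symbol.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conserved_columns(alignment, max_entropy=0.5):
    """Return 0-based indices of columns whose entropy is at or below the cutoff."""
    columns = zip(*alignment)
    return [i for i, col in enumerate(columns) if column_entropy(col) <= max_entropy]

# Toy ungapped alignment, invented for illustration.
alignment = [
    "AYWTQWTS",
    "GSWSGWSA",
    "TFWAECGD",
    "SAWTPWTE",
]

print(conserved_columns(alignment))                   # strict cutoff
print(conserved_columns(alignment, max_entropy=1.0))  # more permissive cutoff
```

The first and last low-entropy columns give a rough idea of where the motif boundaries might lie; with only a handful of sequences the estimate is noisy, which is one reason the size and diversity of the data set matter.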

Page 11

Predicting modification sites: Building your own motif

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

Regular expressions:

[WDMLYSFHQ]-[TGSAYF]-[QSGCTNEPA]-W-[TGSAI]-[SCGPTVEDQ]-[CW]-[SGEDRANTF]

or: W-X-X-[CW] (in an S-rich environment)

→ could be useful, but doesn’t impose a lot of constraints (and there is no scoring…)

If you’d like to use it anyway, you can scan protein databases with this motif at ScanProsite (ExPASy)…
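To try such a pattern locally before going to ScanProsite, it can be translated into an ordinary regular expression. The converter below is a minimal sketch that only handles the basic PROSITE elements (residues, `[...]` sets, `{...}` exclusions, `x`, and `(n)` or `(n,m)` repeats); the function names are made up, and anchors such as `<` and `>` are not supported. A lookahead is used so that overlapping matches are reported as well.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repetition suffix such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core in ("x", "X"):
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return 1-based start positions of all matches, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(sequence)]

demo = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
short_pattern = "W-X-X-[CW]"
long_pattern = "[WDMLYSFHQ]-[TGSAYF]-[QSGCTNEPA]-W-[TGSAI]-[SCGPTVEDQ]-[CW]-[SGEDRANTF]"

print(prosite_to_regex(short_pattern))
print(find_sites(short_pattern, demo))
print(find_sites(long_pattern, demo))
```

On the demo sequence both patterns report two overlapping hits. A plain `re.finditer` without the lookahead would consume the second tryptophan as part of the first match and miss the second hit.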

Page 12

Predicting modification sites: Building your own motif

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

ScanProsite:

→ enter the pattern and the options

Page 13

Predicting modification sites: Building your own motif

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

ScanProsite results:

More: online materials

Page 14

Predicting modification sites: Building your own motif

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST!
- support vector machines (SVMs)

Search with the alignment using PSI-BLAST, e.g. at the Bioinformatics Toolkit (MPI Tuebingen) → PSSM profile (see part 1)
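To show what a PSSM adds over a regular expression (a score for every window), here is a deliberately simplified sketch. It is not how PSI-BLAST builds its profile: it assumes a uniform background, a flat pseudocount and no sequence weighting, and the training alignment is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds scores per column against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_windows(pssm, sequence):
    """Score every ungapped window; return (1-based start, score), best first."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        hits.append((start + 1, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy ungapped training alignment, invented for illustration.
alignment = ["WTQW", "WSGW", "WAEC", "WTPW"]
demo = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"

pssm = build_pssm(alignment)
for start, score in score_windows(pssm, demo)[:3]:
    print(start, round(score, 2))
```

Unlike the pattern match, every window gets a score, so hits can be ranked and a cutoff can be chosen after looking at where the known sites fall.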

Page 15

Predicting modification sites: Building your own motif

First search against SWISSPROT to check which proteins get the highest scores

→ E-value: 1000, ungapped alignment (a permissive cutoff, since short motifs rarely reach significant E-values)

“Validation” / filtering:
- Quick2D: secondary structure, disorder
- conservation (?)

Also: ScanSite (MIT)! (enhanced regular expressions and PSSM search)

Page 16

Predicting modification sites: Building your own motif

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

SVMs: training data → function (classification / regression)

AutoMotif server (using SVMs)

Need:
- reformat sequences (with a simple replace, e.g. in WordPad)
- register at the AutoMotif site (immediate)
- submit reformatted alignment & search

For classification, SVMs operate by finding a hypersurface in the space of possible inputs. This hypersurface attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples.
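For the simplest case (a linear separator and perfectly separable data), the "largest distance" idea is usually written as the optimization problem below. This is the standard textbook formulation, not something taken from the slides or from the AutoMotif server; the symbols are introduced here: x_i is the numeric encoding of training example i, y_i is +1 for a positive and -1 for a negative example, and w and b define the separating hyperplane.

```latex
\begin{aligned}
\min_{\mathbf{w},\,b}\quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to}\quad & y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n \\[4pt]
\text{prediction:}\quad & f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}\cdot\mathbf{x} + b)
\end{aligned}
```

The nearest examples on either side lie at distance 1/‖w‖ from the hyperplane, so minimizing ‖w‖ maximizes that distance. Kernels and soft margins extend the same idea to curved hypersurfaces and to data that cannot be separated cleanly.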

Page 17

Predicting modification sites: Building your own motif

My dataset is very small and not very diverse. Anything I can do?

Collecting & aligning orthologs:

1. Check SWISSPROT for “by similarity” features and, if that’s not enough, use myHits (SIB) to collect orthologs with considerable variation

(include lots of flanking sequence; use 90% identity clustering; search against SWISSPROT [and Ensembl]; with E values 1e-6 and 0.01, select clear hits, then “next cycle”, then align trustworthy hits)

2. Trim the alignment in Jalview (e.g. in myHits), sort by pairwise identity.

Demo with MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF
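The two identity-based operations mentioned above (sorting by pairwise identity and removing near-identical sequences) can be illustrated with a small sketch. It is not the clustering that myHits or Jalview perform: the greedy filter, the way gaps are ignored, and the toy orthologs are all assumptions.

```python
def percent_identity(a, b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def sort_by_identity(reference, aligned):
    """Sort (name, sequence) records by identity to the reference, highest first."""
    return sorted(aligned, key=lambda rec: percent_identity(reference, rec[1]),
                  reverse=True)

def drop_near_duplicates(aligned, threshold=90.0):
    """Greedy filter: keep a record only if it is below the threshold to all kept ones."""
    kept = []
    for name, seq in aligned:
        if all(percent_identity(seq, other) < threshold for _, other in kept):
            kept.append((name, seq))
    return kept

# Toy aligned orthologs, invented for illustration.
reference = "CDHYAWTQWTSCSK"
orthologs = [
    ("orthologA", "CDHYAWTQWTSCSK"),
    ("orthologB", "CEHFAWSQWSSCSR"),
    ("orthologC", "CDHYAWTQWTSCSR"),
    ("orthologD", "SEKFGWSEW-ECTR"),
]

ranked = sort_by_identity(reference, orthologs)
for name, seq in ranked:
    print(name, round(percent_identity(reference, seq), 1))
print([name for name, _ in drop_near_duplicates(ranked)])
```

Removing sequences that are at least 90% identical to one already kept stops a group of almost identical orthologs from dominating the motif, while the more divergent ones show which positions are actually constrained.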

Page 18

Predicting modification sites: Building your own motif

Do all these orthologs still carry the same modification?

→ experiments!

Search: PSI-BLAST at MPI (as before)

(this example: 2 C-mannosylation sites next to each other)

Which residues are conserved?

Page 19

Predicting modification sites: Building your own motif

If there are no substrates at all: anything I can do?

Do you have a kinase, by chance?

→ PREDIKIN: potential substrates for different kinds of kinases, based on sequence and type

→ ideas for experiments …

Brinkworth et al. (2003) PNAS 100:74

Page 20

Predicting modification sites: Building your own motif

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

Need advice?

Ask a protein sequence analysis expert

Page 21

SUMMARY: Building your own motif

• Building your own motif is not as hard as you may think

• The main issue: building a good and informative alignment!

• Motif building & search:
  • Regular expressions: ScanProsite
  • PSSMs: PSI-BLAST at MPI
  • SVMs: AutoMotif

Page 22

Overview

1. Introduction: what role does bioinformatics play?

2. Mining information related to protein modifications
- known modifications
- finding proteins with particular modifications

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function prediction
- mutation effects

5. Online Materials: Exercises, Links

Page 23

Protein Function Prediction:

Predicting modifications in the context of function prediction

Also:

- Protein isoforms and the prediction of modifications

- Interpretation of potential modifications, e.g. phospho-sites

Page 24

Protein function prediction:

Predicting modifications in the context of function prediction

What can be (reliably) predicted from the sequence alone?

• Domain architecture (and signal peptides):
  → potential molecular interactions
  → proteins with similar domain architecture

• Tertiary or secondary structure, disorder & accessibility

• Small motifs: targeting, modifications, transmembrane regions, coiled coils

• Genomic context & phylogenetic occurrence: hints on “functional interactions”

• New predictions are coming out all the time …

MARRSVLYFI LLNALINKGQ ACFCDHYAWT QWTSCSKTCN SGTQSRHRQI VVDKYYQENF CEQICSKQET RECNWQRCPI NCLLGDFGPW SDCDPCIEKQ SKVRSVLRPS QFGGQPCTEP

Page 25

Protein function prediction: our sequence, alternative transcripts

How good/complete is the protein sequence we want to check?
- Is the sequence itself reliable?
- Is it as complete as we think?
- Alternative transcripts?

→ Quick check: BLAT at UCSC

In this example (translated ORF):
- some exons are missing! (alternatively spliced)
- an alternative TSS exists

→ pick a better sequence! (maybe run the predictions on both & compare)

Page 26

Protein function prediction:

Predicting modifications in the context of function prediction

Domain architecture, signal peptide & low complexity regions: PFAM, Interpro
→ molecular interactions (if you’re lucky), e.g. RNA-binding
→ proteins with similar domain architecture (or composition): PFAM, SMART

[Figure labels: low complexity; signal peptide]

Page 27

Protein function prediction: Predicting modifications in the context of function prediction

MARRSVLYFI LLNALINKGQ ACFCDHYAWT QWTSCSKTCN SGTQSRHRQI VVDKYYQENF CEQICSKQET RECNWQRCPI NCLLGDFGPW SDCDPCIEKQ SKVRSVLRPS QFGGQPCTEP

Page 28

Protein function prediction: Predicting modifications in the context of function prediction

MARRSVLYFI LLNALINKGQ ACFCDHYAWT QWTSCSKTCN SGTQSRHRQI VVDKYYQENF CEQICSKQET RECNWQRCPI NCLLGDFGPW SDCDPCIEKQ SKVRSVLRPS QFGGQPCTEP

Small motifs: targeting, modifications, transmembrane regions

• Modifications → part 1

• Targeting: TargetP (part of ProtFun, see part 1)

• Disorder, secondary structure, coiled coils etc.: Quick2D (at MPI)

• Transmembrane regions: TMHMM, also: Quick2D, SABLE

[Figure: Quick2D output]

Page 29

Protein function prediction: Predicting modifications in the context of function prediction

Transmembrane Regions: TMHMM (at CBS), in ProtFun

Page 30

Protein function prediction: Predicting modifications in the context of function prediction

Genomic context & phylogenetic occurrence:

STRING at EMBL:

Which interactions are supported by different methods?

Page 31

Protein function prediction:

Protein isoforms and the prediction of modifications

BLAT at UCSC → alternative transcripts → protein isoforms

Also: check SWISSPROT!

Do they show differences in their potential modification sites? (How could that affect function?)

e.g. SWISSPROT: TAU_HUMAN (pos. 30-120)

Page 32

Protein function prediction:

Interpretation of potential modifications

Predicted phosphorylation sites → protein-protein interactions?

→ ScanSite at MIT (see part 1)

Page 33

SUMMARY: Prediction of modification sites in the context of protein function prediction

• Prediction of protein modifications is often/best done in the context of protein function prediction (comprehensive protein annotation)

• Many kinds of signals can be found in such sequences, and often they can provide interesting hypotheses

• Any isoform-specific things? (modifications?)

• Functional consequences of the modification? (e.g. phospho-sites)

• Synergy between analyses! (e.g. structure → modification sites → evolution)

Reviews:
- F. Eisenhaber (2005) Eurekah Bioscience Collection (at NCBI Books) and the online “recipe” at http://mendel.imp.univie.ac.at/RECIPE/
- J. Bienkowska (2005) Expert Rev. Proteomics 2:129
- B. Rost (2003) Cell. Mol. Life Sci. 60:2637

Page 34

Mutation Effects:

Will a mutation / polymorphism (e.g. SNP) weaken or destroy the potential modification site, or even create a new one?

Example: NetPhosK analysis of p53_HUMAN cancer variants (pos. 151) → some modification sites disappear, others appear!

[Figure label: wt]

Blom et al. (2004) Proteomics 4:1633
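The same before/after comparison can be tried with any site predictor. The sketch below swaps in a plain regular-expression motif (the short W-X-X-[CW] pattern from the motif-building part) for NetPhosK, which is a neural-network predictor and is not reproduced here. The demo sequence and the two substitutions are hypothetical examples, not known variants.

```python
import re

def motif_sites(regex, sequence):
    """1-based start positions of all (possibly overlapping) motif matches."""
    return {m.start() + 1 for m in re.finditer("(?=" + regex + ")", sequence)}

def mutate(sequence, position, new_residue):
    """Return the sequence with the residue at a 1-based position replaced."""
    index = position - 1
    return sequence[:index] + new_residue + sequence[index + 1:]

def compare_sites(regex, wild_type, position, new_residue):
    """Report which motif matches are lost or gained by a point substitution."""
    before = motif_sites(regex, wild_type)
    after = motif_sites(regex, mutate(wild_type, position, new_residue))
    return {"lost": sorted(before - after), "gained": sorted(after - before)}

demo = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
motif = "W..[CW]"  # regex form of W-X-X-[CW]

print(compare_sites(motif, demo, 32, "A"))  # hypothetical W32A substitution
print(compare_sites(motif, demo, 19, "W"))  # hypothetical G19W substitution
```

In this toy case the first substitution removes both overlapping matches, while the second creates a new match upstream. With a scoring predictor the comparison would look at score changes rather than at the presence or absence of a match.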

Page 35

Overview

1. Introduction: what role does bioinformatics play?

2. Mining information related to protein modifications
- known modifications
- finding proteins with particular modifications

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function
- mutation effects
- analysis of mass spectrometry data

5. Online Materials: Exercises, Links

Page 36

Online Materials: Exercises, Links

1. Protein Function & Structure
2. Modifications: Generic Tools
3. Modification-specific Tools
4. Building Your Own Motif
5. Recommended Materials
6. Exercises

http://www.fmi.ch/groups/bioinformatics/ptm/bioinfo.ptm.htm