Popitam, a mutation- and modification-tolerant method for identifying proteins from tandem mass spectrometry (MS/MS) data

Patricia Hernandez, Swiss Institute of Bioinformatics

Overview

- proteomics
  - proteome
  - proteome visualization: 2D gels

- protein identification
  - classical workflow
  - shared peak count

- modifications and identification
  - modified peptides
  - SPC
  - spectral alignment, de novo sequencing, tag extraction

- Popitam
  - overview
  - tags
  - scoring function, genetic programming
  - some results

Proteome

--> Proteomics: science that studies proteins expressed by a genome --> proteome

--> changes with the state of development, the tissue or the environmental conditions

--> identification and quantification
--> 3D structure prediction
--> localisation in the cell
--> biological function
--> modifications
--> interactions with other proteins

...

proteomics

--> a simple way to "see" a proteome

--> numerous proteins from a biological sample (example: blood) are separated according to 2 criteria:
- molecular weight of the protein
- isoelectric point

--> this method allows separating simultaneously thousands of proteins and displaying them on a two-dimensional map

--> spot = (generally) one purified protein

--> we can "see" the proteins, but we do not know which protein a given spot corresponds to...

2D gels

Spots identification: classical workflow

select an unknown purified protein

cut the amino acid (aa) chain into peptides (after every K and R residue)

measure the mass of the peptides by ms

MS/MS identification

--> identify a spot = give a protein name to a spot

--> protein databases (for example SwissProt)
- record all known protein sequences
- annotated

MS identification (PMF)

select a peptide

fragment it

measure the mass of the fragments by ms

MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…

MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK...

MGMGQ MGQGWAWATWATA...

protein identification

MS: cut the theoretical sequence in silico into peptides and compute their masses

MS spectrum: list of the masses of the peptides that constitute the protein of interest

MS/MS spectrum: list of the masses of the fragments that constitute a peptide of the protein of interest

protein database

hbb_human

compare the lists of experimental and theoretical masses in order to find the best match between experimental and virtual spectra
--> detection
--> ions
--> noise

Shared peak count

MS/MS: cut the theoretical sequence in silico into peptides, further cut the peptides into fragments, and compute the masses
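A minimal sketch of the shared peak count idea, under stated assumptions: the function names (`digest`, `prefix_masses`, `shared_peak_count`), the tolerance, and the example peak lists are illustrative and not taken from the slides. Only prefix (b-type) fragment masses are modelled, using monoisotopic residue masses for a small subset of amino acids.

```python
# Monoisotopic residue masses (Da) for the few amino acids used in this sketch.
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "Y": 163.06333, "K": 128.09496}

def digest(protein):
    """Cut the chain after every K and R, as on the workflow slide."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def prefix_masses(peptide):
    """Theoretical fragment masses: sums of residue masses of each proper prefix."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses

def shared_peak_count(experimental, theoretical, tol=0.02):
    """Number of experimental peaks lying within tol of a theoretical mass."""
    return sum(any(abs(e - t) <= tol for t in theoretical) for e in experimental)

peptides = digest("MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK")
theo = prefix_masses("PEPYK")

# Hypothetical experimental peak lists: one unmodified, one with +80 on the E.
unmodified = [97.05, 226.10, 323.15, 400.0]
modified = [97.05, 306.10, 403.15, 566.21]
print(shared_peak_count(unmodified, theo), shared_peak_count(modified, theo))
```

In this toy case the +80 shift moves every fragment that contains the modified residue, so most peaks stop matching even though the peptide is the right one.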


Modified peptides (1)

The sequence in the database may differ from the experimental peptide:

PTMs
--> most eukaryotic proteins
--> addition of a chemical group:
- methylation: +14
- phosphorylation: +80
- glycosylation: >800 ...
--> contribute to:
- protein structures
- protein functions
- control of metabolic pathways

CONFLICT (different sources report differing sequences)
--> in about 4'600 human entries

VARIANT (authors report that sequence variants exist) = alleles
--> in about 2'200 human entries

MUTATIONS associated with diseases
--> 187 references to mutations and diseases in the COMMENTS section

modifications and identification

Modified peptides (2)

MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…

PEPYK

PYK

EPYK

PEP

a modified protein
--> digestion
--> MS, selection of the peptide
--> fragmentation

[Figure: two MS/MS spectra, intensity versus m/z]

SPC and modified peptides

"Shared peak count" algorithms have to introduce modifications into the theoretical peptide databases.

[Figure: spectra (intensity versus m/z) for the theoretical peptide, the experimental MS/MS spectrum and the modified experimental MS/MS spectrum]


AAIEGKLMQRAPALK


Database size (1)

New database, if the two following modifications are taken into account:
- modification occurring on amino acid A: A->a
- modification occurring on amino acids L: L->l and E: E->e

= all the peptides from the initial database, plus all modified peptides that can be built from the initial database

AAIEGKLMQRAPALK

AAIEGK aAIEGK AaIEGK aaIEGK AAIeGK aAIeGK AaIeGK aaIeGK
LMQR lMQR
APALK aPALK APaLK aPaLK APAlK aPAlK APalK aPalK
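The expansion of the database can be sketched as follows. The function name `variants` is hypothetical; a lowercase letter stands for the modified form of a residue, as on the slide, and each modifiable position is either modified or not.

```python
from itertools import product

def variants(peptide, modifiable="ALE"):
    """All versions of a peptide where each modifiable residue is kept or modified."""
    choices = [(aa, aa.lower()) if aa in modifiable else (aa,) for aa in peptide]
    return ["".join(combo) for combo in product(*choices)]

initial_database = ["AAIEGK", "LMQR", "APALK"]
new_database = [v for peptide in initial_database for v in variants(peptide)]
print(len(initial_database), "->", len(new_database))
```

A peptide with m modifiable positions contributes 2^m entries, which is why the database grows so quickly once several residues can be modified.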

B(L,p,k) gives the probability of having k modifiable positions in a sequence of length L, if p is the probability that a position may be modified (we assume the positions to be independent)

Aim: estimate the number of peptides that contain zero, one, two... "positions" for a possible modification

$$N_k = n \cdot B(L, p, k) = n \binom{L}{k} p^k (1-p)^{L-k}$$

$$n = \sum_{k=0}^{k_{\max}} N_k$$

L = 10, p = 1/20: 800'000 = 478'990 + 252'100 + 59'710 + 8'380 + 771 + c
L = 10, p = 5/20: 800'000 = 45'050 + 150'169 + 225'254 + 200'225 + 116'798 + c
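As a quick check of these figures, the binomial counts can be computed directly. This is a sketch assuming N_k = n * B(L, p, k) with B the binomial probability; the function name `expected_counts` is illustrative.

```python
from math import comb

def expected_counts(n, L, p, k_max):
    """Expected number of peptides with exactly k modifiable positions, k = 0..k_max."""
    return [n * comb(L, k) * p**k * (1 - p) ** (L - k) for k in range(k_max + 1)]

for p in (1 / 20, 5 / 20):
    counts = expected_counts(800_000, 10, p, 4)
    print(p, [round(c) for c in counts])
```

The values agree with the slide to within the rounding used there; the remainder c is the mass of the terms with k > 4.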

N0: xxxx

N1: oxxx xoxx xxox xxxo

N2: ooxx oxox oxxo xoox xoxo xxoo


Database size (2)

Expected number s_M of peptides that may contain exactly M modifications:

$$s_M = \sum_{k=M}^{L_{\max}} \binom{k}{M} N_k$$

Expected size of the database when taking into account 0 to M modifications:

$$s_{0\,\mathrm{to}\,M} = \sum_{i=0}^{M} s_i$$

N0: xxxx

N1: oxxx xoxx xxox xxxo

N2: ooxx ooxx oxox oxox ...


Database size (3)

SwissProt Human, 10'000 proteins
n = 806'787 peptides [300,3000] (approximately 3 to 30 aa)
L = 11 amino acids

0 to 3 modifications occurring on one specific amino acid: p = 1/20
P0to3_mod = 1'375'700 + c

0 to 3 modifications that may occur on several loci: phosphorylation on H, D, S, T, Y (eukaryotes): p = 5/20
P0to3_mod = 4'865'100 + c

0 to 3 modifications that may occur on every amino acid: p = 1
P0to3_mod = 3.97e12 + c

Mutation scenario: each amino acid may mutate into one of the remaining 19 amino acids
All possible words = 19^k-1
P1_mut = 1.16e14
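The first two estimates can be reproduced, approximately, with a short sketch. It assumes that a peptide with k modifiable positions yields C(k, M) variants carrying exactly M modifications, and that the sum over k is truncated at k_max = 4 (an assumption; the remainder is the "+ c" term). The function name `expected_database_size` is illustrative.

```python
from math import comb

def expected_database_size(n, L, p, max_mods, k_max):
    """Expected number of database entries when 0..max_mods modifications are allowed."""
    total = 0.0
    for k in range(k_max + 1):
        # Expected number of peptides with exactly k modifiable positions.
        n_k = n * comb(L, k) * p**k * (1 - p) ** (L - k)
        # Each such peptide gives C(k, m) variants with exactly m modifications.
        total += n_k * sum(comb(k, m) for m in range(max_mods + 1))
    return total

print(round(expected_database_size(806_787, 11, 1 / 20, 3, 4)))
print(round(expected_database_size(806_787, 11, 5 / 20, 3, 4)))
```

Under these assumptions the results fall within rounding distance of 1'375'700 and 4'865'100. The p = 1 and mutation figures are not reproduced by this sketch.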


Database size (3)

2 major problems:
- size of the database
- a priori knowledge of the deltaMass due to the modification

Solutions: define an identification algorithm that is not based on an SPC

--> spectral convolution/alignment
- PEDENTA (2000)

--> de novo sequencing followed by sequence matching
- extraction of one or several complete sequences: LUTEFISK (1997), SHERENGA (1999)...
- extraction of one or several small tags: PeptideSearch (1994), Patchwork sequencing...

--> Popitam (2003): "guided" sequencing


Other strategies

Spectral convolution/alignment

SPC score: D(k=0) = 2
SA score: D(k=2) = 6

Pevzner PA, Dancik V, Tang CL: Mutation-tolerant protein identification by mass spectrometry. J.Comput.Biol. 2000, 7:777-787

[Figure: the theoretical MS/MS spectrum (peaks A, B, C, D, E, F) compared with the experimental MS/MS spectrum]

Key idea: k-similarity D(k)

Given Sexp and Stheo, the goal is to find a series of k shifts in Sexp that makes Sexp and Stheo as similar as possible.

D(k) is the maximum number of elements shared by a theoretical and an experimental spectrum after k shifts.

$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1 & \text{if } (i',j') \text{ and } (i,j) \text{ are co-diagonal} \\ D_{i'j'}(k-1) + 1 & \text{otherwise} \end{cases}$$
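A minimal dynamic-programming sketch of the k-similarity recurrence, under assumptions that are not on the slide: spectra are lists of integer masses, both are anchored at mass 0 so that D(0) counts peaks at identical masses, and two cells are co-diagonal when their mass differences are equal. The function name `k_similarity` and the toy spectra are illustrative; real data would need a mass tolerance and a faster algorithm.

```python
def k_similarity(theo, exp, k_max):
    """Return [D(0), ..., D(k_max)]: best number of matched peaks with at most k shifts."""
    a = [0] + sorted(theo)
    b = [0] + sorted(exp)
    neg = float("-inf")
    # D[k][i][j]: longest chain of matched pairs ending at (a[i], b[j]) with at most k shifts.
    D = [[[neg] * len(b) for _ in a] for _ in range(k_max + 1)]
    for k in range(k_max + 1):
        D[k][0][0] = 0  # artificial origin, not counted as a peak
    for i in range(1, len(a)):
        for j in range(1, len(b)):
            for k in range(k_max + 1):
                best = neg
                for ip in range(i):
                    for jp in range(j):
                        if a[i] - a[ip] == b[j] - b[jp]:  # co-diagonal: no new shift
                            best = max(best, D[k][ip][jp] + 1)
                        elif k > 0:  # changing diagonal costs one shift
                            best = max(best, D[k - 1][ip][jp] + 1)
                D[k][i][j] = best
    return [max(max(row) for row in D[k]) for k in range(k_max + 1)]

# Toy spectra: the last three experimental peaks are shifted by +5.
print(k_similarity([10, 20, 30, 40, 50], [10, 20, 35, 45, 55], 2))
```

With no shift only the two unshifted peaks match; allowing a single shift recovers all five.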


De novo sequencing

Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun.Mass Spectrom. 1997, 11:1067-1075


Longest path problem in a directed acyclic graph
--> dynamic programming
--> complete sequences
--> mutations, but no modifications
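A sketch of the longest-path formulation, not the cited authors' implementation. Assumptions: nodes are fragment masses, an edge joins two nodes whose mass difference matches a residue mass within a tolerance, and only a small subset of monoisotopic residue masses is listed. The function name `longest_sequence` and the example masses are illustrative.

```python
# Monoisotopic residue masses (Da), subset only.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "V": 99.06841, "Q": 128.05858, "K": 128.09496,
}

def longest_sequence(masses, tol=0.01):
    """Longest chain of residues readable from the masses (dynamic programming on a DAG)."""
    nodes = sorted(masses)
    best = [""] * len(nodes)  # best[i]: longest sequence ending at node i
    for i, mass_i in enumerate(nodes):
        for j in range(i):
            gap = mass_i - nodes[j]
            for aa, residue in RESIDUE_MASS.items():
                if abs(gap - residue) <= tol and len(best[j]) + 1 > len(best[i]):
                    best[i] = best[j] + aa
    return max(best, key=len)

# Prefix masses of G, GA, GAS plus one noise peak at 100.0.
print(longest_sequence([0.0, 57.02146, 100.0, 128.05857, 215.09060]))
```

The example also shows a known ambiguity: the gap from 0 to 128.05857 matches Q as well as G followed by A, and the longest-path criterion prefers the three-residue reading.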


Tag extraction

Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal.Chem. 1994, 66:4390-4399
- island of sequence ions
- the tags (m1-SEQ-m2) are manually extracted
- 2 steps: tags as filtering, then SPC

Schlosser A, Lehmann WD: Patchwork peptide sequencing: Extraction of sequence information from accurate mass data of peptide tandem mass spectra recorded at high resolution. Proteomics. 2002, 2:524-533
- based on very accurate masses (10 mDa)
- small tags (2 aa) are extracted from low mass regions


Popitam's key idea

Spectrum graph
--> good way to structure the information contained in the MS/MS spectrum, allows mutations

Tags
--> modified source peptides
--> fragmented spectra

Search space
--> use database (dtb) information during tag extraction
--> take into account only mutations compatible with the spectrum (graph)
--> build only modification scenarios compatible with the current theoretical peptide

Scoring function
--> take into account a lot of parameters
--> genetic programming

Popitam

Popitam overview

filter

Peptide sequence database

any source of biological sequences

P1

P2

...

IDENTIFICATION

I(P1)

I(P2)

...


MS/MS

For each Pi

extractTags(); processTags(); score();


initial node

final node

Spectrum graph

“N-term”: bMass = chargeNb * m/z – (chargeNb-1) – offset

“C-term”: bMass = PM – […]

measured mass [m/z]

bMass (ideal fragmentation)

- selection based on intensity
- for each peak, make all possible hypotheses (b+ -NH3, b+, y++, a+ -H2O, ...)
- # nodes > # peaks
- families
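The N-terminal conversion can be written as a small helper. This is a sketch: only the N-term formula given on the slide is implemented, the C-term case is left out because its formula is not shown in full, and the offsets in `N_TERM_HYPOTHESES` are assumed values (neutral losses of NH3 and of CO plus H2O), not values taken from Popitam.

```python
def b_mass(mz, charge=1, offset=0.0):
    """N-term case: bMass = chargeNb * m/z - (chargeNb - 1) - offset."""
    return charge * mz - (charge - 1) - offset

# Hypothetical ionic hypotheses: name -> (charge, offset in Da). Assumed values.
N_TERM_HYPOTHESES = {
    "b+": (1, 0.0),
    "b++": (2, 0.0),
    "b+ -NH3": (1, -17.0265),
    "a+ -H2O": (1, -46.0055),
}

def nodes_for_peak(mz):
    """One candidate node (bMass) per ionic hypothesis for a single measured peak."""
    return {name: b_mass(mz, charge, offset)
            for name, (charge, offset) in N_TERM_HYPOTHESES.items()}

print(nodes_for_peak(300.0))
```

Because every peak is interpreted under several hypotheses, the graph has more nodes than the spectrum has peaks.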


Tag extraction

peLTE peLet peLvm peITE peIet peIvm petlE

LTE Let Lvm ITE Iet Ivm tlE

ckTEetvmgoEV


9 nodes, 11 edges --> 21 tags

AIGGGLSSVGGSSTIK (1159 peaks)

1 16/97 5.6*10^4 0m02s
2 30/338 5.4*10^6 0m27s
3 44/692 5.7*10^7 3m16s
4 58/1121 3.4*10^8 21m09s
5 72/1667 2.3*10^9 2h17m07s

AHFSISNSAEDPFIAIHADSK(145 peaks)

1 24/121 6.1*10^4 0m02s
2 46/308 1.9*10^8 16m15s
3 68/831 2.0*10^10 22h06m47s

LVNELTEFAK (125 peaks)

Tag extraction (2)

Pentium, 1.6 GHz


Tag extraction (3)

[Figure: recursive tag extraction on the theoretical peptide ACCACMCAK, with extracted tags such as CACMCAK and MCAK]

Recursively extract from the graph all tags that are compatible with the current theoretical peptide

--> a tag = a path (bMass, edge label, ionic hypothesis…)
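The guided extraction can be sketched as a depth-first search that only follows an edge when its label matches the next residue of the theoretical peptide. The graph encoding, the function name `extract_tags` and the toy graph are assumptions for illustration; bMass values and ionic hypotheses are omitted.

```python
def extract_tags(graph, peptide, min_len=2):
    """All paths of the graph whose edge labels spell a substring of the peptide."""
    tags = set()

    def extend(node, pos, seq):
        if len(seq) >= min_len:
            tags.add(seq)
        if pos >= len(peptide):
            return
        for label, nxt in graph.get(node, []):
            if label == peptide[pos]:  # follow only edges compatible with the peptide
                extend(nxt, pos + 1, seq + label)

    for node in graph:
        for pos in range(len(peptide)):
            extend(node, pos, "")
    return tags

# Toy spectrum graph: node -> [(amino acid label, next node)].
graph = {0: [("M", 1)], 1: [("C", 2), ("G", 3)], 2: [("A", 3)], 3: [("K", 4)], 4: []}
print(sorted(extract_tags(graph, "ACCACMCAK")))
```

The edge labelled G is never explored, since no position of the theoretical peptide is compatible with it; this pruning is what keeps the search space small compared with extracting every path of the graph.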


KplALVYGE 30 39 43 45 50 58 63 64 68
plALVYGE 39 43 45 50 58 63 64 68
ALVYGE 43 45 50 58 63 64 68
LVYGE 45 50 58 63 64 68
VYGE 50 58 63 64 68
YGE 58 63 64 68

paLKplALvy 0 4 10 16 22 26 31 42
LKplALvy 4 10 16 22 26 31 42
KplALvy 10 16 22 26 31 42
plALvy 16 22 26 31 42
ALvy 22 26 31 42

LKPla 10 13 19 22 31
LKPla 10 14 19 22 31
KPla 13 19 22 31
KPla 14 19 22 31

Tag processing

- discard subtags
- discard tags that begin the theoretical peptide, but not the graph (and vice versa)
- discard tags that finish on the last aa, but not on the last node
- group "family" tags
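The first rule can be sketched as follows, assuming that a tag is stored with the list of graph nodes it visits and that a subtag is a tag whose node list is a contiguous part of a longer tag's node list. The function names are hypothetical; the other three rules are not covered.

```python
def is_subtag(small, big):
    """True if the node list `small` is a strictly shorter contiguous run of `big`."""
    n = len(small)
    return n < len(big) and any(big[i:i + n] == small for i in range(len(big) - n + 1))

def discard_subtags(tags):
    """Keep only the tags that are not contained in another tag."""
    return [(seq, nodes) for seq, nodes in tags
            if not any(is_subtag(nodes, other) for _, other in tags)]

tags = [
    ("KplALVYGE", [30, 39, 43, 45, 50, 58, 63, 64, 68]),
    ("plALVYGE", [39, 43, 45, 50, 58, 63, 64, 68]),
    ("ALVYGE", [43, 45, 50, 58, 63, 64, 68]),
    ("YGE", [58, 63, 64, 68]),
    ("LKP", [11, 15, 20, 24]),
]
print([seq for seq, _ in discard_subtags(tags)])
```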

PLAlv 29 35 40 42 48
LAlv 35 40 42 48
DpaL 65 69 78 84
LKP 11 15 20 24
LVY 16 19 24 29
LVY 44 49 57 62
PAL 19 22 26 31
QDP 10 16 20 24
alkpL 54 63 71 75
avVqd 0 5 9 18
dpAL 37 43 45 50
avVQD 55 60 65 70 75
VQD 60 65 70 75
paLK 59 66 69 75

AVVQDPALKPLALVYGEATSR
PeakNb: 1260
ParentMass: 2197.15
NodeNb: 86
EdgeNb: 142 / 1098

29 tags --> 13 subSeqs


Aim:

Find all possible arrangements of subsequences, given the theoretical peptide, BUT do not include in the same arrangement tags that are incompatible with each other.

Compatibility rules:

--> no peak shared

--> beginMasses must respect positions in the sequences

Subsequence processing (1)

0 KplALVYGE 794.41 0 1 2 6 15 19 21 27 30
1 LKPla 282.17 2 7 29 33 41
2 PLAlv 785.34 6 8 19 21 28
3 DpaL 1673.89 14 20 31 36
4 LKP 284.11 17 22 32 36
5 LVY 410.26 14 22 28 29
...

A V V Q D P A L K P L A L V Y G E A T S R
0 5 10 15

  0 1 2 3 4 5 ...
0 x x
1 x x
2 x x
3 x x x x
4 x
5 x x x
...

Compatibility graph

Each clique found in the graph is a possible arrangement of subsequences.

Here, 91 cliques, but most of them are of little interest.
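A sketch of the compatibility graph and clique search, with several simplifications to note: only the "no peak shared" rule is implemented (the begin-mass rule is omitted), the search is a plain Bron-Kerbosch enumeration of maximal cliques rather than of all cliques, and only the six subsequences listed on this slide are used. Function names are illustrative.

```python
# Subsequence -> peaks it uses (the six rows listed on the slide).
SUBSEQS = [
    ("KplALVYGE", {0, 1, 2, 6, 15, 19, 21, 27, 30}),
    ("LKPla", {2, 7, 29, 33, 41}),
    ("PLAlv", {6, 8, 19, 21, 28}),
    ("DpaL", {14, 20, 31, 36}),
    ("LKP", {17, 22, 32, 36}),
    ("LVY", {14, 22, 28, 29}),
]

def compatibility_graph(subseqs):
    """Adjacency sets: two subsequences are compatible if they share no peak."""
    n = len(subseqs)
    return {i: {j for j in range(n) if j != i and not (subseqs[i][1] & subseqs[j][1])}
            for i in range(n)}

def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

adj = compatibility_graph(SUBSEQS)
for clique in maximal_cliques(adj):
    print(sorted(SUBSEQS[i][0] for i in clique))
```

Every subset of a maximal clique is also a valid arrangement, which is one reason the number of cliques grows quickly with the number of subsequences.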


Scoring function (1)

AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LKP 284.1 LVY 1202.7

AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LKP 284.1 LVY 1202.7 avVqd 1.0

AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LKP 284.1 avVqd 1.0

AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LVY 1202.7

...

--> 2-level scoring:

- scoring linked to the subsequences (local); subscores:
  - number of tags that compose the subsequence
  - length of the subsequence
  - occurrence probabilities of the hypothesized ionic types (geometric/arithmetic mean)

- scoring linked to the arrangement (global); subscores:
  - global coverage
  - linear regression


Scoring function (2)

How can we combine the subscores in order to build an efficient scoring function?

--> empirical function (expert knowledge)
--> probabilistic function

--> function built using GENETIC PROGRAMMING

population of "programs": trees
nodes:
- mathematical operators (+, -, *, /, ^, ...)
- Boolean operators (AND, OR, NOT...)
- conditional operators (if-then-else...)
- iterative functions (do-until...)
- other specific functions...
leaves: subscores, coefficients


GENETIC PROGRAMMING

Genetic operators (1)

Initialization: programs are initially generated at random (structure, functions, values).

Iterations: at each iteration, the programs are evaluated (fitness function). Only the best are allowed to reproduce, using genetic operators (permutation, mutation, crossing-over...).
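A toy sketch of this loop, far simpler than what the talk describes. Assumptions: trees are nested tuples over three arithmetic operators, leaves are subscore names or coefficients, mutation is the only genetic operator shown, and the fitness is a stand-in (fraction of training cases where the correct candidate outscores every decoy). All names, including the subscore names, are hypothetical.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, subscores):
    """Evaluate a scoring tree: (op, left, right), a subscore name, or a coefficient."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, subscores), evaluate(right, subscores))
    if isinstance(tree, str):
        return subscores[tree]
    return tree

def random_tree(rng, names, depth):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(names) if rng.random() < 0.7 else round(rng.uniform(-2.0, 2.0), 2)
    return (rng.choice(sorted(OPS)),
            random_tree(rng, names, depth - 1), random_tree(rng, names, depth - 1))

def mutate(rng, tree, names):
    """Replace a randomly chosen subtree by a fresh random one."""
    if not isinstance(tree, tuple) or rng.random() < 0.2:
        return random_tree(rng, names, 2)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(rng, left, names), right)
    return (op, left, mutate(rng, right, names))

def fitness(tree, cases):
    """Stand-in fitness: share of cases where the correct candidate beats all decoys."""
    hits = sum(evaluate(tree, correct) > max(evaluate(tree, d) for d in decoys)
               for correct, decoys in cases)
    return hits / len(cases)

def evolve(cases, names, generations=20, pop_size=30, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, names, 3) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, cases), reverse=True)
        parents = population[: pop_size // 3]  # only the best reproduce
        children = [mutate(rng, rng.choice(parents), names)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda t: fitness(t, cases))

names = ["coverage", "length"]
cases = [({"coverage": 0.9, "length": 8.0}, [{"coverage": 0.4, "length": 9.0}]),
         ({"coverage": 0.7, "length": 5.0}, [{"coverage": 0.2, "length": 5.0}])]
best = evolve(cases, names)
print(best, fitness(best, cases))
```

Keeping the parents in the next generation means the best fitness never decreases from one iteration to the next.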


Genetic operators (2)

Genetic programming

genetic programming allows testing several scoring functions and making them "cleverly" evolve in order to find an optimal one

[Figure: tree population and fitness]


scoring function 1
scoring function 2
scoring function 3

if (correctId())
    si in ]0.5;1[   (according to the discriminative power)
else {
    if (belongToList())
        si in ]0;0.5]   (according to the position in the list)
    else
        si = 0;
}

Some results
