Popitam,
Popitam, a mutation/modification-tolerant method for identifying proteins from tandem mass spectrometry (MS/MS) data
Patricia Hernandez Swiss Institute of Bioinformatics
Overview
- proteomics
  - proteome
  - proteome visualization: 2D gels
- protein identification
  - classical workflow
  - shared peak count
- modifications and identification
  - modified peptides
  - SPC
  - spectral alignment, de novo sequencing, tag extraction
- Popitam
  - overview
  - tags
  - scoring function, genetic programming
  - some results
Proteome
--> proteomics: the science that studies the proteins expressed by a genome (the proteome)
--> the proteome changes with the state of development, the tissue or the environmental conditions
--> identification and quantification
--> 3D structure prediction
--> localisation in the cell
--> biological function
--> modifications
--> interactions with other proteins
...
proteomics
--> a simple way to "see" a proteome
--> numerous proteins from a biological sample (example: blood) are separated according to 2 criteria:
- molecular weight of the protein
- isoelectric point
--> this method allows thousands of proteins to be separated simultaneously and displayed on a two-dimensional map
--> spot = (generally) one purified protein
--> we can "see" the proteins, but we don't know which protein a given spot corresponds to...
2d gels
proteomics
Spots identification: classical workflow
select an unknown purified protein
cut the amino acid chain into peptides (at every K and R)
measure the mass of the peptides by MS
--> MS identification (PMF, peptide mass fingerprinting)
select a peptide
fragment it
measure the mass of the fragments by MS
--> MS/MS identification
--> identify a spot = give a protein name to a spot
--> protein databases (for example SwissProt)
- record all known protein sequences
- annotated
MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…
MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK...
MGMGQ MGQGWAWATWATA...
protein identification
MS: virtually cut the theo. seq. into peptides and compute masses
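A minimal sketch of this "virtual cut" step, assuming cleavage after every K and R with no missed cleavages and standard monoisotopic residue masses. The function names `digest` and `peptide_mass` are illustrative and not part of Popitam:

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added to the sum of residues of a free peptide


def digest(sequence):
    """Cut the sequence after every K or R (assumption: no missed cleavages)."""
    peptides, current = [], ""
    for aa in sequence:
        current += aa
        if aa in "KR":
            peptides.append(current)
            current = ""
    if current:  # trailing piece that does not end with K or R
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Theoretical neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


protein = "MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK"
for pep in digest(protein):
    print(pep, round(peptide_mass(pep), 3))
```

The resulting list of peptide masses is what gets compared with the experimental MS spectrum.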
MS spectrum: list of the masses of the peptides that constitute the protein of interest
MS/MS spectrum: list of the masses of the fragments that constitute a peptide of the protein of interest
protein database
hbb_human
compare the lists of experimental and theoretical masses in order to find the best match between experimental and virtual spectra
--> detection
--> ions
--> noise
Shared peak count
MS/MS: virtually cut the theo. seq. into peptides, and further cut the peptides into fragments, and compute the masses
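A small sketch of a shared peak count on MS/MS data, under stated assumptions: only singly charged b and y ions are generated, a fixed mass tolerance is used, and the score is simply the number of experimental peaks that fall on a theoretical one. `fragment_masses` and `shared_peak_count` are hypothetical names:

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide, mod_index=None, delta=0.0):
    """Singly charged b and y ion masses; optionally shift one residue by delta."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_index is not None:
        masses[mod_index] += delta
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return sorted(b_ions + y_ions)


def shared_peak_count(experimental, theoretical, tol=0.5):
    """Number of experimental peaks within tol of at least one theoretical peak."""
    return sum(any(abs(e - t) <= tol for t in theoretical) for e in experimental)


theoretical = fragment_masses("PEPYK")
unmodified = fragment_masses("PEPYK")                                  # idealised spectrum
phosphorylated = fragment_masses("PEPYK", mod_index=3, delta=79.96633)  # +80 on Y

print(shared_peak_count(unmodified, theoretical))
print(shared_peak_count(phosphorylated, theoretical))
```

Every fragment that contains the modified residue is shifted, so the count drops for the modified spectrum; this is the weakness the following slides address.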
protein identification
Modified peptides (1)
The sequence in the database may differ from the experimental peptide:
PTMs (post-translational modifications)
--> most eukaryotic proteins
--> addition of a chemical group:
- methylation: +14 Da
- phosphorylation: +80 Da
- glycosylation: >800 Da ...
--> participate in:
- protein structures
- protein functions
- control of metabolic pathways
CONFLICT (different sources report differing sequences)
--> in about 4'600 human entries
VARIANT (authors report that sequence variants exist) = alleles
--> in about 2'200 human entries
MUTATIONS associated with diseases
--> 187 references to mutations and diseases in COMMENTS section
modifications and identification
Modified peptides (2)
MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…
PEPYK
PYK
EPYK
PEP
a modified protein
--> digestion
--> MS, selection of the peptide
--> fragmentation
[Figure: MS/MS spectra, intensity vs. m/z]
modifications and identification
SPC and modified peptides
"Shared peak count" algorithms have to introduce modifications into the theoretical peptide databases.
[Figure: spectra (intensity vs. m/z) for the theoretical peptide, the experimental MS/MS spectrum and the modified experimental MS/MS spectrum]
modifications and identification
Database size (1)
New database, if the following two modifications are taken into account:
- modification occurring on amino acid A: A->a
- modification occurring on amino acids L and E: L->l, E->e
= all the peptides from the initial database, plus all modified peptides that can be built from the initial database
AAIEGKLMQRAPALK
AAIEGK aAIEGK AaIEGK aaIEGK AAIeGK aAIeGK AaIeGK aaIeGK
LMQR lMQR
APALK aPALK APaLK aPaLK APAlK aPAlK APalK aPalK
B(L,p,k) gives the probability of having k positions of modification in a sequence of length L, if p is the probability that a position may be modified
(we assume the positions to be independent)
Aim: assess the number of peptides that contain zero, one, two... "positions" for a possible modification
$$N_k = n\,B(L,p,k) = n\,C_L^k\,p^k\,(1-p)^{L-k}$$
$$n = \sum_{k=0}^{k_{\max}} N_k$$
L = 10, p = 1/20: 800'000 = 478'990 + 252'100 + 59'710 + 8'380 + 771 + c
L = 10, p = 5/20: 800'000 = 45'050 + 150'169 + 225'254 + 200'225 + 116'798 + c
N0: xxxx
N1: oxxx xoxx xxox xxxo
N2: ooxx oxox oxxo xoox xoxo xxoo
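The decomposition above can be checked with a few lines, assuming the binomial model exactly as stated (independent positions, n peptides of length L). The helper name `peptides_by_site_count` is illustrative:

```python
from math import comb


def peptides_by_site_count(n, length, p, k_max):
    """N_k = n * C(L, k) * p^k * (1 - p)^(L - k) for k = 0..k_max."""
    return [n * comb(length, k) * p**k * (1 - p) ** (length - k)
            for k in range(k_max + 1)]


for p in (1 / 20, 5 / 20):
    counts = peptides_by_site_count(800_000, 10, p, 4)
    print(p, [round(c) for c in counts])
```

The terms for k = 0..4 should land close to the values on the slide; the remainder c collects the peptides with more than four modifiable positions.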
modifications and identification
Database size (2)
Expected number s_M of peptides that may contain exactly M modifications:
$$s_M = \sum_{k=M}^{L_{\max}} C_k^M\,N_k$$
Expected size of the database when taking into account 0 to M modifications:
$$s_{0\,\mathrm{to}\,M} = \sum_{i=0}^{M} s_i$$
N0: xxxx
N1: oxxx xoxx xxox xxxo
N2: ooxx ooxx oxox oxox ...
modifications and identification
Database size (3)
SwissProt Human, 10'000 proteins
n = 806'787 peptides with a mass in [300, 3000] (approximately 3 to 30 aa)
L = 11 amino acids
0 to 3 modifications occurring on one specific amino acid: p = 1/20
P0to3_mod = 1'375'700 + c
0 to 3 modifications that may occur on several loci: phosphorylation on H, D, S, T, Y (eukaryotes): p = 5/20
P0to3_mod = 4'865'100 + c
0 to 3 modifications that may occur on every amino acid: p = 1
P0to3_mod = 3.97e12 + c
Mutation scenario: each amino acid may mutate into one of the remaining 19 amino acids
All possible words = 19^k - 1
P1_mut = 1.16e14
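A sketch of the database-size estimate for the first two scenarios, combining the binomial counts N_k with the sums s_M defined on the previous slide. One assumption is made here and is not stated on the slides: the sum over k is truncated at k = 4, with the rest absorbed in c. With that truncation the sketch appears to land close to the quoted figures. `expected_database_size` is a hypothetical name:

```python
from math import comb


def expected_database_size(n, length, p, max_mods, k_max=4):
    """Sum over M = 0..max_mods of s_M = sum_k C(k, M) * N_k.

    Assumption: k runs only up to k_max (the remainder is the slide's "+ c").
    """
    n_k = [n * comb(length, k) * p**k * (1 - p) ** (length - k)
           for k in range(k_max + 1)]
    total = 0.0
    for m in range(max_mods + 1):
        # math.comb(k, m) is 0 when m > k, so short peptides drop out naturally
        total += sum(comb(k, m) * n_k[k] for k in range(k_max + 1))
    return total


n, length = 806_787, 11
print(round(expected_database_size(n, length, 1 / 20, 3)))
print(round(expected_database_size(n, length, 5 / 20, 3)))
```

Moving from one modifiable amino acid (p = 1/20) to five (p = 5/20) multiplies the estimated size by roughly 3.5, before any mutation scenario is considered.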
modifications and identification
Database size (3)
2 major problems:
- size of the database
- a priori knowledge of the deltaMass due to the modification
Solutions: define an identification algorithm that is not based on an SPC
--> spectral convolution/alignment
- PEDANTA (2000)
--> de novo sequencing followed by sequence matching
- extraction of one or several complete sequences: LUTEFISK (1997), SHERENGA (1999)...
- extraction of one or several small tags: PeptideSearch (1994), Patchwork sequencing...
--> Popitam (2003): "guided" sequencing
modifications and identification
Other strategies
Spectral convolution/alignment
SPC score: D(k=0) = 2
SA score: D(k=2) = 6
Pevzner PA, Dancik V, Tang CL: Mutation-tolerant protein identification by mass spectrometry. J.Comput.Biol. 2000, 7:777-787
[Figure: alignment matrix of the theoretical MS/MS spectrum (peaks A B C D E F) against the experimental MS/MS spectrum]
Key idea: k-similarity D(k)
Given Sexp and Stheo, the goal is to find a series of k shifts in Sexp that makes Sexp and Stheo as similar as possible.
D(k) represents the maximum number of elements in common between a theoretical and an experimental spectrum after k shifts.
$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1 & \text{if } (i',j') \text{ and } (i,j) \text{ are co-diagonal} \\ D_{i'j'}(k-1) + 1 & \text{otherwise} \end{cases}$$
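A direct, unoptimised sketch of this recurrence, under a few assumptions of mine: masses are integers (so "co-diagonal" is an exact test), a virtual peak at mass 0 is added to both spectra so that D(0) reduces to the shared peak count, and D(k) is read as "at most k shifts". The function name is illustrative:

```python
def spectral_alignment(theoretical, experimental, max_shifts):
    """Return [D(0), D(1), ..., D(max_shifts)] for two integer-mass spectra."""
    a = [0] + sorted(theoretical)   # index 0 is the virtual origin
    b = [0] + sorted(experimental)
    neg = float("-inf")             # marks cells no path can reach
    d = [[[neg] * len(b) for _ in a] for _ in range(max_shifts + 1)]
    best = []
    for k in range(max_shifts + 1):
        d[k][0][0] = 0              # the origin itself counts for no match
        for i in range(1, len(a)):
            for j in range(1, len(b)):
                for ip in range(i):
                    for jp in range(j):
                        if a[i] - a[ip] == b[j] - b[jp]:
                            cand = d[k][ip][jp] + 1       # co-diagonal: no new shift
                        elif k > 0:
                            cand = d[k - 1][ip][jp] + 1   # otherwise: spend one shift
                        else:
                            continue
                        if cand > d[k][i][j]:
                            d[k][i][j] = cand
        best.append(max(max(row) for row in d[k]))
    return best


theo = [10, 20, 30, 40, 50]
exp = [10, 20, 35, 45, 55]   # last three peaks shifted by +5
print(spectral_alignment(theo, exp, 1))
```

In this toy case only two peaks are shared without a shift, while a single shift brings all five peaks into correspondence, without knowing the mass difference in advance.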
modifications and identification
De novo sequencing
Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun.Mass Spectrom. 1997, 11:1067-1075
Longest path problem in a directed acyclic graph
--> dynamic programming
--> complete sequences
--> mutations, but no modifications
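A toy illustration of de novo sequencing as a longest-path search, not the algorithm of the cited tools. Assumptions: the nodes are prefix (b-type) masses without the proton, two nodes are linked when their difference matches one residue mass, and isoleucine is left out because it has the same mass as leucine:

```python
# Monoisotopic residue masses in Da; I is omitted (same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def longest_path_sequence(prefix_masses, tol=0.02):
    """Longest chain of residue-mass steps through the sorted nodes."""
    nodes = sorted(prefix_masses)
    # best[j] = (number of edges, sequence) of the longest path ending at node j
    best = [(0, "")] * len(nodes)
    for j in range(len(nodes)):          # sorted masses give a topological order
        for i in range(j):
            diff = nodes[j] - nodes[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol and best[i][0] + 1 > best[j][0]:
                    best[j] = (best[i][0] + 1, best[i][1] + aa)
    return max(best)[1]


# Prefix masses of PEPYK plus one noise node at 150.0
masses, running = [0.0], 0.0
for aa in "PEPYK":
    running += RESIDUE_MASS[aa]
    masses.append(running)
masses.append(150.0)
print(longest_path_sequence(masses))
```

A mutation changes one edge label and is still read correctly, but an unexpected modification breaks the chain because no residue mass matches the shifted difference.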
modifications and identification
Tag extraction
Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal.Chem. 1994, 66:4390-4399
- island of sequence ions
- the tags (m1-SEQ-m2) are manually extracted
- 2 steps: tags as filtering, then SPC
Schlosser A, Lehmann WD: Patchwork peptide sequencing: Extraction of sequence information from accurate mass data of peptide tandem mass spectra recorded at high resolution. Proteomics. 2002, 2:524-533
- based on very accurate masses (10 mDa)
- small tags are extracted from low mass regions (2 aa)
modifications and identification
Popitam's key idea
Spectrum graph
--> good way to structure the information contained in the MS/MS spectrum, allows mutations
Tags
--> modified source peptides
--> fragmented spectra
Search space
--> use database information during tag extraction
--> take into account only mutations compatible with the spectrum (graph)
--> build only modification scenarios compatible with the current theoretical peptide
Scoring function
--> take into account a lot of parameters
--> genetic programming
Popitam
Popitam overview
MS/MS spectrum + peptide sequence database (any source of biological sequences)
--> filter --> candidate peptides P1, P2, ...
For each Pi: extractTags(); processTags(); score();
--> IDENTIFICATION: I(P1), I(P2), ...
Popitam
initial node
final node
Spectrum graph
“N-term”: bMass = chargeNb * m/z – (chargeNb-1) – offset
“C-term”: bMass = PM – […]
measured mass [m/z]
bMass (ideal fragmentation)
- selection based on intensity
- for each peak, make all possible hypotheses: b+, b+ -NH3, a+ -H2O, y++, ...
- # nodes > # peaks
- families
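A sketch of how one peak could be expanded into several N-terminal node hypotheses with the bMass formula above. The offsets below are my own assumptions (mass differences relative to the b+ ion, derived from the masses of NH3, H2O and CO); the slides do not list the values Popitam uses. C-terminal hypotheses are left out because their formula is elided on the slide:

```python
# Hypothetical offsets in Da, relative to the b+ ion (assumed values).
N_TERM_OFFSETS = {
    "b+": 0.0,
    "b+ -NH3": -17.02655,
    "a+ -H2O": -(27.99491 + 18.01056),
}


def n_term_nodes(mz, max_charge=2):
    """All (bMass, ion type, charge) hypotheses for one measured m/z value.

    Uses the slide's formula: bMass = chargeNb * m/z - (chargeNb - 1) - offset.
    """
    nodes = []
    for charge in range(1, max_charge + 1):
        for ion_type, offset in N_TERM_OFFSETS.items():
            b_mass = charge * mz - (charge - 1) - offset
            nodes.append((round(b_mass, 4), ion_type, charge))
    return nodes


for node in n_term_nodes(324.155):
    print(node)
```

One peak yields several nodes here, which is why the graph has more nodes than the spectrum has peaks.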
Popitam
Tag extraction
peLTE peLet peLvm peITE peIet peIvm petlE
LTE Let Lvm ITE Iet Ivm tlE
ckTEetvmgoEV
Popitam
9 nodes,11 edges--> 21 tags
AIGGGLSSVGGSSTIK (1159 peaks)
1  16/97    5.6*10^4  0m02s
2  30/338   5.4*10^6  0m27s
3  44/692   5.7*10^7  3m16s
4  58/1121  3.4*10^8  21m09s
5  72/1667  2.3*10^9  2h17m07s
AHFSISNSAEDPFIAIHADSK (145 peaks)
1  24/121   6.1*10^4   0m02s
2  46/308   1.9*10^8   16m15s
3  68/831   2.0*10^10  22h06m47s
LVNELTEFAK (125 peaks)
Tag extraction (2)
Pentium, 1.6 GHz
Popitam
Tag extraction (3)
[Figure: recursive extraction of tags from the spectrum graph for the theoretical peptide ACCACMCAK]
Recursively extract from the graph all tags that are compatible with the current theoretical peptide
--> a tag = a path (bMass, edge label, ionic hypothesis…)
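A simplified sketch of the recursive, peptide-guided extraction described above. The data layout is an assumption: the graph is a dict mapping each node to a list of (edge label, next node) pairs, and a tag is reported as (sequence, start position in the peptide, start node). bMasses and ionic hypotheses are left out:

```python
def extract_tags(graph, peptide, min_len=2):
    """Collect paths of the graph whose edge labels spell a stretch of the peptide."""
    tags = set()

    def walk(start_node, node, begin, pos):
        extended = False
        for label, next_node in graph.get(node, []):
            # follow an edge only if its label matches the peptide at this position
            if peptide.startswith(label, pos):
                extended = True
                walk(start_node, next_node, begin, pos + len(label))
        if not extended and pos - begin >= min_len:
            tags.add((peptide[begin:pos], begin, start_node))

    for node in graph:
        for begin in range(len(peptide)):
            walk(node, node, begin, begin)
    return tags


# Toy graph: 0 -A-> 1 -C-> 2, then 2 -M-> 3 or 2 -K-> 4
toy_graph = {0: [("A", 1)], 1: [("C", 2)], 2: [("M", 3), ("K", 4)], 3: [], 4: []}
print(sorted(extract_tags(toy_graph, "ACK")))
```

Because the walk is driven by the theoretical peptide, the edge labelled M is never explored for the candidate ACK. The sketch still emits subtags (CK inside ACK), which the tag processing step removes.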
Popitam
KplALVYGE 30 39 43 45 50 58 63 64 68
plALVYGE 39 43 45 50 58 63 64 68
ALVYGE 43 45 50 58 63 64 68
LVYGE 45 50 58 63 64 68
VYGE 50 58 63 64 68
YGE 58 63 64 68
paLKplALvy 0 4 10 16 22 26 31 42
LKplALvy 4 10 16 22 26 31 42
KplALvy 10 16 22 26 31 42
plALvy 16 22 26 31 42
ALvy 22 26 31 42
LKPla 10 13 19 22 31
LKPla 10 14 19 22 31
KPla 13 19 22 31
KPla 14 19 22 31
Tag processing
- discard subtags
- discard tags that begin the theo. peptide, but not the graph (and vice versa)
- discard tags that finish on the last aa, but not on the last node
- group "family" tags
PLAlv 29 35 40 42 48
LAlv 35 40 42 48
DpaL 65 69 78 84
LKP 11 15 20 24
LVY 16 19 24 29
LVY 44 49 57 62
PAL 19 22 26 31
QDP 10 16 20 24
alkpL 54 63 71 75
avVqd 0 5 9 18
dpAL 37 43 45 50
avVQD 55 60 65 70 75
VQD 60 65 70 75
paLK 59 66 69 75
AVVQDPALKPLALVYGEATSR
PeakNb : 1260
ParentMass : 2197.15
NodeNb : 86
EdgeNb : 142 / 1098
29 tags --> 13 subSeqs
Popitam
Aim:
Find all possible arrangements of subsequences, given the theoretical peptide,
BUT
do not include in the same arrangement tags that are incompatible with the others.
Compatibility rules:
--> no peak shared
--> beginMasses must respect positions in the sequences
Subsequence processing (1)
#  subsequence  beginMass  peaks
0  KplALVYGE    794.41     0 1 2 6 15 19 21 27 30
1  LKPla        282.17     2 7 29 33 41
2  PLAlv        785.34     6 8 19 21 28
3  DpaL         1673.89    14 20 31 36
4  LKP          284.11     17 22 32 36
5  LVY          410.26     14 22 28 29
...
A V V Q D P A L K P L A L V Y G E A T S R
0         5         10        15
0 1 2 3 4 5 ...
0 x x
1 x x
2 x x
3 x x x x
4 x
5 x x x
...
Compatibility graph
Each clique found in the graph is a possible arrangement of subsequences.
Here, 91 cliques, but most of them are really uninteresting.
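A sketch of the compatibility graph on the six subsequences listed above, reading the last column as peak indices. Two simplifications are mine: only the "no peak shared" rule is applied (the beginMass ordering rule is omitted), and only maximal cliques are listed, using the standard Bron-Kerbosch recursion rather than whatever enumeration Popitam uses:

```python
# Peaks used by each subsequence (taken from the slide).
SUBSEQ_PEAKS = {
    0: {0, 1, 2, 6, 15, 19, 21, 27, 30},  # KplALVYGE
    1: {2, 7, 29, 33, 41},                # LKPla
    2: {6, 8, 19, 21, 28},                # PLAlv
    3: {14, 20, 31, 36},                  # DpaL
    4: {17, 22, 32, 36},                  # LKP
    5: {14, 22, 28, 29},                  # LVY
}

# Compatibility graph: an edge links two subsequences that share no peak.
ADJ = {
    u: {v for v in SUBSEQ_PEAKS if v != u and not (SUBSEQ_PEAKS[u] & SUBSEQ_PEAKS[v])}
    for u in SUBSEQ_PEAKS
}


def maximal_cliques(current, candidates, excluded, out):
    """Bron-Kerbosch without pivoting: append every maximal clique to out."""
    if not candidates and not excluded:
        out.append(sorted(current))
        return
    for v in list(candidates):
        maximal_cliques(current | {v}, candidates & ADJ[v], excluded & ADJ[v], out)
        candidates = candidates - {v}
        excluded = excluded | {v}


cliques = []
maximal_cliques(set(), set(SUBSEQ_PEAKS), set(), cliques)
print(sorted(cliques))
```

Each clique is a set of subsequences that can coexist in one arrangement; adding the position rule would prune this list further.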
Popitam
Scoring function (1)
AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LKP 284.1 LVY 1202.7
AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LKP 284.1 LVY 1202.7 avVqd 1.0
AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LKP 284.1 avVqd 1.0
AVVQDPALKPLALVYGEATSR KplALVYGE 794.4 LVY 1202.7
...
--> 2-level scoring:
- scoring linked to the subsequences (local)
  subscores:
  - number of tags that compose the subsequence
  - length of the subsequence
  - occurrence probabilities of the ionic type hypothesized (geometric/arithmetic mean)
- scoring linked to the arrangement (global)
  subscores:
  - global coverage
  - linear regression
Popitam
Scoring function (2)
How can we combine the subscores in order to build an efficient scoring function?
--> empirical function (expert knowledge)
--> probabilistic function
--> function built using GENETIC PROGRAMMING
population of "programs": trees
nodes:
- mathematical operators (+, -, *, /, ^, ...)
- boolean operators (AND, OR, NOT...)
- conditional operators (if-then-else...)
- iterative functions (do-until...)
- other specific functions...
leaves: subscores, coefficients
Popitam
GENETIC PROGRAMMING
Genetic operators (1)
Initialization: programs are initially randomly determined (structure, functions, values)
Iterations: at each iteration, the programs are evaluated (fitness function). Only the best are allowed to reproduce, using genetic operators (permutation, mutation, crossing-over...).
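A deliberately tiny sketch of this loop, to make the idea concrete. Everything in it is an assumption for illustration: the subscore names, the toy training cases, the operator set (+, -, * only), the fitness (fraction of cases where the correct candidate outscores a wrong one) and the two operators implemented (mutation and crossing-over). It is not the setup used for Popitam:

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
LEAVES = ["n_tags", "length", "coverage"]  # hypothetical subscore names


def random_tree(rng, depth=2):
    """Random program: a nested tuple (op, left, right) or a leaf."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES) if rng.random() < 0.7 else round(rng.uniform(-2, 2), 2)
    return (rng.choice(list(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, subscores):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, subscores), evaluate(right, subscores))
    return subscores[tree] if isinstance(tree, str) else tree


def fitness(tree, cases):
    """Fraction of cases where the correct candidate scores strictly higher."""
    return sum(evaluate(tree, good) > evaluate(tree, bad) for good, bad in cases) / len(cases)


def mutate(tree, rng):
    """Replace a randomly chosen subtree by a fresh random one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))


def crossover(a, b, rng):
    """Graft a child of b (or b itself if it is a leaf) into a."""
    if not isinstance(a, tuple):
        return b
    op, left, right = a
    donor = b[rng.choice([1, 2])] if isinstance(b, tuple) else b
    return (op, donor, right) if rng.random() < 0.5 else (op, left, donor)


def evolve(cases, generations=20, size=30, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, cases), reverse=True)
        history.append(fitness(population[0], cases))
        parents = population[: size // 3]  # only the best are allowed to reproduce
        children = []
        while len(parents) + len(children) < size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            children.append(mutate(child, rng) if rng.random() < 0.5 else child)
        population = parents + children
    best = max(population, key=lambda t: fitness(t, cases))
    return best, history + [fitness(best, cases)]


# Toy cases: (subscores of the correct peptide, subscores of a wrong candidate)
CASES = [
    ({"n_tags": 5.0, "length": 9.0, "coverage": 0.8}, {"n_tags": 2.0, "length": 4.0, "coverage": 0.3}),
    ({"n_tags": 3.0, "length": 6.0, "coverage": 0.6}, {"n_tags": 4.0, "length": 3.0, "coverage": 0.2}),
    ({"n_tags": 2.0, "length": 8.0, "coverage": 0.7}, {"n_tags": 2.0, "length": 5.0, "coverage": 0.4}),
    ({"n_tags": 6.0, "length": 5.0, "coverage": 0.5}, {"n_tags": 1.0, "length": 7.0, "coverage": 0.5}),
]
best_tree, history = evolve(CASES)
print(best_tree, history[-1])
```

Since the parents are carried over unchanged into the next generation, the best fitness never decreases from one iteration to the next.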
Popitam
Genetic operators (2)
Genetic programming
[Figure: tree population evaluated by the fitness function]
genetic programming allows testing several scoring functions and making them "cleverly" evolve in order to find an optimal one
[Figure: scoring function1, scoring function2, scoring function3]
if (correctId())
    s_i in ]0.5; 1[ (according to the discriminative power)
else {
    if (belongToList())
        s_i in ]0; 0.5] (according to the position in the list)
    else
        s_i = 0;
}
Some results