PepArML: A model-free, result-combining peptide identification arbiter via machine learning
Date posted: 19-Dec-2015
Xue Wu, Chau-Wen Tseng, Nathan Edwards
University of Maryland, College Park, and Georgetown University Medical Center
2
Comparison of Search Engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
• Many spectra lack any peptide assignment
Searle et al. JPR 7(1), 2008
[Figure: Venn diagram of identifications by X! Tandem, SEQUEST, and Mascot; region labels 38%, 14%, 28%, 14%, 3%, 2%, 1%]
3
Black-box Techniques
• Significance re-estimation
  • Target-Decoy search
  • Bimodal distribution fit
• Supervised machine learning
  • Train predictors on synthetic datasets
  • Select and/or create (many) good features
• Result combiners
  • Incorrect peptide IDs unlikely to match
  • Significance re-estimation
  • Independence and/or supervised model
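Target-decoy significance re-estimation, as listed above, can be illustrated with a minimal sketch. It assumes the common decoys/targets estimator and a "higher score is better" convention; the function names and the example scores are hypothetical and are not taken from the slides.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate the FDR among target hits scoring at or above `threshold`.

    psms: list of (score, is_decoy) pairs, higher score meaning a better match.
    Assumption: FDR is estimated as (#decoy hits) / (#target hits).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is <= max_fdr, or None."""
    best = None
    for t in sorted({score for score, _ in psms}, reverse=True):
        if target_decoy_fdr(psms, t) <= max_fdr:
            best = t
    return best


# Hypothetical search results: (score, is_decoy)
example_psms = [
    (9.1, False), (8.4, False), (7.7, False), (6.9, True), (6.5, False),
    (5.2, False), (4.8, True), (4.1, False), (3.3, True), (2.9, True),
]
fdr_at_6_5 = target_decoy_fdr(example_psms, 6.5)   # 1 decoy / 4 targets
cutoff_25 = threshold_for_fdr(example_psms, 0.25)
```

Because the estimate is not monotone in the threshold, the sketch scans every observed score rather than stopping at the first failure.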
4
PepArML
• Unified machine learning result combiner
  • Significance re-estimation too!
• Model-free feature use and result combination
  • Use agreement and features if useful
• Unsupervised training procedure
  • No loss of classification performance
5
PepArML Overview
[Diagram: results from X!Tandem, Mascot, OMSSA, and other search engines feed into PepArML]
6
PepArML Overview
[Diagram: results from X!Tandem, Mascot, OMSSA, and other search engines feed into PepArML, with a feature extraction step]
7
Dataset Construction
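[Diagram: one row per spectrum-peptide pair, with feature columns from X!Tandem, Mascot, and OMSSA and a true/false (T/F) label]
$(S_1, P_1)$: T
$(S_1, P_2)$: F
$(S_2, P_1)$: T
…
$(S_n, P_m)$: T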
Dataset Construction
• Calibrant 8 Protein Mix (C8)
  • 4594 MS/MS spectra (LTQ)
  • 618 (11.2%) true positives
• Sashimi 17mix_test2 (S17)
  • 1389 MS/MS spectra (Q-TOF)
  • 354 (25.4%) true positives
• AURUM 1.0 (364 Proteins)
  • 7508 MS/MS spectra (MALDI-TOF-TOF)
  • 3775 (50.3%) true positives
9
PepArML Machine Learning
• Machine learning (generally) helps single search engines
• PepArML result-combiner (C-TMO: Tandem, Mascot, and OMSSA combined) improves on single search engines
• Sometimes combining two search engines works as well as, or better than, combining three
10
PepArML vs Search Engines (C8)
11
True vs. Est. FDR (C-TMO, C8)
12
PepArML vs Search Engines (C8)
13
PepArML Pairs vs PepArML (C8)
14
Sensitivity Comparison
[Figure: sensitivity (y-axis, 0.0 to 1.0) by classifier (FDR, dataset) at 10% FDR on C8, S17, and AURUM. Series: C-TMO, C-TM, C-TO, C-MO, C-T, C-M, C-O, Tandem, Mascot, OMSSA]
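A sensitivity-at-fixed-FDR comparison like the one above can be computed on a labeled dataset roughly as follows. This is a sketch under assumptions: FDR is taken as FP / (TP + FP) among accepted identifications, sensitivity as TP over all true identifications, and the function name and example scores are hypothetical.

```python
def sensitivity_at_fdr(scored, max_fdr=0.10):
    """Best sensitivity reachable while keeping the true FDR <= max_fdr.

    scored: list of (score, is_true) pairs, higher score meaning more confident.
    """
    total_true = sum(1 for _, is_true in scored if is_true)
    if total_true == 0:
        return 0.0
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Accept every identification tied at the current score together
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / (tp + fp) <= max_fdr:
            best = max(best, tp / total_true)
        i = j
    return best


# Hypothetical classifier output: (confidence, is_true)
example_scored = [
    (0.99, True), (0.95, True), (0.90, True), (0.85, True), (0.80, True),
    (0.75, True), (0.70, True), (0.65, True), (0.60, True), (0.55, False),
    (0.50, True), (0.45, False), (0.40, True),
]
example_sensitivity = sensitivity_at_fdr(example_scored, max_fdr=0.10)
```

Here 10 of the 11 true identifications can be accepted with one false positive (FDR 1/11), so the sensitivity at 10% FDR is 10/11.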
15
Feature Evaluation
[Figure: information gain (y-axis, 0 to 0.8) for features 1 to 20 (x-axis), one panel each for C8, S17, and AURUM; features are grouped by search engine (Tandem, OMSSA, Mascot)]
Features:
1 Peptide length
2 hyperscore
3 precursor mass delta
4 # of matched y-ions
5 # of matched b-ions
6 # of missed cleavages
7 sum matched intensity
8 E-value
9 sentinel
10 score
11 precursor mass delta
12 # of matched ions
13 # of matched peaks
14 # of missed cleavages
15 E-value
16 sentinel
17 p-value
18 # of matched ions
19 E-value
20 sentinel
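For reference, the information gain of a feature with respect to the true/false label can be computed as the label entropy minus the label entropy conditioned on the feature value. The sketch below assumes discrete feature values and adds a simple equal-width binning helper for continuous ones; the slides do not say how features were discretized, so that choice and all names here are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(label) - H(label | feature value) for a discrete-valued feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def equal_width_bins(values, n_bins=4):
    """Map continuous values to bin indices 0..n_bins-1 (assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy example: a feature that separates the labels perfectly, and one that does not
example_labels = [True, True, False, False]
gain_perfect = information_gain([1, 1, 0, 0], example_labels)
gain_useless = information_gain([1, 0, 1, 0], example_labels)
```

With balanced binary labels the maximum possible gain is 1 bit, reached by the perfectly separating feature; the uninformative feature yields 0.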
16
Application to Real Data
• How well do these models generalize?
• Different instruments
  • Spectral characteristics change scores
• Search parameters
  • Different parameters change score values
• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters x search engine sets x instruments
17
Model Generalization
Train C8 / Score S17
Train S17 / Score S17
18
Rescuing Machine Learning
• Train a new machine learning model for every dataset!
  • Generalization not required
  • No predetermined search engines, parameters, instruments, features
• Perhaps we can “guess” the true proteins
  • Most proteins not in doubt
  • Machine learning can tolerate imperfect labels
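One way to read the idea above is as an iterate-until-stable loop: guess a protein set, label identifications by membership in it, train, re-score, and re-select proteins. The slides do not spell out the procedure, so the sketch below is only an assumption about its shape; the loop, the toy midpoint "classifier", the two-hit selection rule, and the example data are all hypothetical stand-ins.

```python
from collections import Counter


def iterate_labels(psms, initial_proteins, train, select_proteins, max_rounds=20):
    """Repeat label -> train -> score -> select until the protein set stops changing.

    train(psms, labels) returns a scoring function psm -> float.
    select_proteins(psms, scores) returns a set of protein ids.
    """
    proteins = set(initial_proteins)
    scores = []
    for round_number in range(1, max_rounds + 1):
        labels = [p["protein"] in proteins for p in psms]   # imperfect "guessed" labels
        score = train(psms, labels)
        scores = [score(p) for p in psms]
        new_proteins = select_proteins(psms, scores)
        if new_proteins == proteins:
            return proteins, scores, round_number
        proteins = new_proteins
    return proteins, scores, max_rounds


def train_midpoint(psms, labels):
    """Toy stand-in for a real learner: score = feature minus midpoint of class means."""
    pos = [p["feature"] for p, y in zip(psms, labels) if y]
    neg = [p["feature"] for p, y in zip(psms, labels) if not y]
    if not pos or not neg:
        return lambda p: 0.0
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda p: p["feature"] - mid


def select_two_hits(psms, scores):
    """Toy selection rule: proteins with at least two positively scored hits."""
    counts = Counter(p["protein"] for p, s in zip(psms, scores) if s > 0)
    return {protein for protein, c in counts.items() if c >= 2}


example_psms = [
    {"protein": "A", "feature": 9.0}, {"protein": "A", "feature": 8.0},
    {"protein": "B", "feature": 7.5}, {"protein": "B", "feature": 7.0},
    {"protein": "C", "feature": 2.0}, {"protein": "C", "feature": 1.5},
    {"protein": "D", "feature": 1.0},
]
final_proteins, final_scores, rounds = iterate_labels(
    example_psms, {"A"}, train_midpoint, select_two_hits)
```

In this toy run the initial guess {"A"} grows to {"A", "B"} in the first round and is unchanged in the second, so the loop stops after two rounds.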
19
Unsupervised Learning
20
Unsupervised Learning (S17)
21
Unsupervised Learning (S17)
22
Protein Selection Heuristic
• Modeled on typical protein identification criteria
  • High confidence peptide IDs
  • At least 2 non-overlapping peptides
  • At least 10% sequence coverage
• Robust, fast convergence
• Easily enforce additional constraints
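The three criteria listed above can be sketched as a simple filter. The thresholds of 2 peptides and 10% coverage come from the slide; the confidence cutoff of 0.9, the input format, and the function names are assumptions made for illustration.

```python
def max_non_overlapping(intervals):
    """Largest number of pairwise non-overlapping [start, end) intervals (greedy by end)."""
    count = 0
    last_end = None
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start >= last_end:
            count += 1
            last_end = end
    return count


def coverage(intervals, protein_length):
    """Fraction of residues covered by at least one peptide."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return len(covered) / protein_length


def select_proteins(peptide_hits, protein_lengths,
                    min_confidence=0.9, min_peptides=2, min_coverage=0.10):
    """peptide_hits: list of (protein_id, start, end, confidence), 0-based half-open positions.

    min_confidence is a hypothetical cutoff for "high confidence".
    """
    by_protein = {}
    for protein, start, end, confidence in peptide_hits:
        if confidence >= min_confidence:
            by_protein.setdefault(protein, set()).add((start, end))
    selected = set()
    for protein, intervals in by_protein.items():
        if (max_non_overlapping(intervals) >= min_peptides
                and coverage(intervals, protein_lengths[protein]) >= min_coverage):
            selected.add(protein)
    return selected


example_lengths = {"P1": 100, "P2": 100, "P3": 400}
example_hits = [
    ("P1", 0, 10, 0.99), ("P1", 50, 62, 0.95),   # two separate peptides, 22% coverage
    ("P2", 0, 10, 0.99), ("P2", 5, 15, 0.97),    # peptides overlap
    ("P3", 0, 10, 0.99), ("P3", 20, 30, 0.98),   # only 5% coverage
    ("P1", 70, 80, 0.20),                        # low confidence, ignored
]
example_selected = select_proteins(example_hits, example_lengths)
```

Additional constraints can be added as further conditions in the final `if`, which is one reading of the "easily enforce additional constraints" point.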
23
What about real data?
Dr. Rado Goldman (LCCC, GUMC)
• Proteolytic serum peptides from clinical hepatocellular carcinoma samples
• ~ 200 MALDI MS/MS Spectra (TOF-TOF)
PepArML for non-specific search of IPI-Human
• Increase in confidence & sensitivity
• Observation of “ragged” proteolytic trimming
24
Protein Identification Example
[Figure: identification results in columns M, T, and O. Key (E-value): < 1e-5, < 0.05, Any ID, No ID]
25
Future Directions
• Apply to more experimental datasets
• Integrate
  • novel features
  • new search engines, spectral matching
  • multiple searches with varied parameters, sequence databases
• Construct meta-search engine
• FDR by bimodal fit instead of decoys
• Release as open source
  • http://peparml.sourceforge.org
26
http://PepArML.SourceForge.Net
27
Acknowledgements
• Xue Wu* & Dr. Chau-Wen Tseng
  • Computer Science, University of Maryland, College Park
• Dr. Brian Balgley, Dr. Paul Rudnick
  • Calibrant Biosystems & NIST
• Dr. Rado Goldman, Dr. Yanming An
  • Department of Oncology, Georgetown University Medical Center
• Kam Ho To
  • Biochemistry Masters student, Georgetown University
• Funding: NIH/NCI CPTAC
29
PepArML vs Search Engines (S17)
30
PepArML vs Search Engines (S17)
31
PepArML Pairs vs PepArML (C8)
32
PepArML Pairs vs PepArML (S17)
33
PepArML Pairs vs PepArML (S17)
34
Unsupervised Learning (C8)
35
Unsupervised Learning (C8)