Practical Guide to Significantly Improve Peptide Identification Sensitivity and Accuracy
Bin Ma, CTO, Bioinformatics Solutions Inc.
June 5, 2011.
The Sensitivity and Accuracy Dilemma
[Figure: distributions of false and true hits along the score axis.]
FDR = (# reported false hits) / (# reported hits)
Publication Guideline
• Earlier experiments paid too much attention to sensitivity and not enough to accuracy.
• MCP started the guideline in 2004 to ensure accuracy.
People are generally over-optimistic about how reliable their results are. – ABRF iPRG 2011.
iPRG/ABRF 2011 Study
30 out of 45 submissions have FDR much higher than the required 1%
[Figure: estimated FDR lower bound and estimated FDR upper bound of the submissions, with the 1% level marked.]
PEAKS Achieved both Sensitivity and Accuracy
[Figure: the submissions with the 1% FDR level and the PEAKS entries marked. Axis label: more peptides in submission.]
Outline
1. FDR – pitfalls and solutions
2. De novo sequencing assisted database search
3. Three essential examinations to ensure result quality.
1. FDR – pitfalls and solutions
FDR Estimation
[Diagram: Protein DB (target + decoy) → Search Engine → Identified Peptides]
# false target hits ≈ # decoy hits
FDR = (# decoy hits) / (# target hits)
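The target-decoy estimate can be sketched in a few lines of Python. This is a minimal illustration, not the PEAKS implementation: the function names `estimate_fdr` and `threshold_for_fdr` and the `(score, is_decoy)` tuple layout are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Hit = Tuple[float, bool]  # (score, is_decoy); this layout is an assumption


def estimate_fdr(hits: List[Hit], threshold: float) -> float:
    """Estimate FDR as #decoy hits / #target hits among hits scoring >= threshold."""
    n_decoy = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    n_target = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    if n_target == 0:
        return 0.0  # nothing reported, so nothing can be false
    return n_decoy / n_target


def threshold_for_fdr(hits: List[Hit], max_fdr: float) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr, or None."""
    best = None
    for candidate in sorted({score for score, _ in hits}, reverse=True):
        if estimate_fdr(hits, candidate) <= max_fdr:
            best = candidate  # keep lowering the threshold while the FDR allows it
    return best
```

The estimate is only as good as the assumption that false target hits and decoy hits are equally likely, which is exactly what the pitfalls that follow can break.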
Pitfall 1 – Multiple Round Search
Round 1. Fast Search
Round 2. More Sensitive Search
More targets than decoys in the second-round database → # false target hits > # decoy hits → FDR underestimation.
Craig and Beavis 2004. Bioinformatics 20, 1466–67.
Bern and Kil 2011. J Proteome Res. 10, 2123–27.
Everett et al. 2010. J Proteome Res. 9, 700–707.
Our Solution: Decoy Fusion
Fast Search
More Sensitive Search
A decoy sequence is appended to each target protein.
PEAKS DB paper. Submitted.
Equal targets and decoys
# false target hits ≈ # decoy hits
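A minimal sketch of the decoy fusion idea follows. The slides do not say how the decoy sequence is generated, so reversing the target sequence is an assumption here, and the names `make_decoy`, `fuse_decoys`, and `hit_is_decoy` are hypothetical.

```python
from typing import Dict, Tuple


def make_decoy(sequence: str) -> str:
    """Build a decoy from a target sequence (assumption: simple reversal)."""
    return sequence[::-1]


def fuse_decoys(proteins: Dict[str, str]) -> Dict[str, Tuple[str, int]]:
    """Append a decoy to each target protein.

    Returns, per protein, the fused sequence and the index where the decoy
    part begins (equal to the length of the target part).
    """
    return {name: (seq + make_decoy(seq), len(seq)) for name, seq in proteins.items()}


def hit_is_decoy(fused_entry: Tuple[str, int], start: int) -> bool:
    """A peptide hit counts as a decoy hit if it starts in the decoy part."""
    _, boundary = fused_entry
    return start >= boundary
```

Because every entry carries its own decoy, a protein shortlisted by the fast search brings an equally long decoy into the more sensitive search, so the two rounds see equal targets and decoys.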
Pitfall 2 – Mix Protein and Peptide ID
Idea: Peptides on a multi-hit protein get a bonus on their scores to increase sensitivity.
Pitfall: More multi-hit proteins from target DB → more false hits are “saved” from target DB → FDR underestimation.
A weak hit is “saved” due to the bonus.
So is this weak false hit.
[Figure legend: decoy hit; target false hit]
Our Solution: Decoy Fusion
Weak false hits are “saved” with approx. equal probabilities in target and decoy.
Get the sensitivity, but still estimate the FDR correctly.
Pitfall 3 – Machine Learning with Decoy
Idea: Re-train the coefficients of the scoring function for every search after knowing the decoy hits.
Pitfall: Risk of over-fit. Machine learning experts only.
Adjust the scoring function to remove decoy hits after search.
Fewer target false hits are removed → FDR underestimation
[Figure labels: Search; target false hits; decoy hits]
Solutions
1. Don’t use it. Judges cannot be players.
or
2. Only use it for very large datasets.
or
3. Train coefficients and reuse; don’t re-train for every search.
PEAKS 5.3
• PEAKS DB used all these techniques (and many more) to ensure the accuracy while maximizing sensitivity.
• Reliable FDR estimation is the top priority in PEAKS DB design.
2. De novo sequencing assisted database search
An Idea to Improve Score Function
[Figure: distributions of false and true hits along the score axis.]
Idea: If de novo matches a DB peptide, it is likely to be correct.
De Novo Assisted DB Search
[Figure: hits plotted by DB Search Score and by # matched amino acids between de novo & DB search; best separation line: x + 4y. The resulting score separates the false and true distributions.]
Including de novo matching as a feature gives the score function a better discriminative power.
[Figure: false and true score distributions, before and after.]
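The x + 4y combination can be sketched as follows. The slide does not label the axes explicitly, so this assumes x is the DB search score and y is the number of matched amino acids; the position-by-position comparison and the names `matched_residues` and `combined_score` are simplifications made for illustration.

```python
def matched_residues(de_novo: str, db_peptide: str) -> int:
    """Count positions where the de novo sequence agrees with the DB peptide.

    Assumption: a plain position-by-position comparison; a real tool would
    also handle mass-equivalent residues and partial sequence tags.
    """
    return sum(1 for a, b in zip(de_novo, db_peptide) if a == b)


def combined_score(db_score: float, de_novo: str, db_peptide: str,
                   weight: float = 4.0) -> float:
    """Combine the two features as x + weight * y (weight 4 taken from the slide)."""
    return db_score + weight * matched_residues(de_novo, db_peptide)
```

A DB hit that the de novo result largely confirms is pushed up, while a hit with little de novo support keeps roughly its original score.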
This is just one example of many other new features in PEAKS 5.3 for improving score function.
“… far better than what I could ever squeeze out of my data” – Stefano Gotta, Siena Biotech
[Figure: FDR (0.0% to 2.5%) versus # of PSM (0 to 4000), product M compared with PEAKS DB.]
PEAKS DB Workflow
[Diagram: All Spectra → De Novo → DB search → Found? Yes: DB peptides. No: De novo only.]
De novo both helps to improve DB search and reports novel peptides.
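The workflow can be sketched as a simple routing loop. The `de_novo` and `db_search` callables are hypothetical stand-ins supplied by the caller, not PEAKS functions, and `run_workflow` is a name chosen for this example.

```python
from typing import Callable, Dict, Optional, Tuple


def run_workflow(
    spectra: Dict[str, object],
    de_novo: Callable[[object], str],
    db_search: Callable[[object, str], Optional[str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Route each spectrum to 'DB peptides' or 'De novo only'."""
    db_peptides: Dict[str, str] = {}
    de_novo_only: Dict[str, str] = {}
    for spectrum_id, spectrum in spectra.items():
        tag = de_novo(spectrum)            # de novo sequencing runs first
        hit = db_search(spectrum, tag)     # DB search may use the de novo result
        if hit is not None:
            db_peptides[spectrum_id] = hit      # Found? Yes
        else:
            de_novo_only[spectrum_id] = tag     # Found? No: possibly a novel peptide
    return db_peptides, de_novo_only
```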
3. Three essential examinations to ensure result quality.
Don’t Trust Software Blindly!
• Google “Don’t trust software blindly” returned 5,140,000 results.
• As you quality control your experiments, quality control the software’s results too.
Essential Examination 1
#decoy ≈ #target in low score region
Low #decoy in high score region
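One way to run this first examination on exported results is sketched here. The `(score, is_decoy)` layout, the function names, and both tolerance values are hypothetical choices for illustration, not recommended cut-offs.

```python
from typing import Dict, List, Tuple


def decoy_target_counts(hits: List[Tuple[float, bool]],
                        threshold: float) -> Dict[str, Dict[str, int]]:
    """Count target and decoy hits below ('low') and at or above ('high') the threshold."""
    counts = {"low": {"target": 0, "decoy": 0}, "high": {"target": 0, "decoy": 0}}
    for score, is_decoy in hits:
        region = "high" if score >= threshold else "low"
        kind = "decoy" if is_decoy else "target"
        counts[region][kind] += 1
    return counts


def looks_healthy(counts: Dict[str, Dict[str, int]],
                  low_tolerance: float = 0.25,
                  high_max_fraction: float = 0.05) -> bool:
    """Check both expectations; the tolerances are illustrative assumptions."""
    low, high = counts["low"], counts["high"]
    low_total = low["target"] + low["decoy"]
    high_total = high["target"] + high["decoy"]
    # Low-score region: decoy and target counts should be roughly equal.
    balanced_low = (low_total > 0
                    and abs(low["decoy"] - low["target"]) <= low_tolerance * low_total)
    # High-score region: decoys should be rare.
    few_high = high_total > 0 and high["decoy"] <= high_max_fraction * high_total
    return balanced_low and few_high
```

If targets clearly outnumber decoys in the low score region, the decoy count is probably underestimating the number of false target hits.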
Essential Examination 2
High-scoring peptides should have low precursor error.
Precursor error starts to scatter below the score threshold.
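A sketch of this second examination, assuming each PSM is available as a `(score, observed_mz, theoretical_mz)` tuple. That layout and the function names are assumptions, and the population standard deviation is just one possible measure of scatter.

```python
import statistics
from typing import List, Optional, Tuple

PSM = Tuple[float, float, float]  # (score, observed m/z, theoretical m/z); assumed layout


def precursor_error_ppm(observed_mz: float, theoretical_mz: float) -> float:
    """Precursor mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def error_scatter(psms: List[PSM],
                  threshold: float) -> Tuple[Optional[float], Optional[float]]:
    """Return the ppm-error scatter (std. dev.) at/above and below the score threshold."""
    high = [precursor_error_ppm(obs, theo) for score, obs, theo in psms if score >= threshold]
    low = [precursor_error_ppm(obs, theo) for score, obs, theo in psms if score < threshold]
    scatter_high = statistics.pstdev(high) if high else None
    scatter_low = statistics.pstdev(low) if low else None
    return scatter_high, scatter_low
```

If the scatter above the threshold is not clearly tighter than below it, the threshold may be admitting too many false hits.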
Essential Examination 3
• Spectrum annotation around score threshold.
Take Home Message
• Another year of dedicated work on PEAKS.
• Ensured accuracy; maximized sensitivity.
• Do the three essential examinations.
– They are simple … at least in PEAKS.
“a big step forward” – Christian Schmelzer, Martin Luther University
Enjoy!
http://www.bioinfor.com/peaks-download-a-pricing