Post on 21-Jan-2016
Improving the Sensitivity of Peptide Identification
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Xue Wu, Chau-Wen Tseng
Department of Computer Science
University of Maryland, College Park
2
Lost peptide identifications
Missing from the sequence database
Search engine strengths, weaknesses, quirks
Poor score or statistical significance
Thorough search takes too long
3
Lost peptide identifications
Missing from the sequence database
  Build exhaustive peptide sequence databases
Search engine strengths, weaknesses, quirks
  Use multiple search engines and combine results
Poor score or statistical significance
  Use spectral matching to identify weak spectra
  Use search-engine consensus to boost confidence
  Use machine learning to distinguish true from false
Thorough search takes too long
  Harness the power of heterogeneous computational grids
4
Peptide Sequence Databases
All peptides at most 30 amino acids long from:
  IPI and all IPI constituent protein sequences
    IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
  SwissProt variants, conflicts, splices, and signal peptide truncations
  GenBank and RefSeq mRNA sequences
    3-frame translation
  GenBank EST and HTC sequences
    6-frame translation, and found in at least 2 sequences
Grouped by UniGene cluster and compressed.
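The EST rule above (6-frame translation, keep only peptides found in at least 2 sequences) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the minimum peptide length, and the toy sequences are hypothetical; only the 30-residue cap and the support threshold of 2 come from the slide.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Yield the 3 forward and 3 reverse-complement frame translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) are translated as X.
            yield "".join(CODON_TABLE.get(c, "X") for c in codons)


def peptides(translation, min_len=7, max_len=30):
    """Yield every stop-free substring of length min_len..max_len."""
    for segment in translation.split("*"):
        for start in range(len(segment)):
            last = min(start + max_len, len(segment))
            for end in range(start + min_len, last + 1):
                yield segment[start:end]


def supported_peptides(ests, min_support=2, min_len=7, max_len=30):
    """Return peptides observed in at least min_support distinct sequences."""
    support = Counter()
    for est in ests:
        seen = set()  # count each peptide once per sequence
        for translation in six_frame_translations(est):
            seen.update(peptides(translation, min_len, max_len))
        support.update(seen)
    return {pep for pep, n in support.items() if n >= min_support}
```

Enumerating every substring is quadratic in the sequence length and only suitable for small inputs; a production database would need the kind of compression the slide mentions.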
5
Peptide Sequence Databases
Formatted as a FASTA sequence database
Easy integration with search engines.
One entry per gene/cluster.
Automated rebuild every few months.

Organism     Size (AA)   Size (Entries)
Human        209Mb       75,043
Mouse        151Mb       55,929
Rat          67Mb        43,211
Zebra-fish   90Mb        47,922
6
Spectral Matching with HMMs
7
Spectral Matching with HMMs
[Figure: HMM state diagram with insert states I0 to I6 and ion states b1, y1, b2, y2, b3, y3; percentage labels: 11%, 17%, 6%, 94%, 8%, 0%, 11%, 86%, 17%, 0%, 6%, 92%, 19%]
8
Hidden Markov Model
States: Ion, Delete, Insert
An (m/z, intensity) pair is emitted by ion and insert states
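A model with ion, delete, and insert states has the same shape as a profile HMM, so a spectrum can be scored against it with the standard forward recursion. The sketch below is an illustration under assumptions rather than the model from the talk: the transition probabilities are fixed and shared by all states, ion states emit m/z from a Gaussian around an expected ion m/z, insert states emit m/z uniformly, and peak intensity is ignored even though the slide's states emit (m/z, intensity) pairs. All names and default values are hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf terms."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def forward_log_score(peaks, ion_mz, sigma=0.5, mz_range=2000.0,
                      p_ion=0.8, p_ins=0.1, p_del=0.1):
    """Forward log-score of peaks (sorted by m/z) against expected ions.

    peaks: list of (mz, intensity); intensity is ignored in this sketch.
    ion_mz: expected m/z of each ion state, in increasing order.
    """
    n, k_max = len(peaks), len(ion_mz)
    lt_ion, lt_ins, lt_del = math.log(p_ion), math.log(p_ins), math.log(p_del)
    log_background = -math.log(mz_range)  # uniform insert emission
    ion = [[NEG_INF] * (n + 1) for _ in range(k_max + 1)]
    ins = [[NEG_INF] * (n + 1) for _ in range(k_max + 1)]
    dele = [[NEG_INF] * (n + 1) for _ in range(k_max + 1)]
    ion[0][0] = 0.0  # begin state: no ions visited, no peaks emitted
    for k in range(k_max + 1):
        for i in range(n + 1):
            if k == 0 and i == 0:
                continue
            if k > 0 and i > 0:  # ion state k emits peak i
                prev = logsumexp(ion[k - 1][i - 1], ins[k - 1][i - 1], dele[k - 1][i - 1])
                ion[k][i] = prev + lt_ion + log_gauss(peaks[i - 1][0], ion_mz[k - 1], sigma)
            if i > 0:  # insert state emits peak i as noise
                prev = logsumexp(ion[k][i - 1], ins[k][i - 1], dele[k][i - 1])
                ins[k][i] = prev + lt_ins + log_background
            if k > 0:  # delete state skips ion k without emitting
                prev = logsumexp(ion[k - 1][i], ins[k - 1][i], dele[k - 1][i])
                dele[k][i] = prev + lt_del
    return logsumexp(ion[k_max][n], ins[k_max][n], dele[k_max][n])
```

A spectrum whose peaks fall near the expected ions scores higher than one that has to be explained entirely by insert and delete states. A trained model would learn per-state transition and emission parameters instead of the shared constants used here.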
9
Boosting Identification Sensitivity
[Figure labels: Test, Train, Other (high confidence ids only); Other, Model, None (low confidence ids)]
10
Spectral Matching of Peptide Variants
DFLAGGVAAAISK
DFLAGGIAAAISK
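The two peptides above differ by a single V to I substitution, so only the fragment ions that contain position 7 move, each by the same mass difference. The sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and reports which ions shift; the function names are hypothetical and the code is an illustration, not the matching method used in the talk.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return {label: m/z} for singly charged b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for k in range(1, len(peptide)):
        ions[f"b{k}"] = sum(masses[:k]) + PROTON
        ions[f"y{k}"] = sum(masses[-k:]) + WATER + PROTON
    return ions


def shifted_ions(peptide_a, peptide_b, tol=1e-6):
    """Return {label: m/z difference} for ions that differ between variants.

    Assumes both peptides have the same length.
    """
    ions_a, ions_b = fragment_ions(peptide_a), fragment_ions(peptide_b)
    return {
        label: ions_b[label] - ions_a[label]
        for label in ions_a
        if abs(ions_b[label] - ions_a[label]) > tol
    }


shifts = shifted_ions("DFLAGGVAAAISK", "DFLAGGIAAAISK")
```

For this pair, b1 to b6 and y1 to y6 are unchanged, while b7 to b12 and y7 to y12 each shift by about 14.016 Da, the mass of one CH2 group.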
11
Spectral Matching Extrapolation
12
Comparison of search engine results
No single score is comprehensive
Search engines disagree
Many spectra lack confident peptide assignment
Searle et al. JPR 7(1), 2008
[Venn diagram: spectra identified by X! Tandem, SEQUEST, and Mascot; region labels: 38%, 14%, 28%, 14%, 3%, 2%, 1%]
13
Combining search engine results – harder than it looks!
Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!
How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?
We apply unsupervised machine learning...
  Lots of related work unified in a single framework.
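One simple way to make "consensus vs disagreement vs abstention" concrete is to summarize, per spectrum, how many engines support the most popular peptide, how many report something else, and how many report nothing. The sketch below is a hypothetical summary of that kind, which could serve as input features for a learner; it is not the framework described in the talk, and the function name and input layout are assumptions.

```python
from collections import Counter


def agreement_summary(engine_ids):
    """Summarize engine agreement for one spectrum.

    engine_ids: dict mapping engine name -> peptide string, or None when
    the engine reported no identification (abstention).
    Returns (top_peptide, votes_for, votes_against, abstentions).
    Ties for the top peptide go to the peptide seen first in the dict.
    """
    reported = [pep for pep in engine_ids.values() if pep is not None]
    abstentions = len(engine_ids) - len(reported)
    if not reported:
        return (None, 0, 0, abstentions)
    top_peptide, votes_for = Counter(reported).most_common(1)[0]
    votes_against = len(reported) - votes_for
    return (top_peptide, votes_for, votes_against, abstentions)
```

For example, two engines agreeing on a peptide while a third stays silent yields 2 votes for, 0 against, and 1 abstention, which is a different situation from 2 for and 1 against even though a plain vote count would treat them alike.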
14
Supervised Learning
15
Unsupervised Learning
16
PepArML Combining Results
[Figure panels: Q-TOF, LTQ, MALDI]
17
Unsupervised Learning
[Figure: curves labeled HC-TMO, U-TMO, U*-TMO; axis labels: False Positive Rate, Iteration]
18
Searching for Consensus
Search engine quirks can destroy consensus
  Initial methionine loss as tryptic peptide
  Charge state enumeration or guessing
  X!Tandem's refinement mode
  Pyro-Gln, Pyro-Glu modifications
  Difficulty tracking spectrum identifiers
  Precursor mass tolerance (Da vs ppm)
Decoy searches must be identical!
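Decoy searches are typically used to estimate a false discovery rate, which is only meaningful when the decoy search uses the same settings as the target search. The sketch below shows one common setup as an assumption, not as the procedure used in the talk: decoys are made by reversing each protein sequence, target and decoy databases are searched separately, and the FDR at a score threshold is estimated as decoy hits divided by target hits. The function names and the "DECOY_" prefix are hypothetical.

```python
def reverse_decoy(header, sequence, prefix="DECOY_"):
    """Return a decoy entry with a marked header and reversed sequence."""
    return prefix + header, sequence[::-1]


def estimated_fdr(psms, threshold):
    """Estimate FDR among target hits scoring at or above threshold.

    psms: iterable of (score, is_decoy) pairs, higher score is better.
    Returns decoy_count / target_count, or 0.0 when no target passes.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

If the decoy search used a different precursor tolerance, modification list, or charge handling than the target search, the decoy score distribution would no longer model the incorrect target hits and the ratio above would be biased.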
19
Configuring for Consensus
Search engine configuration can be difficult:
  Correct spectral format
  Search parameter files and command-line
  Pre-processed sequence databases
  Tracking spectrum identifiers
  Extracting peptide identifications, especially modifications and protein identifiers
20
Peptide Identification Meta-Search Parameters
Instrument: Precursor Tolerance, Fragment Tolerance, Max. Charge
Sequence Database: Target/Decoy
Modification: Fixed/Variable, Amino-Acids, Position, Delta
Proteolytic Agent: Motif
Peptide Candidates: Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks
Search Engines: Mascot, X!Tandem, OMSSA, MyriMatch
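A unified parameter set has to be translated into each engine's own conventions, and precursor tolerance units are one place where that matters. The sketch below is a hypothetical container for a few of the parameters listed above, with a helper that expresses the precursor tolerance in Da whichever unit it was given in. The class, field names, and defaults are assumptions for illustration and do not reflect the actual meta-search tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaSearchParams:
    """Hypothetical subset of engine-independent search parameters."""
    precursor_tolerance: float
    precursor_unit: str            # "Da" or "ppm"
    fragment_tolerance_da: float
    max_charge: int
    missed_cleavages: int
    engines: tuple = ("Mascot", "X!Tandem", "OMSSA", "MyriMatch")

    def precursor_tolerance_da(self, precursor_mass):
        """Return the precursor tolerance in Da at the given mass."""
        if self.precursor_unit == "Da":
            return self.precursor_tolerance
        if self.precursor_unit == "ppm":
            # ppm is relative: parts per million of the precursor mass.
            return precursor_mass * self.precursor_tolerance / 1e6
        raise ValueError(f"unknown tolerance unit: {self.precursor_unit!r}")


params = MetaSearchParams(10.0, "ppm", 0.5, max_charge=3, missed_cleavages=1)
```

With these values, a 10 ppm tolerance corresponds to 0.02 Da at a precursor mass of 2000 Da, so an engine configured in Da and one configured in ppm consider different candidate sets unless the conversion is done per precursor.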
21
Peptide Identification Meta-Search
Simple unified search interface for: Mascot, X!Tandem, OMSSA, MyriMatch
Automatic decoy searches
Automatic spectrum file "chunking"
Automatic scheduling: serial, multi-processor, cluster, grid
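Chunking and scheduling can be illustrated on a single machine: split the spectra into fixed-size chunks, search each chunk independently, and concatenate the results in the original order. The sketch below uses a thread pool as a stand-in for the cluster or grid scheduler; `search_chunk` is a hypothetical callable supplied by the caller, and nothing here reflects the actual scheduler from the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunk_spectra(spectra, chunk_size):
    """Yield successive lists of at most chunk_size spectra."""
    iterator = iter(spectra)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def search_all(spectra, search_chunk, chunk_size=100, workers=4):
    """Run search_chunk on every chunk and flatten results in input order.

    search_chunk: hypothetical function taking a list of spectra and
    returning a list of results, one search job per call.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        per_chunk = pool.map(search_chunk, chunk_spectra(spectra, chunk_size))
        return [hit for hits in per_chunk for hit in hits]
```

Because each chunk is an independent job, the same decomposition carries over to serial, multi-processor, cluster, or grid execution; only the executor changes.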
22
Peptide Identification Meta-Search
[Diagram labels]
NSF TeraGrid: 1000+ CPUs
UMIACS: 250+ CPUs
Edwards Lab: Scheduler & 48+ CPUs
Secure communication
Heterogeneous compute resources
Simple search request
23
Conclusions
Improve sensitivity of peptide identification
  Exhaustive peptide sequence databases
  Machine-learning for matching and combining
  Meta-search tools maximize consensus
  Grid-computing to achieve thorough search
24
Acknowledgements
Catherine Fenselau, University of Maryland Biochemistry
Funding: NIH/NCI, USDA/ARS