Improving the Sensitivity of Peptide Identification Nathan Edwards Department of Biochemistry and...

Post on 21-Jan-2016


Improving the Sensitivity of Peptide Identification

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Xue Wu, Chau-Wen Tseng
Department of Computer Science
University of Maryland, College Park

2

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

3

Lost peptide identifications

Missing from the sequence database
  Build exhaustive peptide sequence databases

Search engine strengths, weaknesses, quirks
  Use multiple search engines and combine results

Poor score or statistical significance
  Use spectral matching to identify weak spectra
  Use search-engine consensus to boost confidence
  Use machine learning to distinguish true from false

Thorough search takes too long
  Harness the power of heterogeneous computational grids

4

Peptide Sequence Databases

All peptides at most 30 amino acids long from:
  IPI and all IPI constituent protein sequences (IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank)
  SwissProt variants, conflicts, splices, and signal peptide truncations
  GenBank and RefSeq mRNA sequences: 3-frame translation
  GenBank EST and HTC sequences: 6-frame translation, and found in at least 2 sequences

Grouped by UniGene cluster and compressed.
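As an illustration of the EST rule above (6-frame translation, peptide kept only if found in at least 2 sequences), here is a minimal sketch. It is not the pipeline used to build these databases: the function names, the `min_support` parameter, and the choice to break translations at stop codons are all assumptions made for the example. Input is assumed to be uppercase DNA; any codon not in the standard table is translated as `X`.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame(dna):
    """Three forward frames followed by three reverse-complement frames."""
    return [translate(strand, frame)
            for strand in (dna, reverse_complement(dna))
            for frame in range(3)]


def peptides(protein, max_len=30):
    """All substrings of length 1..max_len that do not span a stop codon."""
    found = set()
    for segment in protein.split("*"):
        for start in range(len(segment)):
            stop = min(start + max_len, len(segment))
            for end in range(start + 1, stop + 1):
                found.add(segment[start:end])
    return found


def supported_peptides(est_sequences, max_len=30, min_support=2):
    """Peptides seen in the 6-frame translation of at least min_support ESTs."""
    support = defaultdict(int)
    for dna in est_sequences:
        seen = set()
        for protein in six_frame(dna):
            seen |= peptides(protein, max_len)
        for peptide in seen:  # each EST counts at most once per peptide
            support[peptide] += 1
    return {p for p, n in support.items() if n >= min_support}
```

Each EST contributes at most one vote per peptide, so a peptide repeated within a single sequence is not mistaken for independent support.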

5

Peptide Sequence Databases

Formatted as a FASTA sequence database
  Easy integration with search engines.
  One entry per gene/cluster.

Automated rebuild every few months.

Organism     Size (AA)   Size (Entries)
Human        209Mb       75,043
Mouse        151Mb       55,929
Rat          67Mb        43,211
Zebra-fish   90Mb        47,922

6

Spectral Matching with HMMs

7

Spectral Matching with HMMs

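Figure: HMM state diagram with insert states I0 to I6 and ion states b1, y1, b2, y2, b3, y3. Percentages shown on the diagram, in extraction order: 11%, 17%, 6%, 94%, 8%, 0%, 11%, 86%, 17%, 0%, 6%, 92%, 19%.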

8

Hidden Markov Model

State types: Ion, Delete, Insert

(m/z, intensity) pair emitted by ion and insert states

9

Boosting Identification Sensitivity

Figure labels: Test, Train, Other (high confidence ids only); Other, Model, None (low confidence ids)

10

Spectral Matching of Peptide Variants

DFLAGGVAAAISK

DFLAGGIAAAISK
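The two peptides above differ by a single residue (V to I at position 7). As a worked illustration of why a spectrum of one variant carries information about the other, the sketch below computes singly charged b- and y-ion m/z values and reports which fragments move. It uses standard monoisotopic residue masses; the function names are hypothetical, and this is not the HMM matching method itself.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values, keyed by ion name (b1, y1, ...)."""
    masses = [MONO[aa] for aa in peptide]
    ions = {}
    for k in range(1, len(peptide)):
        ions[f"b{k}"] = sum(masses[:k]) + PROTON           # N-terminal prefix
        ions[f"y{k}"] = sum(masses[-k:]) + WATER + PROTON  # C-terminal suffix
    return ions


def shifted_ions(pep_a, pep_b, tol=0.01):
    """m/z shift (pep_b minus pep_a) for each ion that moves by more than tol.

    Assumes both peptides have the same length.
    """
    a, b = fragment_ions(pep_a), fragment_ions(pep_b)
    return {name: b[name] - a[name] for name in a if abs(b[name] - a[name]) > tol}


shifts = shifted_ions("DFLAGGVAAAISK", "DFLAGGIAAAISK")
print(sorted(shifts))
```

Only the fragments that contain position 7 (b7 to b12 and y7 to y12) shift, each by the V to I mass difference of about 14.016 Da; b1 to b6 and y1 to y6 are unchanged.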

11

Spectral Matching Extrapolation

12

Comparison of search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

Searle et al. JPR 7(1), 2008

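Figure: Venn diagram of identifications by X! Tandem, SEQUEST, and Mascot. Region percentages, in extraction order: 38%, 14%, 28%, 14%, 3%, 2%, 1%.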

13

Combining search engine results – harder than it looks!

Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!

How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?

We apply unsupervised machine learning...
  Lots of related work unified in a single framework.
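To make "consensus vs disagreement vs abstention" concrete, the sketch below tallies, for each spectrum, how many engines agree with the most common peptide, how many report a different peptide, and how many report nothing. This is only a counting illustration with a hypothetical data layout (engine name mapped to a spectrum-id-to-peptide dict); it is not the machine-learning combiner described in these slides, though counts like these could plausibly serve as input features for one.

```python
from collections import Counter


def tally_votes(results, spectrum_ids):
    """Summarize engine agreement per spectrum.

    results: {engine_name: {spectrum_id: peptide}}; a missing spectrum id
    means that engine abstained. Returns
    {spectrum_id: (peptide, agree, disagree, abstain)}.
    """
    summary = {}
    for sid in spectrum_ids:
        calls = [ids[sid] for ids in results.values() if sid in ids]
        abstain = len(results) - len(calls)
        if not calls:
            summary[sid] = (None, 0, 0, abstain)
            continue
        # Ties go to the peptide seen first; a real combiner needs a better rule.
        peptide, agree = Counter(calls).most_common(1)[0]
        summary[sid] = (peptide, agree, len(calls) - agree, abstain)
    return summary
```

Note that a high agreement count is not by itself a significance estimate: as the slide points out, incorrect identifications are correlated across engines.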

14

Supervised Learning

15

Unsupervised Learning

16

PepArML Combining Results

Figure panels: Q-TOF, LTQ, MALDI

17

Unsupervised Learning

Figure: curves for HC-TMO, U-TMO, and U*-TMO. Axis labels: False Positive Rate, Iteration.

18

Searching for Consensus

Search engine quirks can destroy consensus
  Initial methionine loss as tryptic peptide
  Charge state enumeration or guessing
  X!Tandem's refinement mode
  Pyro-Gln, Pyro-Glu modifications
  Difficulty tracking spectrum identifiers
  Precursor mass tolerance (Da vs ppm)

Decoy searches must be identical!
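One reason decoy searches must match the target searches is that decoy hits are used to estimate how many target hits are wrong. The slides do not say which estimator is used; the sketch below assumes the simple and widely used ratio of decoy hits to target hits above a score threshold, from separate target and decoy searches. Function names and the "higher score is better" convention are assumptions.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR at a threshold: decoy hits / target hits scoring >= threshold."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest target score whose estimated FDR is at most max_fdr, else None."""
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        if decoy_fdr(target_scores, decoy_scores, threshold) <= max_fdr:
            best = threshold  # keep lowering while the estimate stays acceptable
    return best
```

If the decoy search used different tolerances, modifications, or charge handling than the target search, the decoy score distribution would no longer model the incorrect target hits, and an estimate like this would be biased.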

19

Configuring for Consensus

Search engine configuration can be difficult:
  Correct spectral format
  Search parameter files and command-line
  Pre-processed sequence databases
  Tracking spectrum identifiers
  Extracting peptide identifications, especially modifications and protein identifiers

20

Peptide Identification Meta-Search Parameters

Instrument: Precursor Tolerance, Fragment Tolerance, Max. Charge

Sequence Database: Target/Decoy

Modification: Fixed/Variable, Amino-Acids, Position, Delta

Proteolytic Agent: Motif

Peptide Candidates: Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks

Search Engines: Mascot, X!Tandem, OMSSA, MyriMatch
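A sketch of how a few of these parameters might be held in one engine-independent request, with the Da vs ppm distinction made explicit. Every class and field name here is hypothetical and chosen for illustration; the actual meta-search parameter format is not shown in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    value: float
    unit: str  # "Da" or "ppm"

    def in_da(self, mz):
        """Absolute tolerance in Da at a given m/z."""
        if self.unit == "Da":
            return self.value
        if self.unit == "ppm":
            return mz * self.value / 1e6  # parts per million of the m/z value
        raise ValueError(f"unknown tolerance unit: {self.unit!r}")


@dataclass
class MetaSearchRequest:
    # Hypothetical field names mirroring some of the parameter groups above.
    precursor_tolerance: Tolerance
    fragment_tolerance: Tolerance
    max_charge: int
    sequence_database: str
    decoy: bool
    missed_cleavages: int
    engines: tuple = ("Mascot", "X!Tandem", "OMSSA", "MyriMatch")


request = MetaSearchRequest(
    precursor_tolerance=Tolerance(10, "ppm"),
    fragment_tolerance=Tolerance(0.5, "Da"),
    max_charge=3,
    sequence_database="human",  # placeholder database name
    decoy=True,
    missed_cleavages=1,
)
print(request.precursor_tolerance.in_da(1000.0))
```

Storing the unit alongside the value lets a translation layer convert per engine, since a 10 ppm window is 0.01 Da at m/z 1000 but 0.02 Da at m/z 2000.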

21

Peptide Identification Meta-Search

Simple unified search interface for: Mascot, X!Tandem, OMSSA, MyriMatch

Automatic decoy searches

Automatic spectrum file "chunking"

Automatic scheduling: Serial, Multi-Processor, Cluster, Grid
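Spectrum file "chunking" can be pictured as splitting one large spectrum file into fixed-size groups of spectra that can be searched independently. The sketch below assumes MGF input, where each spectrum sits between `BEGIN IONS` and `END IONS` lines; the slides do not say which formats or chunk sizes the tool uses, and the function names are hypothetical.

```python
def read_mgf_spectra(lines):
    """Yield each spectrum as a list of lines, BEGIN IONS through END IONS."""
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            current = [stripped]
        elif current is not None:
            current.append(stripped)
            if stripped == "END IONS":
                yield current
                current = None


def chunk_spectra(spectra, chunk_size):
    """Group spectra into lists of at most chunk_size, preserving order."""
    chunk = []
    for spectrum in spectra:
        chunk.append(spectrum)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly short, chunk
        yield chunk
```

Because each chunk keeps the original `TITLE=` lines, results from separately scheduled chunks can still be traced back to the source spectra, which matters given how easily spectrum identifiers get lost.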

22

Peptide Identification Meta-Search

Figure: meta-search architecture diagram. Labels:
  NSF TeraGrid: 1000+ CPUs
  UMIACS: 250+ CPUs
  Edwards Lab: Scheduler & 48+ CPUs
  Secure communication
  Heterogeneous compute resources
  Simple search request

23

Conclusions

Improve sensitivity of peptide identification
  Exhaustive peptide sequence databases
  Machine learning for matching and combining
  Meta-search tools maximize consensus
  Grid computing to achieve thorough search

24

Acknowledgements

Catherine Fenselau, University of Maryland, Biochemistry

Funding: NIH/NCI, USDA/ARS