Improving the Sensitivity of Peptide Identification Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center Xue Wu, Chau-Wen Tseng Department of Computer Science University of Maryland, College Park

Improving the Sensitivity of Peptide Identification

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Xue Wu, Chau-Wen Tseng
Department of Computer Science
University of Maryland, College Park

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

Lost peptide identifications

Missing from the sequence database
  Build exhaustive peptide sequence databases

Search engine strengths, weaknesses, quirks
  Use multiple search engines and combine results

Poor score or statistical significance
  Use spectral-matching to identify weak spectra
  Use search-engine consensus to boost confidence
  Use machine-learning to distinguish true from false

Thorough search takes too long
  Harness the power of heterogeneous computational grids

Peptide Sequence Databases

All peptides at most 30 amino acids long from:
  IPI and all IPI constituent protein sequences
    IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
  SwissProt variants, conflicts, splices, and signal peptide truncations
  GenBank and RefSeq mRNA sequences
    3-frame translation
  GenBank EST and HTC sequences
    6-frame translation, and found in at least 2 sequences

Grouped by UniGene cluster and compressed.
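The "found in at least 2 sequences" filter above can be sketched in a few lines. This is a minimal illustration, assuming the EST sequences have already been translated into amino-acid fragments; the function name, its parameters, and the toy input are hypothetical, and the sketch makes no attempt at the compression step mentioned on the slide.

```python
from collections import defaultdict

def shared_peptides(sequences, max_len=30, min_support=2):
    """Return every peptide of length <= max_len found in >= min_support sequences."""
    support = defaultdict(set)  # peptide -> indices of the sequences containing it
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            stop = min(start + max_len, len(seq))
            for end in range(start + 1, stop + 1):
                support[seq[start:end]].add(idx)
    return {pep for pep, ids in support.items() if len(ids) >= min_support}

# Hypothetical translated EST fragments (not real data).
translations = ["MKTAYIAK", "KTAYIAKQ", "GGGG"]
shared = shared_peptides(translations, max_len=4)
print(sorted(pep for pep in shared if len(pep) == 4))
```

Enumerating every substring is quadratic in the length cap, which is fine for a sketch but not for a genome-scale build.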

Peptide Sequence Databases

Formatted as a FASTA sequence database
  Easy integration with search engines.
One entry per gene/cluster.
Automated rebuild every few months.

Organism     Size (AA)   Size (Entries)
Human        209Mb       75,043
Mouse        151Mb       55,929
Rat          67Mb        43,211
Zebra-fish   90Mb        47,922

Spectral Matching with HMMs

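[Figure: HMM state diagram with insert states I0 to I6 and ion states b1, y1, b2, y2, b3, y3. Percentages as extracted: 11%, 17%, 6%, 94%, 8%, 0%, 11%, 86%, 17%, 0%, 6%, 92%, 19%; their assignment to states is not recoverable from the transcript.]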

Hidden Markov Model

State types: Ion, Delete, Insert

(m/z, int) pair emitted by ion & insert states
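As a rough illustration of these state types, the sketch below samples peaks from a toy left-to-right HMM. Each expected fragment ion has an ion state that emits an (m/z, intensity) pair near the expected m/z and a silent delete state, and insert states between ions emit noise peaks. The transition probabilities, the Gaussian m/z jitter, the intensity ranges, and the function name are all assumptions made for illustration; this is not the model from the talk.

```python
import random

def sample_spectrum(expected_mz, p_ion=0.8, p_insert=0.3, mz_sd=0.2, seed=0):
    """Sample (state, m/z, intensity) emissions from a toy ion/delete/insert HMM.

    expected_mz: expected fragment m/z values in increasing order (hypothetical input).
    """
    rng = random.Random(seed)
    emissions = []
    lower = 0.0
    for k, mz in enumerate(expected_mz):
        # Insert state I_k: loops with probability p_insert, emitting a noise peak each time.
        while rng.random() < p_insert:
            emissions.append((f"I{k}", rng.uniform(lower, mz), rng.uniform(0.0, 0.2)))
        if rng.random() < p_ion:
            # Ion state: emits a peak near the expected fragment m/z.
            emissions.append((f"ion{k + 1}", rng.gauss(mz, mz_sd), rng.uniform(0.2, 1.0)))
        # Otherwise the delete state is taken: the ion is skipped and nothing is emitted.
        lower = mz
    # Final insert state after the last ion.
    while rng.random() < p_insert:
        emissions.append((f"I{len(expected_mz)}", rng.uniform(lower, lower + 100.0), rng.uniform(0.0, 0.2)))
    return emissions

for state, mz, intensity in sample_spectrum([175.1, 288.2, 401.3]):
    print(state, round(mz, 2), round(intensity, 2))
```

Scoring a real spectrum against such a model would use the forward or Viterbi algorithm rather than sampling; sampling is shown only because it makes the role of each state type visible.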

Boosting Identification Sensitivity

[Figure: labels as extracted: Test, Train, Other (high confidence ids only); Other, Model, None (low confidence ids).]

Spectral Matching of Peptide Variants

DFLAGGVAAAISK

DFLAGGIAAAISK
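To make the variant pair concrete, the sketch below computes singly charged b- and y-ion m/z values for both peptides and reports which fragments move. The V to I substitution adds one CH2 (about 14.016 Da), so only fragments that contain position 7 shift. The residue masses are standard monoisotopic values; the function names are illustrative and assume two peptides of equal length.

```python
# Monoisotopic residue masses (Da) for the residues used here.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "D": 115.02694, "K": 128.09496,
    "F": 147.06841,
}
PROTON = 1.007276
WATER = 18.010565

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values, keyed by ion name."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    ions = {f"b{i}": sum(masses[:i]) + PROTON for i in range(1, n)}
    ions.update({f"y{i}": sum(masses[-i:]) + WATER + PROTON for i in range(1, n)})
    return ions

def shifted_ions(pep_a, pep_b, tol=1e-6):
    """Ions whose m/z differs between two equal-length peptides, with the shift in Da."""
    a, b = fragment_mz(pep_a), fragment_mz(pep_b)
    return {ion: round(b[ion] - a[ion], 5) for ion in a if abs(b[ion] - a[ion]) > tol}

shifts = shifted_ions("DFLAGGVAAAISK", "DFLAGGIAAAISK")
print(sorted(shifts))
```

The b1 to b6 and y1 to y6 ions are shared by both variants, which is what lets a spectrum of one variant inform the match to the other.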

Spectral Matching Extrapolation

Comparison of search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

Searle et al. JPR 7(1), 2008

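[Figure: Venn diagram of spectra identified by X! Tandem, SEQUEST, and Mascot. Region percentages as extracted: 38%, 14%, 28%, 14%, 3%, 2%, 1%; their assignment to regions is not recoverable from the transcript.]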

Combining search engine results – harder than it looks!

Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!

How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?

We apply unsupervised machine-learning...
  Lots of related work unified in a single framework.
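The consensus, disagreement, and abstention cases can be made concrete with a small tally. This is only an illustrative sketch: the input layout (one optional peptide per engine per spectrum), the labels, and the function name are assumptions, and it is a plain vote count, not the unsupervised machine-learning approach the talk applies.

```python
from collections import Counter

def classify_spectrum(calls):
    """Label one spectrum from per-engine calls.

    calls maps engine name -> peptide string, or None if the engine
    reported nothing above its own threshold (abstention).
    """
    votes = Counter(pep for pep in calls.values() if pep is not None)
    abstentions = sum(1 for pep in calls.values() if pep is None)
    if not votes:
        label = "unidentified"
    elif len(votes) > 1:
        label = "disagreement"
    elif sum(votes.values()) > 1:
        label = "consensus"
    else:
        label = "single engine"
    return label, sorted(votes), abstentions

print(classify_spectrum({"Mascot": "DFLAGGVAAAISK", "X!Tandem": "DFLAGGVAAAISK", "OMSSA": None}))
```

A tally like this shows why simple voting loses sensitivity: every spectrum labelled "single engine" or "disagreement" is discarded unless something else rescues it.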

Supervised Learning

Unsupervised Learning

PepArML Combining Results

[Figure: result panels labelled Q-TOF, LTQ, and MALDI.]

Unsupervised Learning

[Figure: curves labelled HC-TMO, U-TMO, and U*-TMO; axis labels False Positive Rate and Iteration.]

Searching for Consensus

Search engine quirks can destroy consensus
  Initial methionine loss as tryptic peptide
  Charge state enumeration or guessing
  X!Tandem's refinement mode
  Pyro-Gln, Pyro-Glu modifications
  Difficulty tracking spectrum identifiers
  Precursor mass tolerance (Da vs ppm)

Decoy searches must be identical!
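One reason the decoy search has to mirror the target search exactly is that decoy hits stand in for incorrect target hits when estimating error rates. A minimal sketch of the usual decoy-based false discovery rate estimate follows, assuming separate target and decoy searches, higher scores being better, and a hypothetical list of (score, is_decoy) pairs. The slides do not say which estimator the authors use.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR at a score threshold as (#decoy hits) / (#target hits).

    psms: list of (score, is_decoy) pairs pooled from a target search and a
    decoy search run with identical settings.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Hypothetical scores, not real search output.
psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
print(decoy_fdr(psms, threshold=7.0))
```

If the decoy search differs from the target search in tolerance, modifications, or charge handling, the decoy count no longer tracks the number of incorrect target hits and the estimate is biased.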

Configuring for Consensus

Search engine configuration can be difficult:
  Correct spectral format
  Search parameter files and command-line
  Pre-processed sequence databases
  Tracking spectrum identifiers
  Extracting peptide identifications, especially modifications and protein identifiers

Peptide Identification Meta-Search Parameters

Instrument
  Precursor Tolerance
  Fragment Tolerance
  Max. Charge

Sequence Database
  Target/Decoy

Modification
  Fixed/Variable
  Amino-Acids
  Position
  Delta

Proteolytic Agent
  Motif

Peptide Candidates
  Termini Specificity
  Precursor Tolerance
  Missed cleavages
  Charge State Handling
  # 13C Peaks

Search Engines
  Mascot, X!Tandem
  OMSSA, MyriMatch
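A single engine-neutral parameter record along these lines might look like the sketch below. Every class, field name, and default is hypothetical and chosen only to mirror the parameter groups listed on the slide; it is not the meta-search tool's actual schema. The helper shows the Da vs ppm issue: a ppm tolerance scales with precursor mass, so it must be converted before comparison with an engine that expects Da.

```python
from dataclasses import dataclass, field

@dataclass
class Modification:
    name: str
    amino_acids: str          # residues the modification applies to
    delta: float              # mass shift in Da
    fixed: bool = True
    position: str = "any"     # e.g. "any", "N-term", "C-term" (assumed vocabulary)

@dataclass
class MetaSearchParams:
    precursor_tol: float
    precursor_unit: str       # "Da" or "ppm"
    fragment_tol_da: float
    max_charge: int = 3
    missed_cleavages: int = 1
    decoy: bool = True
    modifications: list = field(default_factory=list)
    engines: tuple = ("Mascot", "X!Tandem", "OMSSA", "MyriMatch")

    def precursor_tol_da(self, mass):
        """Precursor tolerance in Da for a precursor of the given mass (Da)."""
        if self.precursor_unit == "Da":
            return self.precursor_tol
        if self.precursor_unit == "ppm":
            return mass * self.precursor_tol / 1e6
        raise ValueError(f"unknown tolerance unit: {self.precursor_unit}")

params = MetaSearchParams(
    precursor_tol=10.0, precursor_unit="ppm", fragment_tol_da=0.5,
    modifications=[Modification("Carbamidomethyl", "C", 57.02146)],
)
print(params.precursor_tol_da(1000.0), params.precursor_tol_da(3000.0))
```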

Peptide Identification Meta-Search

Simple unified search interface for:
  Mascot, X!Tandem, OMSSA, MyriMatch

Automatic decoy searches

Automatic spectrum file "chunking"

Automatic scheduling
  Serial, Multi-Processor, Cluster, Grid
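Spectrum file "chunking" can be sketched as splitting a file into fixed-size groups of spectra that a scheduler then dispatches independently. The example assumes MGF-formatted input, where each spectrum sits between BEGIN IONS and END IONS lines; the function name and chunk size are hypothetical, and the slides do not say which formats or chunking policy the tool uses.

```python
def chunk_mgf(mgf_text, spectra_per_chunk):
    """Split MGF text into chunks holding at most spectra_per_chunk spectra each."""
    if spectra_per_chunk < 1:
        raise ValueError("spectra_per_chunk must be at least 1")
    spectra, current = [], None
    for line in mgf_text.splitlines():
        if line.strip() == "BEGIN IONS":
            current = [line]
        elif current is not None:
            current.append(line)
            if line.strip() == "END IONS":
                spectra.append("\n".join(current))
                current = None
    return [
        "\n".join(spectra[i:i + spectra_per_chunk]) + "\n"
        for i in range(0, len(spectra), spectra_per_chunk)
    ]

# Five toy spectra (made-up peaks) split into chunks of two.
toy = "".join(
    f"BEGIN IONS\nTITLE=scan{n}\nPEPMASS=500.{n}\n100.0 10\nEND IONS\n" for n in range(5)
)
chunks = chunk_mgf(toy, 2)
print([chunk.count("BEGIN IONS") for chunk in chunks])
```

Keeping the TITLE line inside each chunk is what makes it possible to track spectrum identifiers back to the original file after the per-chunk results are merged.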

Peptide Identification Meta-Search

NSF TeraGrid: 1000+ CPUs

UMIACS: 250+ CPUs

Edwards Lab: Scheduler & 48+ CPUs

Secure communication

Heterogeneous compute resources

Simple search request

Conclusions

Improve sensitivity of peptide identification
  Exhaustive peptide sequence databases
  Machine-learning for matching and combining
  Meta-search tools maximize consensus
  Grid-computing to achieve thorough search

Acknowledgements

Catherine Fenselau
  University of Maryland Biochemistry

Funding: NIH/NCI, USDA/ARS