
Improving the Sensitivity of Peptide Identification

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Xue Wu, Chau-Wen Tseng
Department of Computer Science
University of Maryland, College Park

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

Lost peptide identifications

Missing from the sequence database
  Build exhaustive peptide sequence databases

Search engine strengths, weaknesses, quirks
  Use multiple search engines and combine results

Poor score or statistical significance
  Use spectral-matching to identify weak spectra
  Use search-engine consensus to boost confidence
  Use machine-learning to distinguish true from false

Thorough search takes too long
  Harness the power of heterogeneous computational grids

Peptide Sequence Databases

All peptides at most 30 amino-acids long from:
  IPI and all IPI constituent protein sequences
    IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
  SwissProt variants, conflicts, splices, and signal peptide truncations
  GenBank and RefSeq mRNA sequence
    3-frame translation
  GenBank EST and HTC sequences
    6-frame translation, and found in at least 2 sequences

Grouped by UniGene cluster and compressed.

Formatted as a FASTA sequence database
  Easy integration with search engines

One entry per gene/cluster.

Automated rebuild every few months.
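The EST rule above (6-frame translation, keep a peptide only if it is found in at least 2 sequences) can be illustrated with a minimal Python sketch. This is not the pipeline used to build the databases described here: the function names, the `min_support` parameter, and the toy EST strings are assumptions made for illustration, and the sketch enumerates every stop-free substring of up to 30 residues with no compression.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame(dna):
    """Yield the six translations of a DNA string ('*' marks a stop codon)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons containing ambiguous bases translate to 'X'.
            yield "".join(CODON.get(codon, "X") for codon in codons)


def peptides(translation, max_len=30):
    """Return every stop-free, unambiguous substring of at most max_len residues."""
    found = set()
    for orf in translation.split("*"):
        for start in range(len(orf)):
            for end in range(start + 1, min(start + max_len, len(orf)) + 1):
                pep = orf[start:end]
                if "X" not in pep:
                    found.add(pep)
    return found


def supported_peptides(ests, min_support=2, max_len=30):
    """Keep peptides seen in at least min_support distinct EST sequences."""
    support = Counter()
    for est in ests:
        seen = set()
        for translation in six_frame(est):
            seen |= peptides(translation, max_len)
        support.update(seen)  # each EST counts at most once per peptide
    return {pep for pep, n in support.items() if n >= min_support}


# Toy ESTs (hypothetical): the first two share the reading frame M-A-K.
ests = ["ATGGCTAAA", "ATGGCTAAAGGG", "CCCCCC"]
shared = supported_peptides(ests)
print(sorted(shared, key=len, reverse=True)[:3])
```

Requiring support from two independent sequences is a simple guard against peptides that exist only because of a sequencing error in a single EST.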

Peptide Sequence Databases

Organism     Size (AA)   Size (Entries)
Human        209Mb       75,043
Mouse        151Mb       55,929
Rat          67Mb        43,211
Zebra-fish   90Mb        47,922

Spectral Matching with HMMs

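[Figure: Hidden Markov Model with ion states (b1, y1, b2, y2, b3, y3), insert states (I0 to I6), and delete states. Percentages shown in the figure: 11%, 17%, 6%, 94%, 8%, 0%, 11%, 86%, 17%, 0%, 6%, 92%, 19%.]

(m/z, intensity) pair emitted by ion and insert states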

Boosting Identification Sensitivity

[Figure labels: Test, Train, Other (high confidence ids only); Other, Model, None (low confidence ids)]

Spectral Matching of Peptide Variants

DFLAGGVAAAISK

DFLAGGIAAAISK
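The two sequences above differ by a single V to I substitution at position 7. As a worked example of why spectra of such variants remain largely comparable, the Python sketch below computes singly charged b- and y-ion m/z values for both peptides and reports which fragments move. The residue masses are standard monoisotopic values; the helper names are illustrative, and nothing here is taken from the spectral-matching model itself.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b-ion m/z values, b1 through b(n-1)."""
    total, ions = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(total + PROTON)
    return ions


def y_ions(peptide):
    """Singly charged y-ion m/z values, y1 through y(n-1)."""
    total, ions = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ions.append(total + WATER + PROTON)
    return ions


def shifted(ions_a, ions_b, tol=1e-6):
    """1-based indices of fragment ions whose m/z differs between two peptides."""
    return [i + 1 for i, (a, b) in enumerate(zip(ions_a, ions_b)) if abs(a - b) > tol]


reference, variant = "DFLAGGVAAAISK", "DFLAGGIAAAISK"
delta = RESIDUE_MASS["I"] - RESIDUE_MASS["V"]
shifted_b = shifted(b_ions(reference), b_ions(variant))
shifted_y = shifted(y_ions(reference), y_ions(variant))
print(f"V->I mass shift: {delta:+.5f} Da")
print("shifted b ions:", shifted_b)
print("shifted y ions:", shifted_y)
```

Only the fragments that contain position 7 (b7 to b12 and y7 to y12) shift, each by about 14.016 Da; b1 to b6 and y1 to y6 are unchanged.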

Spectral Matching Extrapolation

Comparison of search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

Searle et al. JPR 7(1), 2008

[Figure: overlap of identifications from X! Tandem, SEQUEST, and Mascot; values shown: 14%, 28%]

Combining search engine results – harder than it looks!

Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!

How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?

We apply unsupervised machine-learning...
  Lots of related work unified in a single framework.
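To make "consensus vs disagreement vs abstention" concrete, the Python sketch below tallies, for each spectrum, which engines agree on the most-voted peptide, which report a different peptide, and which report nothing. It is only a bookkeeping illustration under assumed inputs (one top-ranked peptide per engine per spectrum); it is not the machine-learning combiner described in these slides, and the function name and record layout are hypothetical.

```python
from collections import defaultdict


def tally_consensus(results, engines):
    """Summarize agreement per spectrum.

    results: iterable of (engine, spectrum_id, peptide) top-ranked assignments.
    engines: every engine that was run, so abstentions can be detected.
    """
    by_spectrum = defaultdict(dict)
    for engine, spectrum_id, peptide in results:
        by_spectrum[spectrum_id][engine] = peptide

    summary = {}
    for spectrum_id, calls in by_spectrum.items():
        votes = defaultdict(list)
        for engine, peptide in calls.items():
            votes[peptide].append(engine)
        # Most-voted peptide; ties go to the peptide that was reported first.
        best = max(votes, key=lambda pep: len(votes[pep]))
        summary[spectrum_id] = {
            "peptide": best,
            "agree": sorted(votes[best]),
            "disagree": sorted(e for e in calls if calls[e] != best),
            "abstain": sorted(e for e in engines if e not in calls),
        }
    return summary


engines = ["Mascot", "X!Tandem", "OMSSA"]
results = [
    ("Mascot", "s1", "PEPTIDEK"),
    ("X!Tandem", "s1", "PEPTIDEK"),
    ("OMSSA", "s1", "PEPTLDEK"),
    ("Mascot", "s2", "AAAK"),
]
summary = tally_consensus(results, engines)
for spectrum_id, row in summary.items():
    print(spectrum_id, row)
```

Counts like these could serve as input features for a classifier, but a simple vote count does not by itself answer the statistical-significance question raised above.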

Supervised Learning

Unsupervised Learning

PepArML Combining Results

[Figure: Q-TOF results]

Unsupervised Learning

[Plot: HC-TMO and U*-TMO; axis labels: False Positive Rate, Iteration]

Searching for Consensus

Search engine quirks can destroy consensus:
  Initial methionine loss as tryptic peptide
  Charge state enumeration or guessing
  X!Tandem's refinement mode
  Pyro-Gln, Pyro-Glu modifications
  Difficulty tracking spectrum identifiers
  Precursor mass tolerance (Da vs ppm)
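The Da vs ppm item is easy to underestimate: a ppm tolerance scales with mass, a Da tolerance does not, so two engines configured in different units consider different candidate peptides. The short Python sketch below shows the conversion; the function names and the example masses are illustrative only.

```python
def ppm_to_da(mass, ppm):
    """Absolute tolerance (Da) equivalent to a relative tolerance (ppm) at a given mass."""
    return mass * ppm / 1e6


def da_to_ppm(mass, da):
    """Relative tolerance (ppm) equivalent to an absolute tolerance (Da) at a given mass."""
    return da / mass * 1e6


# The same 10 ppm window is a different absolute width at every precursor mass.
for mass in (500.0, 1000.0, 2000.0, 4000.0):
    print(f"{mass:7.1f} Da: 10 ppm = {ppm_to_da(mass, 10):.4f} Da, "
          f"0.5 Da = {da_to_ppm(mass, 0.5):.0f} ppm")
```

There is no single Da setting that matches a ppm setting across the whole mass range, so the precursor window needs to be matched explicitly when engines are compared.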

Decoy searches must be identical!
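Decoy searches are typically used to estimate how many target identifications above a score cutoff are wrong, which is only meaningful if the decoy search used the same settings as the target search. The Python sketch below shows one common estimate (decoy count divided by target count above the cutoff, for separate target and decoy searches). It is a generic illustration, not the procedure used in these slides; the function name, the score orientation (higher is better), and the toy scores are assumptions.

```python
def decoy_score_cutoff(psms, max_rate=0.05):
    """Return the lowest score cutoff whose estimated error rate is <= max_rate.

    psms: iterable of (score, is_decoy) pairs, higher score meaning a better match.
    The estimate at a cutoff is (# decoy hits) / (# target hits) at or above it.
    Returns None if no cutoff satisfies the limit. Score ties are not handled specially.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_rate:
            cutoff = score
    return cutoff


# Toy scores (hypothetical): True marks a hit from the decoy search.
psms = [(9.0, False), (8.0, False), (7.0, False),
        (6.0, True), (5.0, False), (4.0, True)]
strict = decoy_score_cutoff(psms, max_rate=0.0)
relaxed = decoy_score_cutoff(psms, max_rate=0.25)
print("cutoff at 0% estimated error:", strict)
print("cutoff at 25% estimated error:", relaxed)
```

If the decoy search differs from the target search in any setting (tolerance, modifications, charge handling), the decoy count no longer models the target search's errors and the estimate is biased.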

Configuring for Consensus

Search engine configuration can be difficult:
  Correct spectral format
  Search parameter files and command-line
  Pre-processed sequence databases
  Tracking spectrum identifiers
  Extracting peptide identifications, especially modifications and protein identifiers

Peptide Identification Meta-Search Parameters

Instrument
  Precursor Tolerance, Fragment Tolerance, Max. Charge

Sequence Database
  Target/Decoy

Modification
  Fixed/Variable, Amino-Acids, Position, Delta

Proteolytic Agent
  Motif

Peptide Candidates
  Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks

Search Engines
  Mascot, X!Tandem, OMSSA, MyriMatch
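One way to picture a single engine-independent parameter set is as a small typed record that is validated once and then translated into each engine's native configuration. The Python sketch below is a hypothetical layout loosely following the parameter groups listed above; the class names, field names, motif syntax, and file name are all assumptions, not the actual meta-search format.

```python
from dataclasses import dataclass, field


@dataclass
class Modification:
    name: str
    delta: float            # mass shift in Da
    residues: str           # amino acids the modification applies to
    position: str = "any"   # hypothetical values: "any", "N-term", "C-term"
    fixed: bool = False     # fixed (True) or variable (False)


@dataclass
class MetaSearchParams:
    precursor_tolerance: float
    precursor_unit: str        # "Da" or "ppm"
    fragment_tolerance: float  # Da
    max_charge: int
    database: str
    decoy: bool                # also run a matching decoy search
    cleavage_motif: str
    missed_cleavages: int
    modifications: list = field(default_factory=list)
    engines: tuple = ("Mascot", "X!Tandem", "OMSSA", "MyriMatch")

    def validate(self):
        """Reject settings that no engine could interpret consistently."""
        if self.precursor_unit not in ("Da", "ppm"):
            raise ValueError(f"unknown precursor unit: {self.precursor_unit!r}")
        if self.precursor_tolerance <= 0 or self.fragment_tolerance <= 0:
            raise ValueError("tolerances must be positive")
        if self.max_charge < 1 or self.missed_cleavages < 0:
            raise ValueError("max_charge must be >= 1 and missed_cleavages >= 0")
        return self


params = MetaSearchParams(
    precursor_tolerance=10.0, precursor_unit="ppm",
    fragment_tolerance=0.5, max_charge=3,
    database="human_peptides.fasta",  # hypothetical file name
    decoy=True,
    cleavage_motif="[KR]",            # hypothetical motif syntax (cleave after K or R)
    missed_cleavages=1,
    modifications=[
        Modification("Carbamidomethyl", 57.02146, "C", fixed=True),
        Modification("Oxidation", 15.99491, "M"),
    ],
).validate()
print(params.engines)
```

Keeping the unit next to the tolerance, and the decoy flag next to the database, is meant to make it harder for target and decoy runs or different engines to drift apart.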

Peptide Identification Meta-Search

Simple unified search interface for:
  Mascot, X!Tandem, OMSSA, MyriMatch

Automatic decoy searches

Automatic spectrum file "chunking"

Automatic scheduling:
  Serial, Multi-Processor, Cluster, Grid
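Spectrum file "chunking" can be sketched as splitting a spectrum file into fixed-size groups of spectra, so that each group becomes an independent search job for whatever scheduler is available. The Python example below assumes MGF-style input, where each spectrum is delimited by `BEGIN IONS` and `END IONS`; the function names, the chunk size, and the toy spectra are illustrative, and the actual tool may chunk differently.

```python
def read_mgf_spectra(lines):
    """Yield each spectrum as a list of lines, from BEGIN IONS through END IONS."""
    block = None
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            block = [line]
        elif block is not None:
            block.append(line)
            if stripped == "END IONS":
                yield block
                block = None


def chunk_spectra(spectra, chunk_size):
    """Group spectra into lists of at most chunk_size, preserving order."""
    chunk = []
    for spectrum in spectra:
        chunk.append(spectrum)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Five toy spectra (hypothetical peak values) in MGF-style text.
mgf_text = "\n".join(
    f"BEGIN IONS\nTITLE=scan{i}\nPEPMASS=500.{i}\n100.0 10\nEND IONS"
    for i in range(1, 6)
)
chunks = list(chunk_spectra(read_mgf_spectra(mgf_text.splitlines()), chunk_size=2))
for number, chunk in enumerate(chunks, start=1):
    titles = [line for spectrum in chunk for line in spectrum if line.startswith("TITLE=")]
    print(f"chunk {number}: {titles}")
```

Because every chunk keeps the original `TITLE` lines, results from separate jobs can be merged back by spectrum identifier, which is the tracking problem noted earlier.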

Peptide Identification Meta-Search

NSF TeraGrid: 1000+ CPUs

UMIACS: 250+ CPUs

Edwards Lab: Scheduler & 48+ CPUs

Secure communication

Heterogeneous compute resources

Simple search request

Conclusions

Improve sensitivity of peptide identification:
  Exhaustive peptide sequence databases
  Machine-learning for matching and combining
  Meta-search tools maximize consensus
  Grid-computing to achieve thorough search

Acknowledgements

Catherine Fenselau University of Maryland Biochemistry

Funding: NIH/NCI, USDA/ARS
