
THE TITIN PROBLEM: HITCHHIKING SIBLINGS DURING PROTEIN INFERENCE

KYLE LUCKE, MAX THIBEAU, LEVI ZELL, JULIANUS PFEUFFER, XIAO LIANG, AND OLIVER SERANG ////////////// DEPARTMENT OF COMPUTER SCIENCE

8 \\ ACKNOWLEDGMENT

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number P20GM103546.

This material is based upon work supported by the National Science Foundation under grant no. 1845465.

NIH COBRE

NSF CAREER

6 \\ REFERENCES

1 \\ THE FIDO PROTEIN INFERENCE MODEL

2 \\ "HITCHHIKING" PROTEINS

3 \\ EVERGREENFOREST: A SOLVER FOR PROBABILISTIC LINEAR DIOPHANTINE EQUATIONS

7 \\ AVAILABILITY

Available free (MIT license) from https://bitbucket.org/orserang/evergreenforest

You can use the EvergreenForest library either as a header-only C++11 library or via the modeling language (run make in EvergreenForest/src/Language).

5 \\ TITIN SIMULATIONS

4 \\ PROTEIN INFERENCE MODELS

[Figure: a bipartite graph connecting proteins X1 and X2 to peptides Y1 and Y2, with peptide probabilities 0.98 and 0.9, alongside a table of each possible present-protein set and the probability that Y1 is absent given that set. The model parameters answer three questions: If a protein is present, how often will it make a peptide? How often does a peptide show up from error? A priori, what percent of proteins are actually present?]

The FIDO model, a Bayesian generalization of vertex-cover methods [1]:

We can introduce a change of variables to a "cardinal model":

[Figure: the bipartite graph extended with decoys: target proteins X1 and X2 (peptide probabilities 0.98 and 0.9) and Decoy X1 and Decoy X2 (0.03 and 0.01) connect to peptides Y1-Y4. Legend: actually absent, actually present, not actually considered.]

PMF (A,B) (0,0) [[0.15, 0.3, 0.1], [0.2, 0.3, 0.2]]
PMF (B,C) (0,0) [[0.3, 0.2], [0.4, 0.1]]
PMF (A,C) (-1,0) [[0.1, 0.2], [0.1, 0.3], [0.2, 0.1]]
PMF (D) (-1) [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]
D-C=A+B

#change to brute force:
@engine=brute_force()
Pr("BRUTE FORCE RESULTS")
Pr(A; B)
Pr( )

#change to LBP:
@engine=loopy(@dampening=0.05,@epsilon=1e-6,@max_iter=10000)
Pr("LOOPY RESULTS")
Pr(A;B)
Pr( )

Target proteins may "hitchhike" when they share peptides with a present target. This is more common than a target protein sharing peptides with a decoy protein. It can result in many identifications with a very low estimated FDR (all targets!), while in reality most of those targets are actually absent (a high true FDR!). Training for target-decoy discrimination may incentivize this bad result.

BRUTE FORCE RESULTS
A PMF:{[0] to [1]} t:[0.486486, 0.513514]
B PMF:{[0] to [1]} t:[0.432432, 0.567568]
Log probability of model: -6.06943
LOOPY RESULTS
A PMF:{[0] to [1]} t:[0.47552, 0.52448]
B PMF:{[0] to [1]} t:[0.416622, 0.583378]
Log probability of model: -5.92171


Automatic multithreading!

TRIOT: Every tensor operation is unrolled to the right number of for loops at runtime!

Lazy, trimmed convolution trees!

FIDO: The classic!
EPIFANY: Uses a prior on each N variable, which penalizes multiple proteins per peptide. [3]
Mutex: Strong prior on N variables: at most one protein allowed per peptide.

Target-decoy labels: The classic! Choose parameters that maximize the target-decoy discrimination and calibration.

Empirical Bayes: Choose the parameters that maximize the likelihood.

PAIRED WITH

[1] O SERANG, M MACCOSS, AND W NOBLE. "EFFICIENT MARGINALIZATION TO COMPUTE PROTEIN POSTERIOR PROBABILITIES FROM SHOTGUN MASS SPECTROMETRY DATA." JOURNAL OF PROTEOME RESEARCH 9.10 (2010): 5346-5357.
[2] O SERANG. "THE PROBABILISTIC CONVOLUTION TREE: EFFICIENT EXACT BAYESIAN INFERENCE FOR FASTER LC-MS/MS PROTEIN INFERENCE." PLOS ONE 9.3 (2014): E91507.
[3] J PFEUFFER, T SACHSENBERG, T DIJKSTRA, O SERANG, K REINERT, AND O KOHLBACHER. "EPIFANY - A METHOD FOR EFFICIENT HIGH-CONFIDENCE PROTEIN INFERENCE." (IN PREPARATION)

Cardinal models can be solved very efficiently using probabilistic convolution trees! [2] If we have n variables, each with states {0, 1, 2, ..., k-1}, we can find all posteriors simultaneously in subquadratic time.

Simulated using the following present proteins: sp|Q7L7L0|H2A3_HUMAN, sp|A6NJZ7|RIM3C_HUMAN, sp|P57059|SIK1_HUMAN, sp|Q5SQ80|A20A2_HUMAN, sp|Q99613|EIF3C_HUMAN, sp|P0DJD0|RGPD1_HUMAN, sp|Q5VU36|S31A5_HUMAN, sp|Q8WZ42|TITIN_HUMAN. These proteins have many shared peptides.

Avg. sensitivity @ <10% FDR:

                                  FIDO     EPIFANY   Mutex
Target-decoy parameter est.       0.162    0.156     0.156
Empirical Bayes parameter est.    0.163    0.183     0.170

These simulations show that target-decoy parameter estimation is less resistant to overfitting in the context of hitchhiking; likewise, they show that empirical Bayes paired with the EPIFANY and Mutex models, which are more hawkish about shared peptides, is less likely to overfit. Prototyping new models with EvergreenForest is easy!