Consistent probabilistic outputs for protein function prediction

William Stafford Noble, Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington


Page 1: Consistent probabilistic outputs for protein function prediction

Consistent probabilistic outputs for protein function prediction

William Stafford Noble

Department of Genome Sciences

Department of Computer Science and Engineering

University of Washington

Page 2: Consistent probabilistic outputs for protein function prediction

Outline

• Motivation and background

• Methods
  – Shared base method
  – Reconciliation methods

• Results

Page 3: Consistent probabilistic outputs for protein function prediction

The problem

Given:
• protein sequence,
• knockout phenotype,
• gene expression profile,
• protein-protein interactions, and
• phylogenetic profile

Predict:
• a probability for every term in the Gene Ontology

• Heterogeneous data
• Missing data
• Multiple labels per gene
• Structured output

Page 4: Consistent probabilistic outputs for protein function prediction

Consistent predictions

Cytoplasmic membrane-bound vesicle (GO:0016023)

is a

Cytoplasmic vesicle (GO:0031410)

The probability that protein X is annotated with “cytoplasmic membrane-bound vesicle” must be less than or equal to the probability that it is annotated with “cytoplasmic vesicle,” because every annotation to the child term implies an annotation to its parent.
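The consistency requirement can be illustrated with a small check. This is only a sketch: the function name `find_violations`, the `(child, parent)` edge representation, and the probability values are assumptions made for illustration, not part of the talk.

```python
def find_violations(prob, is_a_edges, tol=1e-12):
    """Return every (child, parent) pair whose probabilities break child <= parent."""
    return [
        (child, parent)
        for child, parent in is_a_edges
        if prob[child] > prob[parent] + tol
    ]


# "is a" edges written as (child, parent), using the two terms from the slide.
is_a_edges = [("GO:0016023", "GO:0031410")]

# Made-up probabilities for a single protein.
consistent = {"GO:0016023": 0.30, "GO:0031410": 0.45}
inconsistent = {"GO:0016023": 0.60, "GO:0031410": 0.45}

print(find_violations(consistent, is_a_edges))    # []
print(find_violations(inconsistent, is_a_edges))  # [('GO:0016023', 'GO:0031410')]
```

A set of predictions is consistent in the sense of this slide exactly when the returned list is empty for every protein.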

Page 5: Consistent probabilistic outputs for protein function prediction

Data sets

Page 6: Consistent probabilistic outputs for protein function prediction
Page 7: Consistent probabilistic outputs for protein function prediction

Kernels

Page 8: Consistent probabilistic outputs for protein function prediction

SVM → Naïve Bayes

[Diagram: each data set (Data 1 through Data 33) feeds its own SVM with an asymmetric Laplace fit (SVM/AL 1 through SVM/AL 33), yielding Probability 1 through Probability 33. These are combined by a product, plus Bayes’ rule, into a single probability. Inset labels: Gaussian, Asymmetric Laplace.]
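The "product, plus Bayes' rule" combination might look roughly like the sketch below. It assumes a three-parameter asymmetric Laplace density (a mode with separate decay rates on each side) fitted separately to the SVM outputs of positive and negative proteins; the parameterization, the function names, and all numbers are illustrative assumptions rather than details given in the talk.

```python
import numpy as np


def asym_laplace_logpdf(x, mode, left_rate, right_rate):
    """Log density of an asymmetric Laplace with different decay on each side of the mode."""
    x = np.asarray(x, dtype=float)
    log_norm = np.log(left_rate * right_rate / (left_rate + right_rate))
    decay = np.where(x <= mode, left_rate * (mode - x), right_rate * (x - mode))
    return log_norm - decay


def naive_bayes_posterior(svm_outputs, pos_params, neg_params, prior_pos):
    """Combine one SVM output per data set, assuming independence given the label."""
    log_pos = np.log(prior_pos)
    log_neg = np.log1p(-prior_pos)
    for x, pos, neg in zip(svm_outputs, pos_params, neg_params):
        log_pos += asym_laplace_logpdf(x, *pos)  # product of likelihoods, in log space
        log_neg += asym_laplace_logpdf(x, *neg)
    # Bayes' rule: posterior = pos / (pos + neg), written as a sigmoid of the log ratio.
    return float(1.0 / (1.0 + np.exp(log_neg - log_pos)))


# Hypothetical fitted parameters (mode, left_rate, right_rate) for two data sets.
pos_params = [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]
neg_params = [(-1.0, 1.0, 1.0), (-1.0, 1.0, 1.0)]
posterior = naive_bayes_posterior([1.0, 1.0], pos_params, neg_params, prior_pos=0.5)
print(posterior)
```

Working in log space avoids underflow when the product runs over many data sets.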

Page 9: Consistent probabilistic outputs for protein function prediction

SVM → logistic regression

[Diagram components, in order of appearance: Data 1 through Data 33; SVM 1 through SVM 33; Logistic regressor 1, 2, 3 and 11; Predict 1 through Predict 33; Probability.]
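One way to read this pipeline is that the SVM outputs for a term become the inputs of a logistic regression that returns a probability. The sketch below assumes the per-data-set SVM outputs are stacked into one feature vector per protein and fits the regression by plain gradient descent; the function names, the small L2 penalty, and the toy data are assumptions for illustration only.

```python
import numpy as np


def _with_bias(svm_outputs):
    """Append a constant column so the model has an intercept."""
    X = np.asarray(svm_outputs, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_logistic(svm_outputs, labels, n_steps=5000, lr=0.5, l2=1e-3):
    """Fit weights of p(y=1 | x) = sigmoid(w . [x, 1]) by gradient descent on the log loss."""
    Xb = _with_bias(svm_outputs)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * (Xb.T @ (p - y) / len(y) + l2 * w)
    return w


def predict_proba(svm_outputs, w):
    """Map SVM outputs to probabilities with the fitted weights."""
    return 1.0 / (1.0 + np.exp(-_with_bias(svm_outputs) @ w))


# Toy example: six proteins, two SVM outputs each (hypothetical values).
X = [[-2.0, -1.0], [-1.0, -1.5], [-0.5, 0.2], [0.5, -0.2], [1.0, 1.5], [2.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
probs = predict_proba(X, w)
print(np.round(probs, 3))
```

These per-term probabilities are the unreconciled estimates that the reconciliation methods later take as input.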

Page 10: Consistent probabilistic outputs for protein function prediction

Reconciliation Methods

• 3 heuristic methods

• 3 Bayesian networks

• 1 cascaded logistic regression

• 3 projection methods

Page 11: Consistent probabilistic outputs for protein function prediction

Heuristic methods

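• Max: Report the maximum probability of self and all descendants.

$p_i = \max_{j \in D_i} \hat{p}_j$

• And: Report the product of probabilities of all ancestors and self.

$p_i = \prod_{j \in A_i} \hat{p}_j$

• Or: Compute the probability that at least one descendant of the GO term is “on,” assuming independence.

$1 - p_i = \prod_{j \in D_i} (1 - \hat{p}_j)$

Here $\hat{p}_j$ is the unreconciled estimate for term $j$, $D_i$ is term $i$ together with its descendants, and $A_i$ is term $i$ together with its ancestors.

• All three methods use probabilities estimated by logistic regression.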
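The three heuristics can be sketched on a toy ontology as follows. The graph, the term names, and the probability values are made up for illustration; the `parents` dictionary format and the helper names are assumptions.

```python
from math import prod


def closure(term, neighbors):
    """Return `term` plus everything reachable from it through `neighbors`."""
    seen, stack = {term}, [term]
    while stack:
        for nxt in neighbors.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reconcile(p_hat, parents):
    """Apply the Max, And, and Or heuristics to unreconciled estimates p_hat."""
    children = {}
    for child, pars in parents.items():
        for par in pars:
            children.setdefault(par, []).append(child)
    out = {"max": {}, "and": {}, "or": {}}
    for term in p_hat:
        down = closure(term, children)  # term and all its descendants
        up = closure(term, parents)     # term and all its ancestors
        out["max"][term] = max(p_hat[j] for j in down)
        out["and"][term] = prod(p_hat[j] for j in up)
        out["or"][term] = 1.0 - prod(1.0 - p_hat[j] for j in down)
    return out


# Diamond-shaped toy ontology: root -> a, root -> b, and c has parents a and b.
parents = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
p_hat = {"root": 0.5, "a": 0.6, "b": 0.2, "c": 0.4}
result = reconcile(p_hat, parents)
for name, values in result.items():
    print(name, {t: round(v, 3) for t, v in values.items()})
```

In this example all three outputs satisfy the child-less-than-parent constraint, even though the inputs do not (the estimate for `a` exceeds that of `root`).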

Page 12: Consistent probabilistic outputs for protein function prediction

Bayesian network

• Belief propagation on a graphical model with the topology of the GO.

• Given Yi, the distribution of each SVM output Xi is modeled as an independent asymmetric Laplace distribution.

• Solved using a variational inference algorithm.

• “Flipped” variant: reverse the directionality of edges in the graph.

Page 13: Consistent probabilistic outputs for protein function prediction

Cascaded logistic regression

• Fit a logistic regression to the SVM output only for those proteins that belong to all parent terms.

• Models the conditional distribution of the term, given all parents.

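• The final probability is the product of these conditionals:

$p_i = \prod_{j \in A_i} \tilde{p}_j$, where $\tilde{p}_j$ is the fitted conditional probability for term $j$ and $A_i$ is term $i$ together with its ancestors.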

Page 14: Consistent probabilistic outputs for protein function prediction

Isotonic regression

$$\min_{p_i,\, i \in I} \; \sum_{i \in I} (p_i - \hat{p}_i)^2 \quad \text{s.t.} \quad p_j \le p_i \;\; \forall (i, j) \in E$$

• Consider the squared Euclidean distance between two sets of probabilities.

• Find the closest set of probabilities to the logistic regression values that satisfy all the inequality constraints.
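The projection described above is a quadratic program with one inequality per ontology edge. The sketch below solves it with Dykstra's alternating projection algorithm, which is a generic technique swapped in here for illustration and not necessarily the solver used in the talk. The function name, the `(child, parent)` edge format, and the toy numbers are assumptions.

```python
def isotonic_projection(p_hat, edges, n_sweeps=500):
    """Approximate the closest point (in squared Euclidean distance) to p_hat
    that satisfies p[child] <= p[parent] for every (child, parent) edge."""
    p = dict(p_hat)
    # Dykstra keeps one correction term per constraint; each constraint only
    # touches two coordinates, so a pair of floats per edge is enough.
    corr = [(0.0, 0.0)] * len(edges)
    for _ in range(n_sweeps):
        for k, (child, parent) in enumerate(edges):
            yc = p[child] + corr[k][0]
            yp = p[parent] + corr[k][1]
            if yc > yp:
                # Projection onto the half-space child <= parent: pool the two values.
                new_c = new_p = 0.5 * (yc + yp)
            else:
                new_c, new_p = yc, yp
            corr[k] = (yc - new_c, yp - new_p)
            p[child], p[parent] = new_c, new_p
    return p


# Toy ontology: root has two children, a and b (made-up probabilities).
edges = [("a", "root"), ("b", "root")]
p_hat = {"root": 0.0, "a": 0.9, "b": 0.6}
p = isotonic_projection(p_hat, edges)
print({t: round(v, 4) for t, v in p.items()})  # all three values end up close to 0.5
```

In this example both children violate the constraint, so the closest consistent solution pools all three terms at their mean.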

Page 15: Consistent probabilistic outputs for protein function prediction

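Isotonic regression

$$\min_{p_i,\, i \in I} \; \sum_{i \in I} (p_i - \hat{p}_i)^2 \quad \text{s.t.} \quad p_j \le p_i \;\; \forall (i, j) \in E$$

$$\min_{p_i,\, i \in I} \; \sum_{i \in I} D(p_i, \hat{p}_i) \quad \text{s.t.} \quad p_j \le p_i \;\; \forall (i, j) \in E$$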

• Consider the squared Euclidean distance between two sets of probabilities.

• Find the closest set of probabilities to the logistic regression values that satisfy all the inequality constraints.

Page 16: Consistent probabilistic outputs for protein function prediction

Kullback-Leibler projection

• Kullback-Leibler projection onto the set of distributions that factorize according to the ontology graph.

• Two variants, depending on the directions of the edges.

Page 17: Consistent probabilistic outputs for protein function prediction

Hybrid method

• Replace the Bayesian log posterior for Yi by the marginal log posterior obtained from the logistic regression.

• Uses discriminative posteriors from logistic regression, but still uses a structural prior.

[Diagram labels: BPAL, KLP, BPLR; likelihood ratios obtained from logistic regression]

Page 18: Consistent probabilistic outputs for protein function prediction

Axes of evaluation

• Ontology
  – biological process
  – cellular component
  – molecular function

• Term size
  – 3-10 proteins
  – 11-30 proteins
  – 31-100 proteins
  – 101-200 proteins

• Evaluation mode
  – Joint evaluation
  – Per protein
  – Per term

• Recall
  – 1%
  – 10%
  – 50%
  – 80%

Page 19: Consistent probabilistic outputs for protein function prediction

Legend

Belief propagation, asymmetric Laplace

Belief propagation, asymmetric Laplace, flipped

Belief propagation, logistic regression

Cascaded logistic regression

Isotonic regression

Logistic regression

Kullback-Leibler projection

Kullback-Leibler projection, flipped

Naïve Bayes, asymmetric Laplace

Page 20: Consistent probabilistic outputs for protein function prediction

[Figure: precision, TP / (TP+FP), plotted against recall, TP / (TP+FN). Joint evaluation, biological process ontology, large terms (101-200).]
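Precision at a fixed recall level can be computed from a ranked list of predictions, as in the sketch below. It assumes that joint evaluation pools all protein-term pairs into one ranking and ignores tied scores; the function name and the toy scores are illustrative assumptions.

```python
import numpy as np


def precision_at_recall(scores, labels, target_recall):
    """Precision TP/(TP+FP) at the first rank where recall TP/(TP+FN) reaches the target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1 - hits)
    recall = tp / hits.sum()
    precision = tp / (tp + fp)
    idx = np.searchsorted(recall, target_recall, side="left")
    return float(precision[idx])


# Made-up scores for six pooled protein-term pairs, four of which are true annotations.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 1]
print(precision_at_recall(scores, labels, 0.25))  # 1.0
print(precision_at_recall(scores, labels, 0.50))  # 2 of the top 3 are correct
```

Because recall never decreases along the ranking, a binary search is enough to locate the cutoff.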

Page 21: Consistent probabilistic outputs for protein function prediction

Biological process ontology

Page 22: Consistent probabilistic outputs for protein function prediction

Molecular function ontology

Page 23: Consistent probabilistic outputs for protein function prediction

Cellular component ontology

Page 24: Consistent probabilistic outputs for protein function prediction

Conclusions: Joint evaluation

• Reconciliation does not always help.

• Isotonic regression performs well overall, especially for recall > 20%.

• For lower recall values, both Kullback-Leibler projection methods work well.

Page 25: Consistent probabilistic outputs for protein function prediction

Average precision per protein

Biological process

All term sizes
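The per-protein and per-term modes differ only in which axis of the protein-by-term score matrix is ranked. The sketch below assumes average precision is the mean of the precision values at the rank of each true annotation, and that rows or columns without any positive are skipped; both the function names and these conventions are assumptions for illustration.

```python
import numpy as np


def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive appears."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    hits = labels[np.argsort(-scores, kind="stable")]
    if hits.sum() == 0:
        return float("nan")  # undefined without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())


def mean_average_precision(score_matrix, label_matrix, mode):
    """mode='protein' ranks terms within each row; mode='term' ranks proteins within each column."""
    S = np.asarray(score_matrix, dtype=float)
    Y = np.asarray(label_matrix, dtype=int)
    if mode == "term":
        S, Y = S.T, Y.T
    return float(np.nanmean([average_precision(s, y) for s, y in zip(S, Y)]))


# Rows are proteins, columns are GO terms (made-up values).
S = [[0.9, 0.8], [0.7, 0.6]]
Y = [[0, 1], [1, 0]]
print(mean_average_precision(S, Y, mode="protein"))  # 0.75
print(mean_average_precision(S, Y, mode="term"))     # 0.75
```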

Page 26: Consistent probabilistic outputs for protein function prediction

Biological process

Page 27: Consistent probabilistic outputs for protein function prediction

Statistical significance

Biological process

Large terms

Page 28: Consistent probabilistic outputs for protein function prediction

Biological process

Large terms

Page 29: Consistent probabilistic outputs for protein function prediction

Biological process

[Figure: panels by term size (101-200, 31-100, 11-30, 3-10); panel annotations: 953 proteins, 435 proteins, 239 proteins, 100 proteins]

Page 30: Consistent probabilistic outputs for protein function prediction

Molecular function

[Figure: panels by term size (101-200, 31-100, 11-30, 3-10); panel annotations: 476 proteins, 142 proteins, 111 proteins, 35 proteins]

Page 31: Consistent probabilistic outputs for protein function prediction

Cellular component

[Figure: panels by term size (101-200, 31-100, 11-30, 3-10); panel annotations: 196 proteins, 135 proteins, 171 proteins, 278 proteins]

Page 32: Consistent probabilistic outputs for protein function prediction

Conclusions: per protein

• Several methods perform well
  – Unreconciled logistic regression
  – Unreconciled naïve Bayes
  – Isotonic regression
  – Belief propagation with asymmetric Laplace

• For small terms
  – For molecular function and biological process, we do not observe many significant differences.
  – For cellular components, belief propagation with logistic regression works well.

Page 33: Consistent probabilistic outputs for protein function prediction

Average precision per term

Biological process

All term sizes

Page 34: Consistent probabilistic outputs for protein function prediction

Biological process

[Figure: panels by term size (101-200, 31-100, 11-30, 3-10); panel annotations: 953 terms, 435 terms, 239 terms, 100 terms]

Page 35: Consistent probabilistic outputs for protein function prediction

Molecular function

[Figure: panels by term size (101-200, 31-100, 11-30, 3-10); panel annotations: 476 terms, 142 terms, 111 terms, 35 terms]

Page 36: Consistent probabilistic outputs for protein function prediction

Cellular component

[Figure: panels by term size (101-200, 31-100, 11-30, 3-10); panel annotations: 152 terms, 97 terms, 48 terms, 30 terms]

Page 37: Consistent probabilistic outputs for protein function prediction

Conclusions

• Reconciliation does not always help.

• Isotonic regression (IR) performs well overall.

• For small biological process and molecular function terms, it is less clear that IR is one of the best methods.

Page 38: Consistent probabilistic outputs for protein function prediction

Acknowledgments

Guillaume Obozinski

Charles Grant

Gert Lanckriet

Michael Jordan

The mousefunc organizers
• Tim Hughes
• Lourdes Pena-Castillo
• Fritz Roth
• Gabriel Berriz
• Frank Gibbons

Page 39: Consistent probabilistic outputs for protein function prediction

Per term for small terms

Biological process

Molecular function

Cellular component