Protein Prediction II Exercise - ROSTLAB.ORG

L. Richter

Task Schedule

•  17.10: Orga / group formation; accounts for i12r-biolab-machines
•  24.10: Get familiar with the HPO database and the test set sequences; data preparation
•  31.10: Find most similar sequences (with n different methods: Blast, HHBlits); write scripts with parameter n
•  7.11: Extract HPO paths for each sequence; design data structures to store and manipulate paths and trees
•  14.11: Merge paths into trees
•  21.11: Merge trees from different neighbors
•  28.11: Implement performance measures
•  5.12: Complete parameter estimation; make final predictions
•  10.12 (Tuesday): Midterm review / handover of predictions for CAFA submission
•  19.12: Evaluation of improvements
•  9.1: Integrate on meta-server platform
•  16.1: Server integration
•  23.1: Final presentation
•  30.1: Hints for publication writing
•  4.2: Exam

Performance Measurement (Nov 28th)

•  Use the same performance measures as described in Radivojac et al., Nature Methods 10(3), March 2013, pp. 221-230; doi:10.1038/nmeth.2340
•  Read also Hamp et al., BMC Bioinformatics 2013, 14(Suppl 3):S7. http://www.biomedcentral.com/1471-2105/14/S3/S7
•  Implement the measures for precision and recall
•  Try to run with threshold steps of 0.1 and construct a precision/recall curve
•  Get a sufficient number of data points for the curve
•  E.g. choose n high (10/20) and use a fine-grained score for the terms
•  Iterate over different values for the different parameters
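The threshold sweep described above could be sketched as follows. This is only an illustration under stated assumptions: the names `pr_curve`, `scores` and `truth` are hypothetical, predictions are assumed to be stored as a score in [0, 1] per protein and term, and the averaging (precision over proteins with at least one prediction at the threshold, recall over all proteins) follows the convention of Radivojac et al.

```python
from typing import Dict, List, Set, Tuple

Scores = Dict[str, Dict[str, float]]  # protein -> {term: score}, hypothetical layout
Truth = Dict[str, Set[str]]           # protein -> experimentally determined terms


def thresholds(step_count: int = 10) -> List[float]:
    """Evenly spaced thresholds 0, 1/step_count, ..., 1 (step 0.1 by default)."""
    # Integer division of the index avoids accumulating floating-point error.
    return [k / step_count for k in range(step_count + 1)]


def pr_curve(scores: Scores, truth: Truth,
             step_count: int = 10) -> List[Tuple[float, float, float]]:
    """Return (threshold, mean precision, mean recall) points for the curve."""
    points = []
    n = len(truth)
    for t in thresholds(step_count):
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            predicted = {f for f, s in scores.get(protein, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision is undefined without predictions
                precisions.append(hits / len(predicted))
            if true_terms:
                recall_sum += hits / len(true_terms)
        if precisions:  # skip thresholds at which nothing is predicted
            points.append((t, sum(precisions) / len(precisions), recall_sum / n))
    return points
```

A larger `step_count` (e.g. 100) gives more data points, but only helps if the term scores are fine-grained enough to change between neighboring thresholds.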

Performance Measurement

•  Precision:

$$pr_i(t) = \frac{\sum_f I\big(f \in P_i(t) \wedge f \in T_i\big)}{\sum_f I\big(f \in P_i(t)\big)}$$

•  Recall:

$$rc_i(t) = \frac{\sum_f I\big(f \in P_i(t) \wedge f \in T_i\big)}{\sum_f I\big(f \in T_i\big)}$$

•  t: threshold, i.e. probability of being true, 0 ≤ t ≤ 1
•  f: a functional term from an ontology
•  i: a given target protein
•  T_i: the set of experimentally determined terms for protein i
•  P_i(t): the set of predicted terms for protein i with a score greater than or equal to t
•  I(): the standard indicator function
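For a single protein, the two definitions reduce to set operations. The following is a minimal sketch, not a reference implementation: the function name `precision_recall`, the dictionary layout and the placeholder term IDs are assumptions made for illustration, and returning `None` for an undefined ratio (empty denominator) is a design choice of this sketch.

```python
from typing import Dict, Optional, Set, Tuple


def precision_recall(pred_scores: Dict[str, float], true_terms: Set[str],
                     t: float) -> Tuple[Optional[float], Optional[float]]:
    """Compute (pr_i(t), rc_i(t)) for one protein i.

    pred_scores maps each predicted term f to its score;
    true_terms is T_i, the experimentally determined terms.
    """
    predicted = {f for f, score in pred_scores.items() if score >= t}  # P_i(t)
    hits = len(predicted & true_terms)  # terms in both P_i(t) and T_i
    precision = hits / len(predicted) if predicted else None
    recall = hits / len(true_terms) if true_terms else None
    return precision, recall


# Placeholder term IDs, not real HPO identifiers.
example_scores = {"HP:A": 0.8, "HP:B": 0.3, "HP:C": 0.6}
example_truth = {"HP:A", "HP:D"}
print(precision_recall(example_scores, example_truth, 0.5))  # (0.5, 0.5)
```

At t = 0.5 the predicted set is {HP:A, HP:C}, of which only HP:A is correct, so one of two predictions and one of two true terms are hit.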

Complex Measures – Performance Curves

Recall Precision Curves

Taken from http://scikit-learn.github.io/scikit-learn.org/

•  Preferred in information retrieval
•  Positives are the documents retrieved in response to a query
•  True positives are the documents really relevant to the query
•  y-axis: precision; x-axis: recall

Richter, Cejuela Technische Universität München

MiniTalk3

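Performance Measurement

•  Combine both numbers into one:

•  Fmax:

$$F_{\max} = \max_t \left\{ \frac{2 \cdot pr(t) \cdot rc(t)}{pr(t) + rc(t)} \right\}$$

where pr(t) and rc(t) are the per-protein values pr_i(t) and rc_i(t) averaged over the target proteins (see Radivojac et al.).

•  Optional: do the term-centric metrics, i.e. sensitivity and specificity for each term f, summed over the proteins i:

$$sn_f(t) = \frac{\sum_i I\big(f \in P_i(t) \wedge f \in T_i\big)}{\sum_i I\big(f \in T_i\big)}$$

$$sp_f(t) = \frac{\sum_i I\big(f \notin P_i(t) \wedge f \notin T_i\big)}{\sum_i I\big(f \notin T_i\big)}$$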
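A possible way to compute Fmax and the optional term-centric metrics is sketched below. It assumes that the averaged precision and recall have already been computed per threshold and are passed in as (t, pr, rc) tuples; the function names and the data layout (`scores` as protein -> {term: score}, `truth` as protein -> set of terms) are hypothetical choices for this illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def f_max(curve: Iterable[Tuple[float, float, float]]
          ) -> Tuple[float, Optional[float]]:
    """Return (Fmax, threshold at which it is reached) from (t, pr, rc) points."""
    best_f, best_t = 0.0, None
    for t, pr, rc in curve:
        if pr + rc == 0.0:
            continue  # harmonic mean is undefined here, treat as F = 0
        f = 2.0 * pr * rc / (pr + rc)
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


def sensitivity_specificity(term: str,
                            scores: Dict[str, Dict[str, float]],
                            truth: Dict[str, Set[str]],
                            t: float) -> Tuple[Optional[float], Optional[float]]:
    """Term-centric sn_f(t) and sp_f(t) for one term f over all proteins."""
    tp = fn = tn = fp = 0
    for protein, true_terms in truth.items():
        score = scores.get(protein, {}).get(term)
        predicted = score is not None and score >= t  # f in P_i(t)
        annotated = term in true_terms                # f in T_i
        if annotated:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    sn = tp / (tp + fn) if (tp + fn) else None  # undefined without positives
    sp = tn / (tn + fp) if (tn + fp) else None  # undefined without negatives
    return sn, sp
```

Returning the threshold together with Fmax makes it easier to see which operating point a given parameter setting favors when comparing runs.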