PERFORMANCE MEASURES
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS
Technical University of Denmark - DTU, Department of Systems Biology
PERFORMANCE MEASURES of predictive systems
Monday, 7 June 2010
DATA TYPES
Binary
Data point  Value
A           FALSE
B           TRUE
C           TRUE
D           FALSE
E           FALSE
F           TRUE
G           FALSE

Real value
Data point  Value
a           32.3
b           11.2
c           2.1
d           111.1
e           1313
f           0.65
g           72.8
ACCURACY
[Figure: two pie charts of Category A against Category B, one split 46% / 54% and one split 1% / 99%]
MATTHEWS CORRELATION COEFFICIENT
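[Figure: 2 x 2 grid of Prediction (A / Not A) against Experimental assignment (A / Not A), with cells TP, FP, FN and TN]
Accuracy = 21/25 = 84%
PPV = 20/23 = 87%
Sens = TP/AP
Spec = TN/AN
$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{PP \cdot AN \cdot AP \cdot PN}}$
Values shown on the slide: 0.39, 0.95, 0.25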
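The measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes AP and AN are the actual positives and negatives and PP and PN the predicted positives and negatives, and the function name `confusion_metrics` is hypothetical. The counts are taken from the first row of the threshold table later in these slides (TP = 4, FN = 10, FP = 1, TN = 12). Returning 0 when the CC denominator is zero is a common convention, not something the slides state.

```python
from math import sqrt


def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, Sens, Spec, PPV and CC for one 2 x 2 table."""
    ap = tp + fn  # actual positives (assumed meaning of AP)
    an = fp + tn  # actual negatives (assumed meaning of AN)
    pp = tp + fp  # predicted positives (assumed meaning of PP)
    pn = fn + tn  # predicted negatives (assumed meaning of PN)
    denom = sqrt(pp * an * ap * pn)
    return {
        "accuracy": (tp + tn) / (ap + an),
        "sens": tp / ap,
        "spec": tn / an,
        "ppv": tp / pp,
        # Convention chosen here: CC is 0 when any margin is empty.
        "cc": (tp * tn - fn * fp) / denom if denom else 0.0,
    }


# Counts from the first row of the threshold table in these slides.
metrics = confusion_metrics(tp=4, fp=1, fn=10, tn=12)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Sens and 1 - Spec computed this way round to the 0.29 and 0.08 listed for that threshold in the table.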
SENSITIVITY/SPECIFICITY
[Figure: A and Not A split into TP, FP, FN and TN]
Sens = TP/AP
Spec = TN/AN
$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{PP \cdot AN \cdot AP \cdot PN}}$
Values shown on the first view of the slide: 0.39, 0.95, 0.25, 0.25
Values shown on the second view of the slide: 0.75, 0.43, 0.25, 0.37
FROM REAL LIFE
REAL VALUE
[Figure: plot of measured affinity against predicted affinity, both axes from 0 to 1]
$PCC = \frac{\sum_i (a_i - \bar{a})(p_i - \bar{p})}{\sqrt{\sum_i (a_i - \bar{a})^2}\,\sqrt{\sum_i (p_i - \bar{p})^2}}$
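A small Python sketch of the PCC formula may help make the terms concrete. It assumes that a_i is the measured affinity and p_i the predicted affinity, which the slide does not spell out; the function name `pearson` and the toy numbers are invented for illustration only.

```python
from math import sqrt


def pearson(a, p):
    """Pearson correlation between two equally long lists of numbers."""
    if len(a) != len(p) or len(a) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    a_mean = sum(a) / len(a)
    p_mean = sum(p) / len(p)
    # Numerator: sum of products of deviations from the means.
    cov = sum((ai - a_mean) * (pi - p_mean) for ai, pi in zip(a, p))
    # Denominator: product of the root sums of squared deviations.
    ss_a = sum((ai - a_mean) ** 2 for ai in a)
    ss_p = sum((pi - p_mean) ** 2 for pi in p)
    if ss_a == 0 or ss_p == 0:
        raise ValueError("PCC is undefined when one list is constant")
    return cov / (sqrt(ss_a) * sqrt(ss_p))


# Hypothetical toy values, not data from the slides.
measured = [0.1, 0.4, 0.5, 0.8, 0.9]
predicted = [0.2, 0.3, 0.6, 0.7, 0.95]
pcc = pearson(measured, predicted)
print(f"PCC = {pcc:.2f}")
```

A perfect linear relation gives 1, a perfectly inverted one gives -1, and the value does not depend on the scale of either axis.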
ROC CURVES
[Figure: measured affinity against predicted affinity (both axes 0 to 1) with the regions TP, TN, AP and AN marked, and ROC curves of Sens against 1 - spec labelled AUC = 0.5 and AUC = 0.8]
Sens = TP/AP
Spec = TN/AN
$AUC = \int_0^1 f(x)\,dx$
where x = 1 - specificity (false positive rate) and f(x) is the sensitivity (true positive rate).
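One way to read this slide is as a threshold sweep: fix which data points count as actual positives, then lower the cutoff on the predicted value step by step and record (1 - Spec, Sens) each time. The sketch below follows that reading. The function name `roc_points`, the `binder_cutoff` of 0.5 on the measured value, the use of `>=` at each cutoff and the toy numbers are all assumptions made for illustration.

```python
def roc_points(measured, predicted, binder_cutoff=0.5):
    """Return ROC points (false positive rate, true positive rate)."""
    # Assumption: a data point is an actual positive when its measured
    # value exceeds binder_cutoff.
    actual = [m > binder_cutoff for m in measured]
    ap = sum(actual)          # actual positives
    an = len(actual) - ap     # actual negatives
    if ap == 0 or an == 0:
        raise ValueError("need at least one positive and one negative")
    points = [(0.0, 0.0)]
    # Use every observed prediction as a cutoff, from strict to lenient.
    for cutoff in sorted(set(predicted), reverse=True):
        tp = sum(1 for a, s in zip(actual, predicted) if a and s >= cutoff)
        fp = sum(1 for a, s in zip(actual, predicted) if not a and s >= cutoff)
        points.append((fp / an, tp / ap))
    return points


# Hypothetical toy values, not data from the slides.
measured = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
predicted = [0.95, 0.6, 0.4, 0.5, 0.3, 0.1]
points = roc_points(measured, predicted)
for fpr, tpr in points:
    print(f"1 - spec = {fpr:.2f}, sens = {tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1); how far it bows towards the top left corner is what the AUC summarises.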
ECCB/ISMB-2009 - Immunological Bioinformatics Tutorial
CALCULATING A ROC CURVE
True positive: 4     False positive: 1
False negative: 10   True negative: 12
TP/(TP+FN) = 4/14 = 0.29
FP/(FP+TN) = 1/13 = 0.08
Threshold  TP  FN  TP/(TP+FN)  FP  TN  FP/(FP+TN)
> 0.8       4  10  0.29         1  12  0.08
> 0.6       8   6  0.57         3  10  0.23
> 0.4      11   3  0.79         6   7  0.46
> 0.2      13   1  0.93         9   4  0.69
> 0        14   0  1.00        13   0  1.00
[Figure: ROC curve of true positive rate against false positive rate, both axes from 0 to 1, with reference curves labelled AUC = 0.5 and AUC = 1.0]
$AUC = \int_0^1 f(x)\,dx$
AUC = 0.71
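The integral can be approximated from the tabulated thresholds with the trapezoidal rule. The sketch below does this for the five rows of the table plus the origin; the function name `trapezoid_auc` is hypothetical, and the trapezoidal rule is one possible choice since the slides do not say how the area was computed. With only these five thresholds the estimate comes out near 0.73, close to but not identical with the 0.71 on the slide, which may have been computed from a finer sweep.

```python
# Rows of the threshold table: (threshold, TP, FN, FP, TN).
rows = [
    (0.8, 4, 10, 1, 12),
    (0.6, 8, 6, 3, 10),
    (0.4, 11, 3, 6, 7),
    (0.2, 13, 1, 9, 4),
    (0.0, 14, 0, 13, 0),
]

# One ROC point per row: x = FP/(FP+TN), y = TP/(TP+FN); add the origin.
points = [(0.0, 0.0)] + [
    (fp / (fp + tn), tp / (tp + fn)) for _, tp, fn, fp, tn in rows
]


def trapezoid_auc(points):
    """Area under a piecewise-linear curve through the given points."""
    pts = sorted(points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


auc = trapezoid_auc(points)
print(f"AUC (trapezoidal, 5 thresholds) = {auc:.2f}")
```

The same function returns 0.5 for the diagonal from (0, 0) to (1, 1), the reference curve for a random predictor.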
DEALING WITH SEQUENCE REDUNDANCY
OUTLINE
• What is data redundancy?
• Why is it a problem?
• How can we deal with it?
DATABASES ARE REDUNDANT
• Biological reasons
  – Some protein functions or sequence motifs are more common than others
• Laboratory artifacts
  – Some protein families have been heavily investigated, others not
  – Mutagenesis studies make large sets of almost identical replicas
DATA REDUNDANCY
10 MHC restricted peptides:
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
What is similarity?
• Sequence identity?
• BLAST E-values
  – Often too conservative
• Other
DFLKKVPDDHLEFIPYLILGEVFPEWDERELGVGEKLLIKAVA------------MATGIDAKEIEESVKDTGDL-GEDVLLGADDGSLAFVP---------- SEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGE
ACDFG vs. ACEFG: 80% ID, versus 24% ID for the longer alignment above
OLE LUND ET AL. (PROTEIN ENGINEERING 1997)
%ID = 290/sqrt(alen)
alen = 100: %ID = 29
alen = 30: %ID = 53
DSSP secondary structure identity in alignments as a function of the alignment length and the percent sequence identity
MHC BINDING PEPTIDES
9mer: %ID = 290/sqrt(9) = 97%
15mer: %ID = 290/sqrt(15) = 75%
8/9 = 89% < 97%
11/15 = 73% < 75%
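The arithmetic on this slide can be checked with a short Python sketch that applies the %ID = 290/sqrt(alen) threshold to the peptides from the DATA REDUNDANCY slide. The function names are hypothetical, identity is counted position by position without gaps, and treating a pair as similar when its identity reaches the threshold (`>=`) is an assumption; the slides only show the comparison itself.

```python
from math import sqrt


def id_threshold(alen):
    """Percent identity threshold for an alignment of length alen."""
    return 290.0 / sqrt(alen)


def percent_identity(a, b):
    """Ungapped percent identity between two peptides of equal length."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def is_similar(a, b):
    # Assumption: a pair is similar when identity reaches the threshold.
    return percent_identity(a, b) >= id_threshold(len(a))


# The 10 peptides listed on the DATA REDUNDANCY slide.
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

first, second = peptides[0], peptides[1]
print(f"9mer threshold: {id_threshold(9):.0f}%")
print(f"{first} vs {second}: {percent_identity(first, second):.0f}% identity")
print("similar by this criterion:", is_similar(first, second))
```

Two 9mers that differ in a single position share 8/9 of their residues, which stays below the 9mer threshold, so under this criterion only identical 9mers would be flagged as similar.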