Bioinformatics and Statistics: A Real World Example Joseph D. Szustakowski.
Date posted: 18-Dec-2015
Bioinformatics and Statistics: A Real World Example
Joseph D. Szustakowski
Words of Encouragement
• “There are three kinds of lies: lies, damned lies, and statistics” – Benjamin Disraeli
• “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination”
• “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates
Outline
• Basic idea – what are we trying to do?
• Extreme Value Distribution – a brief review
• Overwhelm audience with lots of pictures
The Basic Idea
• Most experiments result in one or more quantitative measurements
  – Height, length, weight, time, speed
  – SW score, threading potential, Viterbi score
• Is that measurement unusual?
  – Tall or short; heavy or light; fast or slow
  – ‘Good’ or ‘bad’; homologous or non-homologous
The Basic Idea
• ‘Unusualness’ is only definable if we know what ‘usual’ is.
  – Make lots of random measurements
  – Model the ‘background’ distribution
  – Compare measurements of interest to the background
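The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the background here is a hypothetical set of standard-normal "measurements", and `empirical_p_value` is a name made up for this example. It reports the fraction of background values at least as large as the measurement of interest.

```python
import random

def empirical_p_value(observed, background):
    """Fraction of background scores at least as large as the observed one."""
    hits = sum(1 for score in background if score >= observed)
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (len(background) + 1)

rng = random.Random(0)
# Hypothetical background: 10,000 random measurements (an assumption for illustration).
background = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

print(empirical_p_value(3.0, background))  # an unusual measurement: small value
print(empirical_p_value(0.0, background))  # a typical measurement: roughly one half
```

With only a finite background sample, the smallest p-value this can report is about 1 / (number of background measurements), which is one reason to fit a model to the background instead.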
What’s the Point?
• Are our results good / bad / the same / meaningful / garbage?
• Consultation with an oracle
  – Definitive
  – Elusive
• Magic eight ball
  – Readily available
  – Inconsistent results
What’s the Point?
• Statistics
  – Readily available
  – Reproducible
  – Provide an estimate of how likely it is that a better / worse / same result could be obtained by chance.
Background Distributions
• Gaussian – sum of independent variables (central limit theorem)
• Extreme Value Distributions – optimization procedures
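A small simulation can make the contrast concrete. The sketch below is an illustration under assumptions not taken from the slides: it draws exponential random variables, then compares the sample skewness of their sums (which should look nearly symmetric) with that of their maxima (a stand-in for "the best score found by an optimization", which should be clearly right-skewed). The function names and parameters are hypothetical.

```python
import random

def skewness(values):
    """Sample skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulate(trials=5000, draws=100, seed=0):
    """Return (skewness of sums, skewness of maxima) over repeated trials."""
    rng = random.Random(seed)
    sums, maxima = [], []
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(draws)]
        sums.append(sum(sample))    # sum of independent variables
        maxima.append(max(sample))  # best value in the sample
    return skewness(sums), skewness(maxima)

skew_sum, skew_max = simulate()
print(f"skewness of sums:   {skew_sum:.2f}")
print(f"skewness of maxima: {skew_max:.2f}")
```

The maxima should show a long right tail, which is why scores that come out of a "pick the best" procedure are usually not well described by a Gaussian background.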
Extreme Value Distributions – a Brief Review
• Extreme value distributions often result from optimization procedures
  – Sequence alignments (BLAST, SW)
  – Viterbi algorithm (HMMER, SAM)
• EVDs are skewed
• EVDs have a ‘heavy’ tail
Real World Example
• Protein structure alignment
  – Identify equivalent backbone positions in two proteins
  – Maximize the number of equivalent pairs
  – Minimize the distance between pairs
• K2
  – Target function to evaluate alignments
  – Searches for the best alignment
    • Dynamic programming
    • Weighted bipartite matching
    • Genetic algorithm
    • Simulated annealing
    • Kitchen sink (in progress)
Serine Proteases
• Yellow – human protease
• Red – viral coat protein
• Asp-His-Ser catalytic triad shown in balls and sticks
DNA Methyltransferases
[Figure: M.TaqI and M.PvuII, with structural elements labeled 1–7, A–E and Z, and the N and C termini marked]
Background Distribution
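[Equation: normalized score Z defined from the raw score S and ln(L); the operators were lost in extraction]

Extreme value distribution
$$\rho(Z) = e^{-Z}\, e^{-e^{-Z}}$$

P-Value
$$P(Z) = 1 - e^{-e^{-Z}}$$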
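As a rough sketch, the P-value can be computed directly from a normalized score Z. The code assumes the standard (Gumbel) extreme value form, with density exp(-Z) * exp(-exp(-Z)) and P(Z) = 1 - exp(-exp(-Z)); the function names are hypothetical, and the step from a raw score to Z is not covered here.

```python
import math

def evd_density(z):
    """Standard extreme value (Gumbel) density: exp(-z) * exp(-exp(-z))."""
    return math.exp(-z - math.exp(-z))

def evd_p_value(z):
    """Probability of a normalized score of at least z arising by chance."""
    # 1 - exp(-x) written as -expm1(-x) to stay accurate when x is tiny.
    return -math.expm1(-math.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"Z = {z:4.1f}  P = {evd_p_value(z):.3e}")
```

For large Z the P-value is approximately exp(-Z), so each extra unit of Z lowers the P-value by a factor of about e. This tail is much heavier than a Gaussian's.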
Table 2. K2 and BLAST Sensitivities

Positive Set               LL      LR      UL   UR     Total   K2 Sensitivity  BLAST Sensitivity
Fold, 95% Identity         31909   133709  79   34092  199789  83%             16%
Fold, 40% Identity         1582    22237   8    7927   31754   75%             5%
Superfamily, 95% Identity  31909   95722   79   16541  144251  89%             22%
Superfamily, 40% Identity  1582    11519   8    2873   15982   82%             10%
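The arithmetic behind the table can be checked with a short script. The slides do not define LL, LR, UL and UR, so the reading used here is an assumption: K2 is taken to detect the pairs counted in LL and LR, and BLAST those in LL and UL. Under that reading the totals match exactly and the sensitivities agree with the tabulated values to within one percentage point, which is suggestive but not a confirmation.

```python
# Counts copied from Table 2, in the order (LL, LR, UL, UR).
TABLE2 = {
    "Fold, 95% identity": (31909, 133709, 79, 34092),
    "Fold, 40% identity": (1582, 22237, 8, 7927),
    "Superfamily, 95% identity": (31909, 95722, 79, 16541),
    "Superfamily, 40% identity": (1582, 11519, 8, 2873),
}

def sensitivities(ll, lr, ul, ur):
    """Return (total, K2 sensitivity, BLAST sensitivity) for one row."""
    total = ll + lr + ul + ur
    k2 = (ll + lr) / total     # assumption: K2 detects LL and LR
    blast = (ll + ul) / total  # assumption: BLAST detects LL and UL
    return total, k2, blast

for name, counts in TABLE2.items():
    total, k2, blast = sensitivities(*counts)
    print(f"{name}: total={total}, K2={k2:.1%}, BLAST={blast:.1%}")
```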