Scores and substitution matrices in sequence alignment Sushmita Roy BMI/CS 576 Sushmita Roy...

Scores and substitution matrices in sequence alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy [email protected] Sep 11 th , 2014


Page 1: Scores and substitution matrices in sequence alignment Sushmita Roy BMI/CS 576  Sushmita Roy sroy@biostat.wisc.edu Sep 11 th,

Scores and substitution matrices in sequence alignment

Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu
Sep 11th, 2014


Key concepts in today’s class

• Probability distributions
• Discrete and continuous distributions
• Joint, conditional and marginal distributions
• Statistical independence
• Probabilistic interpretation of scores in alignment algorithms
• Different substitution matrices
• Estimating simple substitution matrices
• Assessing significance of scores


RECAP

• Four issues in sequence alignment
  – Type of alignment
  – Algorithm for alignment
  – Scores for alignment
  – Significance of alignment scores


Guiding principles of scores in alignments

• We want to know whether the alignment we observed is meaningful or arose by chance

• By meaningful we mean that the sequences have a similar biological function

• To assess whether what we observe could have arisen by chance, we need to understand some concepts in probability theory


PROBABILITY PRIMER


Definition of probability

• Intuitively, we use “probability” to refer to our degree of confidence in an event of an uncertain nature.

• Always a number in the interval [0,1]
  – 0 means “never occurs”
  – 1 means “always occurs”

• Frequentist interpretation: the fraction of times an event will be true if we repeated the experiment indefinitely

• Bayesian interpretation: degree of belief in an event


Sample spaces

• Sample space: a set of possible outcomes for some experiment
• Examples
  – Flight to Chicago: {on time, late}
  – Lottery: {ticket 1 wins, ticket 2 wins, …, ticket n wins}
  – Weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet}
  – Roll of a die: {1,2,3,4,5,6}
  – Coin toss: {Heads, Tails}


Random variables

• Random variable: A variable that represents the outcome of an experiment

• A random variable can be
  – Discrete: outcomes take a fixed set of values
    • Roll of a die, flight to Chicago, weather tomorrow
  – Continuous: outcomes take continuous values
    • Height, weight


Notation

• Uppercase letters and words denote random variables– X, Y

• Lowercase letters and words denote values– x, y

• Probability that X takes value x: $P(X = x)$

• We’ll also use the shorthand form $P(x)$ for $P(X = x)$

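• For Boolean random variables, we’ll use the shorthand $P(\text{fever})$ for $P(\text{Fever} = \text{true})$ and $P(\neg\text{fever})$ for $P(\text{Fever} = \text{false})$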


Discrete Probability distributions

• A probability distribution is a mathematical function that specifies the probability of each possible outcome of a random variable

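• We denote this as P(X) for random variable X
• It specifies the probability of each possible value x of X
• Requirements:
  – $P(x) \ge 0$ for every x
  – $\sum_x P(x) = 1$

[Figure: bar chart of an example distribution over {sun, clouds, rain, snow, sleet}; probability axis with ticks at 0.1, 0.2, 0.3]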


Joint probability distributions

• Joint probability distribution: the function given by P(X = x, Y = y)

• Read “X equals x and Y equals y”
• Example

x, y             P(X = x, Y = y)
sun, on-time     0.20   (probability that it’s sunny and my flight is on time)
rain, on-time    0.20
snow, on-time    0.05
sun, late        0.10
rain, late       0.30
snow, late       0.15


Marginal probability distributions

• The marginal distribution of X is defined by
  $P(X = x) = \sum_y P(X = x, Y = y)$
  “the distribution of X ignoring other variables”

• This definition generalizes to more than two variables, e.g.
  $P(X = x) = \sum_y \sum_z P(X = x, Y = y, Z = z)$


Marginal distribution example

joint distribution

x, y             P(X = x, Y = y)
sun, on-time     0.20
rain, on-time    0.20
snow, on-time    0.05
sun, late        0.10
rain, late       0.30
snow, late       0.15

marginal distribution for X

x       P(X = x)
sun     0.3
rain    0.5
snow    0.2
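The marginalization in this example can be sketched in a few lines of Python. The joint values are the ones in the table; the helper name `marginal` is an illustrative choice, not something from the lecture.

```python
from collections import defaultdict

# Joint distribution P(X = x, Y = y) from the weather/flight table.
joint = {
    ("sun", "on-time"): 0.20,
    ("rain", "on-time"): 0.20,
    ("snow", "on-time"): 0.05,
    ("sun", "late"): 0.10,
    ("rain", "late"): 0.30,
    ("snow", "late"): 0.15,
}

def marginal(joint, axis):
    """Sum the joint over every variable except the one at position `axis`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)

p_x = marginal(joint, 0)  # weather: sun, rain, snow
p_y = marginal(joint, 1)  # flight status: on-time, late
print(p_x)
print(p_y)
```

Summing over the other axis gives the marginal for Y with the same function.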


Conditional distributions

• The conditional distribution of X given Y is defined as:
  $P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$

• Or in short
  $P(x \mid y) = \frac{P(x, y)}{P(y)}$

• The distribution of X given that we know the value of Y

• Intuitively, how much does knowing Y tell us about X?


Conditional distribution example

joint distribution

x, y             P(X = x, Y = y)
sun, on-time     0.20
rain, on-time    0.20
snow, on-time    0.05
sun, late        0.10
rain, late       0.30
snow, late       0.15

conditional distribution for X given Y=on-time

x       P(X = x|Y=on-time)
sun     0.20/0.45 = 0.444
rain    0.20/0.45 = 0.444
snow    0.05/0.45 = 0.111
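A minimal Python sketch of the same calculation, dividing each joint entry by the marginal of the conditioning value. The function name `conditional_on_y` is hypothetical.

```python
# Joint distribution P(X = x, Y = y) from the weather/flight table.
joint = {
    ("sun", "on-time"): 0.20,
    ("rain", "on-time"): 0.20,
    ("snow", "on-time"): 0.05,
    ("sun", "late"): 0.10,
    ("rain", "late"): 0.30,
    ("snow", "late"): 0.15,
}

def conditional_on_y(joint, y):
    """Return P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) for every x."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

p_x_given_on_time = conditional_on_y(joint, "on-time")
for x, p in p_x_given_on_time.items():
    print(f"{x}: {p:.3f}")
```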


Independence

• Two random variables, X and Y, are independent if
  $P(X = x, Y = y) = P(X = x)\,P(Y = y)$ for all x, y

• Another way to think about this is that knowing X does not tell us anything about Y


Independence example #1

joint distribution

x, y             P(X = x, Y = y)
sun, on-time     0.20
rain, on-time    0.20
snow, on-time    0.05
sun, late        0.10
rain, late       0.30
snow, late       0.15

marginal distributions

x       P(X = x)
sun     0.3
rain    0.5
snow    0.2

y          P(Y = y)
on-time    0.45
late       0.55

Are X and Y independent here? NO. For example, P(sun, on-time) = 0.20, but P(sun) P(on-time) = 0.3 × 0.45 = 0.135.


Independence example #2

joint distribution

x, y                   P(X = x, Y = y)
sun, fly-United        0.27
rain, fly-United       0.45
snow, fly-United       0.18
sun, fly-Northwest     0.03
rain, fly-Northwest    0.05
snow, fly-Northwest    0.02

marginal distributions

x       P(X = x)
sun     0.3
rain    0.5
snow    0.2

y                P(Y = y)
fly-United       0.9
fly-Northwest    0.1

Are X and Y independent here? YES. Every joint entry equals the product of its marginals, e.g. P(sun, fly-United) = 0.27 = 0.3 × 0.9.
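Both independence examples can be checked mechanically by comparing every joint entry with the product of its marginals. The sketch below uses the two tables from the examples; the function name and the tolerance are illustrative assumptions.

```python
from itertools import product

def is_independent(joint, tol=1e-9):
    """True if P(x, y) equals P(x) * P(y) for every pair of values."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x, y in product(xs, ys)
    )

# Example #1: weather and flight delay.
weather_delay = {
    ("sun", "on-time"): 0.20, ("rain", "on-time"): 0.20, ("snow", "on-time"): 0.05,
    ("sun", "late"): 0.10, ("rain", "late"): 0.30, ("snow", "late"): 0.15,
}
# Example #2: weather and airline.
weather_airline = {
    ("sun", "fly-United"): 0.27, ("rain", "fly-United"): 0.45, ("snow", "fly-United"): 0.18,
    ("sun", "fly-Northwest"): 0.03, ("rain", "fly-Northwest"): 0.05, ("snow", "fly-Northwest"): 0.02,
}

print(is_independent(weather_delay))
print(is_independent(weather_airline))
```

A tolerance is used because the products are computed in floating point.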


Conditional independence

• Two random variables X and Y are conditionally independent given Z if
  $P(X \mid Y, Z) = P(X \mid Z)$
  “once you know the value of Z, knowing Y doesn’t tell you anything about X”

• Alternatively
  $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$


Conditional independence example

Flu      Fever    Headache    P
true     true     true        0.04
true     true     false       0.04
true     false    true        0.01
true     false    false       0.01
false    true     true        0.009
false    true     false       0.081
false    false    true        0.081
false    false    false       0.729

Are Fever and Headache independent? NO. P(fever, headache) = 0.04 + 0.009 = 0.049, but P(fever) P(headache) = 0.17 × 0.14 = 0.0238.


Conditional independence example

Flu      Fever    Headache    P
true     true     true        0.04
true     true     false       0.04
true     false    true        0.01
true     false    false       0.01
false    true     true        0.009
false    true     false       0.081
false    false    true        0.081
false    false    false       0.729

Are Fever and Headache conditionally independent given Flu? YES. For example, P(fever, headache | flu) = 0.04/0.1 = 0.4, which equals P(fever | flu) P(headache | flu) = 0.8 × 0.5.
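The conditional-independence check can be sketched in Python by testing the factorization for both values of Flu. The joint values come from the table; the helper names `prob` and `cond_independent_given_flu` are hypothetical.

```python
# Joint distribution P(Flu, Fever, Headache) from the table.
joint = {
    (True, True, True): 0.04,
    (True, True, False): 0.04,
    (True, False, True): 0.01,
    (True, False, False): 0.01,
    (False, True, True): 0.009,
    (False, True, False): 0.081,
    (False, False, True): 0.081,
    (False, False, False): 0.729,
}

def prob(joint, flu=None, fever=None, headache=None):
    """Sum the joint over all rows consistent with the values that are fixed."""
    want = (flu, fever, headache)
    return sum(
        p for key, p in joint.items()
        if all(w is None or w == k for w, k in zip(want, key))
    )

def cond_independent_given_flu(joint, tol=1e-9):
    """Check P(fever, headache | flu) == P(fever | flu) * P(headache | flu)."""
    for flu in (True, False):
        p_flu = prob(joint, flu=flu)
        for fever in (True, False):
            for headache in (True, False):
                lhs = prob(joint, flu, fever, headache) / p_flu
                rhs = (prob(joint, flu, fever=fever) / p_flu) * (
                    prob(joint, flu, headache=headache) / p_flu
                )
                if abs(lhs - rhs) > tol:
                    return False
    return True

print(cond_independent_given_flu(joint))
# Marginally, the joint and the product of marginals differ:
print(prob(joint, fever=True, headache=True),
      prob(joint, fever=True) * prob(joint, headache=True))
```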


Chain rule of probability

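• For two variables
  $P(X, Y) = P(X \mid Y)\,P(Y)$

• For three variables
  $P(X, Y, Z) = P(X \mid Y, Z)\,P(Y \mid Z)\,P(Z)$

• etc.
• To see that this is true, note that
  $P(X, Y, Z) = \frac{P(X, Y, Z)}{P(Y, Z)} \cdot \frac{P(Y, Z)}{P(Z)} \cdot P(Z)$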


Example discrete distributions

• Binomial distribution

• Multinomial distribution


The binomial distribution

• Two outcomes per trial of an experiment
• Distribution over the number of successes in a fixed number n of independent trials (with same probability of success p in each)
  $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$

• e.g. the probability of x heads in n coin flips

[Figure: two plots of P(X = x) against x, one for p=0.5 and one for p=0.1]
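As a rough illustration, the binomial probabilities for the two values of p in the figure can be computed directly from the standard formula. The choice n = 10 is an assumption made for this sketch, not a value taken from the slide.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x): probability of x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n = 10  # number of coin flips (illustrative choice)
for p in (0.5, 0.1):
    pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mode = pmf.index(max(pmf))
    print(f"p={p}: most likely x={mode}, total probability={sum(pmf):.6f}")
```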


The multinomial distribution

• k possible outcomes on each trial
• Probability $p_i$ for outcome $x_i$ in each trial
• Distribution over the number of occurrences $x_i$ for each outcome in a fixed number n of independent trials
  $P(\mathbf{x}) = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i}$, where $\mathbf{x} = (x_1, \ldots, x_k)$ is the vector of outcome occurrences

• e.g. with k=6 (a six-sided die) and n=30
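A small sketch of the multinomial probability for the die setting (k = 6, n = 30). The count vector below is made up for illustration and the die is assumed to be fair; neither comes from the slide.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(x_1, ..., x_k) for outcome counts `counts` in n = sum(counts) trials."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # stays an exact integer at every step
    return coef * prod(p**c for p, c in zip(probs, counts))

# Fair six-sided die rolled n = 30 times; hypothetical count vector.
counts = [5, 5, 5, 5, 5, 5]
probs = [1 / 6] * 6
print(multinomial_pmf(counts, probs))
```

With k = 2 the same function reduces to the binomial distribution.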


Continuous random variables

• When our outcome is a continuous number we need a continuous random variable
• Examples: weight, height
• We specify a density function f(x) for random variable X such that
  $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$

• Probability values are derived by integrating the density over an interval

• Probability of taking on a single value is 0.


Continuous random variables contd

• To define a probability distribution for a continuous variable, we need to integrate f(x)
  $P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$
  $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$


Examples of continuous distributions

• Gaussian distribution

• Exponential distribution

• Extreme Value distribution


Gaussian distribution

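• Density with mean μ and standard deviation σ:
  $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$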


Extreme Value Distribution

• Used for describing the distribution of extreme values of another distribution

[Figure: density f(x) of the max values from 1000 sets of 500 samples from a standard normal distribution]
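The experiment behind the figure can be reproduced approximately with the standard library. This is only a sketch of the simulation described in the caption; the seed and the summary statistics printed are arbitrary choices.

```python
import random
import statistics

def simulate_maxima(n_sets=1000, n_samples=500, seed=0):
    """Draw n_sets samples of size n_samples from N(0, 1); keep each maximum."""
    rng = random.Random(seed)
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(n_samples))
        for _ in range(n_sets)
    ]

maxima = simulate_maxima()
print(f"mean of the maxima: {statistics.mean(maxima):.2f}")
print(f"range of the maxima: {min(maxima):.2f} to {max(maxima):.2f}")
```

A histogram of `maxima` is skewed to the right, unlike the symmetric normal distribution the samples were drawn from.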


END PROBABILITY PRIMER


Guiding principles of scores in alignments

• We need to assess whether an alignment is biologically meaningful

• Compute the probability of seeing an alignment under two models
  – Related model R
  – Unrelated model U


R: related model

• Assume each pair of aligned positions evolved from a common ancestor

• Let $p_{ab}$ be the probability of observing the aligned pair {a,b}

• Probability of an alignment between x and y is
  $P(x, y \mid R) = \prod_i p_{x_i y_i}$


U: unrelated model

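• Assume the amino acid at each position occurs independently of the amino acids at all other positions, including the one aligned to it in the other sequence.

• Let $q_a$ be the probability of amino acid a
• The probability of an n-character alignment of x and y is
  $P(x, y \mid U) = \prod_{i=1}^{n} q_{x_i} \prod_{i=1}^{n} q_{y_i}$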


Determine which model is more likely

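• The score of an alignment is based on the relative likelihood of the alignment under R versus U
  $\frac{P(x, y \mid R)}{P(x, y \mid U)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$

• Taking the log we get
  $S = \sum_i \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$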


Score of an alignment

Substitution matrix entry should thus be
  $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$

Score of an alignment is
  $S = \sum_i s(x_i, y_i)$
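To make the log-odds score concrete, here is a minimal sketch over a toy two-letter alphabet. The probabilities in `p` and `q` are invented for illustration, and the base-2 logarithm is an assumption (the slides do not fix the base).

```python
from math import log2

# Hypothetical background frequencies q_a and pair probabilities p_ab.
q = {"A": 0.5, "S": 0.5}
p = {("A", "A"): 0.4, ("A", "S"): 0.1, ("S", "A"): 0.1, ("S", "S"): 0.4}

def substitution_score(a, b):
    """Log-odds entry s(a, b) = log2(p_ab / (q_a * q_b))."""
    return log2(p[(a, b)] / (q[a] * q[b]))

def alignment_score(x, y):
    """Score of an ungapped alignment: the sum of per-position entries."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(substitution_score(a, b) for a, b in zip(x, y))

print(substitution_score("A", "A"))  # positive: pair is more likely under R
print(substitution_score("A", "S"))  # negative: pair is more likely under U
print(alignment_score("AAS", "ASS"))
```

Pairs that are more probable under the related model than under the unrelated model get positive scores, and the alignment score adds them up position by position.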


How to estimate the probabilities?

• Need a good set of confirmed alignments
• Depends upon what we know about when the two sequences might have diverged
  – $p_{ab}$ for closely related species is likely to be low if a ≠ b
  – $p_{ab}$ for species that diverged a long time ago is likely to be close to the background frequency $q_a q_b$


Some common substitution matrices

• BLOSUM matrices [Henikoff and Henikoff, 1992]
  – BLOSUM45
  – BLOSUM50
  – BLOSUM62

• The number is the percent identity threshold of the sequences used to construct the substitution matrix

• PAM [Dayhoff et al, 1978]
• Empirically, BLOSUM62 works the best


BLOSUM matrices

• BLOcks SUbstitution Matrix
• Derived from a set of aligned ungapped regions from protein families, called blocks
  – Reside in the BLOCKS database

• Cluster proteins so that sequences with no less than L% similarity fall in the same cluster
  – BLOSUM50: proteins with >50% similarity are in the same group
  – BLOSUM62: proteins with >62% similarity are in the same group

• Calculate substitution frequencies $A_{ab}$ of observing a in one cluster and b in another cluster


Example substitution scoring matrix (BLOSUM62)


Estimating the probabilities in BLOSUM

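$q_a = \frac{\sum_b A_{ab}}{\sum_{c,d} A_{cd}}$   (numerator: number of occurrences of a)

$p_{ab} = \frac{A_{ab}}{\sum_{c,d} A_{cd}}$   (numerator: number of occurrences of a and b)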


Example of BLOSUM matrix calculation

1 A T C K Q
2 A T C R N
3 A S C K N
4 S S C R N

5 S D C E Q
6 S E C E N

7 T E C R Q

Three blocks at 50% similarity
Seven blocks at 62% similarity

Probabilities at 62% similarity
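One way to work through the 62% case is sketched below, under two assumptions that are not stated on the slide: each of the seven sequences is its own cluster, and every pair in a column is counted in both orders so that the count matrix A is symmetric. The numbers it prints are therefore illustrative and are not claimed to match the original slide.

```python
from collections import Counter
from itertools import permutations
from math import log2

# The seven sequences from the example, one string per row.
block = ["ATCKQ", "ATCRN", "ASCKN", "SSCRN", "SDCEQ", "SECEN", "TECRQ"]

# A[(a, b)]: number of times a and b are aligned in a column, counted over
# ordered pairs of distinct sequences (assumption: one sequence per cluster).
A = Counter()
for column in zip(*block):
    for a, b in permutations(column, 2):
        A[(a, b)] += 1

total = sum(A.values())

# p_ab = A_ab / sum_cd A_cd
p = {pair: count / total for pair, count in A.items()}

# q_a = sum_b A_ab / sum_cd A_cd
q = Counter()
for (a, _), count in A.items():
    q[a] += count / total

def score(a, b):
    """Log-odds score log2(p_ab / (q_a * q_b)) for a pair seen in the block."""
    return log2(p[(a, b)] / (q[a] * q[b]))

print(total)            # number of ordered pairs counted
print(q["C"], p[("C", "C")])
print(score("C", "C"))
```

Pairs that never co-occur in a column have no entry in `p`, so `score` is only defined for observed pairs; real matrices are estimated from far more data.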


Assessing significance of the alignment score

• There are two ways to do this
  – Bayesian framework
  – Classical approach


The classical approach to assessing significance: Extreme Value Distribution

• Suppose we have a particular substitution matrix and amino-acid frequencies

• We need to consider random sequences of lengths m and n and find the best alignment of these sequences

• This will give us a distribution over alignment scores for random pairs of sequences

• If the probability of a random score being greater than our alignment score is small, we can consider our score to be significant


The extreme value distribution

• We’re picking the best alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like

• This is given by an extreme value distribution

[Figure: extreme value density P(x) plotted against x]


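Assessing significance of sequence alignment scores

• It can be shown that the mode of the distribution for optimal scores is
  $U = \frac{\ln(K m n)}{\lambda}$
  – m, n are the lengths of the two sequences
  – K, λ estimated from the substitution matrix

• Probability of observing a score greater than S
  $P(x > S) = 1 - \exp\left(-e^{-\lambda (S - U)}\right)$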
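The tail probability of the extreme value distribution is straightforward to evaluate once K and λ are known. In the sketch below the values of `K`, `lam`, `m` and `n` are placeholders chosen for illustration; they are not estimates for any real substitution matrix.

```python
from math import exp, log

def evd_mode(m, n, K, lam):
    """Mode U = ln(K * m * n) / lambda of the optimal-score distribution."""
    return log(K * m * n) / lam

def p_value(S, m, n, K, lam):
    """P(x > S) = 1 - exp(-exp(-lambda * (S - U)))."""
    U = evd_mode(m, n, K, lam)
    return 1.0 - exp(-exp(-lam * (S - U)))

# Placeholder parameters (assumptions, not fitted values).
K, lam = 0.1, 0.3
m, n = 200, 300

U = evd_mode(m, n, K, lam)
for S in (U, U + 10, U + 20):
    print(f"S={S:.1f}  P(x > S)={p_value(S, m, n, K, lam):.4g}")
```

The probability shrinks quickly as S moves above the mode, which is what makes a high score significant.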


Bayes theorem

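• An extremely useful theorem
  $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} = \frac{P(y \mid x)\,P(x)}{\sum_{x'} P(y \mid x')\,P(x')}$
• There are many cases when it is hard to estimate P(x | y) directly, but it’s not too hard to estimate P(y | x) and P(x)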


Bayes theorem example

• MDs usually aren’t good at estimating P(Disorder | Symptom)

• They’re usually better at estimating P(Symptom | Disorder)
• If we can estimate P(Fever | Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis
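As a small worked case, the diagnosis calculation can be sketched with numbers derived from the Flu/Fever/Headache table used in the conditional independence example: P(Fever | Flu) = 0.08/0.1 = 0.8, P(Flu) = 0.1 and P(Fever | not Flu) = 0.09/0.9 = 0.1. The function name `posterior` is hypothetical.

```python
def posterior(p_symptom_given_disorder, p_disorder, p_symptom_given_healthy):
    """P(Disorder | Symptom) by Bayes' theorem for a two-valued Disorder."""
    numerator = p_symptom_given_disorder * p_disorder
    # P(Symptom) obtained by summing over both values of Disorder.
    evidence = numerator + p_symptom_given_healthy * (1.0 - p_disorder)
    return numerator / evidence

# P(Fever | Flu), P(Flu), P(Fever | not Flu) derived from the earlier table.
p_flu_given_fever = posterior(0.8, 0.1, 0.1)
print(f"P(Flu | Fever) = {p_flu_given_fever:.3f}")
```

Even though fever is common among flu patients, the low prior P(Flu) keeps the posterior below one half.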