Evolution and Scoring Rules
Example: Score = 5 x (# matches) + (-4) x (# mismatches) + (-7) x (total length of all gaps)
Example: Score = 5 x (# matches) + (-4) x (# mismatches) + (-5) x (# gap openings) + (-2) x (total length of all gaps)
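The two example rules can be applied to an alignment that has already been written out with gap characters. The following is a minimal sketch, not part of the original slides: the function names, the use of "-" as the gap symbol, and the sample alignment are all assumptions made for illustration.

```python
def alignment_stats(row1, row2):
    """Count matches, mismatches, gap openings and total gap length.

    row1 and row2 are the two rows of an alignment, with '-' marking a gap
    (an assumed convention). A gap opening is counted whenever a gap starts
    or switches from one row to the other.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gap_openings = gap_length = 0
    prev_gap_row = None  # row that held a gap in the previous column
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            cur_gap_row = 1 if x == "-" else 2
            gap_length += 1
            if cur_gap_row != prev_gap_row:
                gap_openings += 1
            prev_gap_row = cur_gap_row
        else:
            prev_gap_row = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gap_openings, gap_length


def linear_gap_score(row1, row2):
    """First example rule: 5 per match, -4 per mismatch, -7 per gap position."""
    m, mm, _, gl = alignment_stats(row1, row2)
    return 5 * m - 4 * mm - 7 * gl


def affine_gap_score(row1, row2):
    """Second example rule: -5 per gap opening plus -2 per gap position."""
    m, mm, go, gl = alignment_stats(row1, row2)
    return 5 * m - 4 * mm - 5 * go - 2 * gl


# Hypothetical alignment: 3 matches, 1 mismatch, one gap of length 2.
print(linear_gap_score("ACGTTA", "AC--TG"))  # 15 - 4 - 14 = -3
print(affine_gap_score("ACGTTA", "AC--TG"))  # 15 - 4 - 5 - 4 = 2
```

The same alignment scores higher under the second rule because the two gap positions are charged as one opening plus two extensions rather than as two independent gaps.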
Scoring Matrices
Scoring Rules vs. Scoring Matrices
Nucleotide vs. Amino Acid Sequence
The choice of a scoring rule can strongly influence the outcome of sequence analysis.
Scoring matrices implicitly represent a particular theory of evolution.
Elements of the matrices specify the similarity of one residue to another.
Replication: DNA -> DNA
Transcription: DNA (A T G C) -> RNA (A U G C), 1:1
Translation: RNA -> Protein (20 amino acids), 3:1
Translation - Protein Synthesis: Every 3 nucleotides (a codon) are translated into one amino acid. The nucleotide sequence determines the amino acid sequence.
RNA 5’ -> 3’ corresponds to Protein N-term -> C-term
Log Likelihoods used as Scoring Matrices:
PAM (% Accepted Mutations): 1500 changes in 71 groups w/ > 85% similarity
BLOSUM (Blocks Substitution Matrix): 2000 “blocks” from 500 families
BLOSUM:
$$S_{ij} = 2 \log_2 \frac{p_{ij}}{p_i p_j}$$
Likelihood Ratio for Aligning a Single Pair of Residues
•Numerator: the probability that two residues are aligned by evolutionary descent
•Denominator: the probability that they are aligned by chance
•p_i, p_j are the frequencies of residues i and j in all protein sequences (abundance)
$$S_{ij} = \log \frac{\Pr(i \text{ aligned with } j \mid \text{common ancestry})}{\Pr(i \text{ aligned with } j \mid \text{by chance})} = \log \frac{p_{ij}}{p_i p_j}$$
Likelihood Ratio of Aligning Two Sequences
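$$\begin{aligned}
\text{log lik ratio of alignment}
&= \log \frac{\Pr(\text{alignment} \mid \text{common ancestry})}{\Pr(\text{alignment} \mid \text{by chance})} \\
&= \log \prod_{i \in s,\, j \in t} \frac{\Pr(i \text{ aligned with } j \mid \text{common ancestry})}{\Pr(i \text{ aligned with } j \mid \text{by chance})} \\
&= \log \prod_{i \in s,\, j \in t} \frac{p_{ij}}{p_i p_j} \\
&= \sum_{i \in s,\, j \in t} \log \frac{p_{ij}}{p_i p_j} \\
&= \sum_{i \in s,\, j \in t} S_{ij}
\end{aligned}$$
The products and sums run over the aligned residue pairs, with i taken from sequence s and j from sequence t.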
The alignment score of two sequences is the log likelihood ratio of the alignment under two models: common ancestry and chance.
PAM and BLOSUM matrices are all log likelihood matrices.
More specifically: with the BLOSUM scaling, an alignment that scores 6 means that the alignment by common ancestry is 2^(6/2) = 8 times as likely as expected by chance.
$$S_{ij} = 2 \log_2 \frac{p_{ij}}{p_i p_j}$$
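Because the BLOSUM score is 2 x log2 of the likelihood ratio, converting in either direction is a one-line calculation. This small sketch only illustrates that arithmetic; the function names are made up for the example.

```python
import math


def likelihood_ratio_from_score(score):
    """Invert S = 2 * log2(ratio): ratio = 2 ** (S / 2)."""
    return 2 ** (score / 2)


def score_from_likelihood_ratio(ratio):
    """BLOSUM-style scaling: S = 2 * log2(ratio)."""
    return 2 * math.log2(ratio)


print(likelihood_ratio_from_score(6))  # 8.0
print(score_from_likelihood_ratio(8))  # 6.0
```

A score of 0 corresponds to a ratio of 1, meaning the alignment is exactly as likely under common ancestry as under chance; negative scores favor chance.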
BLOSUM matrices for Protein
S. Henikoff and J. Henikoff (1992). “Amino acid substitution matrices from protein blocks”. PNAS 89: 10915-10919
Training Data: ~2000 conserved blocks from BLOCKS database. Ungapped, aligned protein segments. Each block represents a conserved region of a protein family
Constructing BLOSUM Matrices of Specific Similarities
Sets of sequences have widely varying similarity. Sequences whose similarity exceeds a threshold are clustered.
If clustering threshold is 62%, final matrix is BLOSUM62
A toy example of constructing a BLOSUM matrix from 4 training sequences
Constructing a BLOSUM matrix:
1. Counting mutations
2. Tallying mutation frequencies
3. Matrix of mutation probs.
4. Calculate abundance of each residue (Marginal prob)
5. Obtaining a BLOSUM matrix
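The five steps above can be sketched in a few lines. The slide's own four training sequences are not preserved in this transcript, so the block used below is hypothetical, as are the function and variable names. The sketch skips the clustering step and counts each aligned pair in both orders, so that the pair probabilities p_ij are symmetric and the marginals p_i follow by summing a row.

```python
import math
from collections import Counter
from itertools import permutations


def blosum_from_block(block):
    """Build a toy BLOSUM-style log-odds table from one ungapped block.

    block: list of equal-length sequences (rows of the block).
    Returns (p_pair, p_single, scores), each keyed by residue or residue pair.
    Pairs never observed in the block get no score (their log-odds is undefined).
    """
    # Steps 1-2: count every ordered pair of residues within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in permutations(column, 2):
            pair_counts[(x, y)] += 1

    # Step 3: turn counts into pair probabilities p_ij.
    total = sum(pair_counts.values())
    p_pair = {pair: n / total for pair, n in pair_counts.items()}

    # Step 4: marginal abundance p_i = sum over j of p_ij.
    p_single = Counter()
    for (x, _), p in p_pair.items():
        p_single[x] += p

    # Step 5: S_ij = 2 * log2(p_ij / (p_i * p_j)).
    scores = {
        (x, y): 2 * math.log2(p / (p_single[x] * p_single[y]))
        for (x, y), p in p_pair.items()
    }
    return p_pair, dict(p_single), scores


# Hypothetical block of 4 training sequences, 2 columns each.
toy_block = ["AC", "AC", "AC", "AA"]
p_pair, p_single, scores = blosum_from_block(toy_block)
for pair in sorted(scores):
    print(pair, round(scores[pair], 2))
```

In this toy block the identical pairs A/A and C/C come out positive and the A/C substitution comes out negative, which is the qualitative pattern a log likelihood matrix should show. Real BLOSUM matrices also round the scores to integers.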
Constructing the real BLOSUM62 Matrix
1.-3. Mutation Frequency Table: entries are $P_{ij} \times 1000$
4. Calculate Amino Acid Abundance
$p_i$: the marginal likelihood of each amino acid
5. Obtaining BLOSUM62 Matrix
$$S_{ij} = 2 \log_2 \frac{p_{ij}}{p_i p_j}$$
PAM Matrices (Point Accepted Mutations)
Mutations accepted by natural selection
PAM Matrices: Accepted Point Mutation
Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1
Based on evolutionary principles
Constructing PAM Matrix: Training Data
PAM: Phylogenetic Tree
PAM: Accepted Point Mutation
Mutability
Total Mutation Rate
is the total mutation rate of all amino acids
Normalize Total Mutation Rate
Mutation Probability Matrix, normalized such that the Total Mutation Rate is 1%
Mutation Probability Matrix (transposed), M x 10000
-- PAM1 mutation prob. matr.
-- PAM2 Mutation Probability Matrix?
-- Mutations that happen in twice the evolution period of that for a PAM1
$M^{(1)}$, $M^{(2)}$
PAM Matrix: Assumptions
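In two PAM1 periods: {A->R} = {A->A and A->R} or {A->N and N->R} or {A->D and D->R} or … or {A->V and V->R}
$$\begin{aligned}
\Pr(A \to R \text{ in 2 periods})
&= \Pr(A \to A \text{ in 1st period}) \Pr(A \to R \text{ in 2nd period}) \\
&+ \Pr(A \to N \text{ in 1st period}) \Pr(N \to R \text{ in 2nd period}) \\
&+ \Pr(A \to D \text{ in 1st period}) \Pr(D \to R \text{ in 2nd period}) + \cdots
\end{aligned}$$
$$P^{(2)}_{AR} = P_{AA} P_{AR} + P_{AN} P_{NR} + P_{AD} P_{DR} + \cdots$$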
Entries in a PAM-2 Mut. Prob. Matr.
PAM-k Mutation Prob. Matrix
$$M^{(2)} = M^{(1)} M^{(1)}$$
$$M^{(K)} = \{M^{(1)}\}^{K}$$
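The PAM-k relation is repeated matrix multiplication, and each entry of the PAM-2 matrix is the sum over intermediate residues shown for A->R. The sketch below uses a hypothetical 3-residue alphabet (A, R, N) with made-up probabilities rather than the real 20 x 20 PAM1 matrix, and it assumes the convention that row = original residue and column = replacement, so each row sums to 1.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam_k(m1, k):
    """Return M^(k) = (M^(1))^k by repeated multiplication."""
    n = len(m1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m1)
    return result


# Hypothetical PAM1-like matrix over residues A, R, N (rows sum to 1).
A, R, N = 0, 1, 2
toy_m1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.02, 0.97],
]

toy_m2 = pam_k(toy_m1, 2)
# P(2)_AR = P_AA * P_AR + P_AR * P_RR + P_AN * P_NR
print(round(toy_m2[A][R], 4))  # 0.0197
```

Rows of every power still sum to 1, since each power is again a matrix of mutation probabilities over a longer period.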
PAM-1 log likelihood matrix
$$S_{ij} = 10 \log_{10} \frac{P^{(1)}_{ij}}{p_i p_j}$$
PAM-k log likelihood matrix
$$S^{(k)}_{ij} = 10 \log_{10} \frac{P^{(k)}_{ij}}{p_i p_j}$$
PAM-250
PAM60: 60%, PAM80: 50%, PAM120: 40%
PAM-250 matrix provides a better scoring alignment than lower-numbered PAM matrices for proteins of 14-27% similarity
Sources of Error in PAM
Comparing Scoring Matrices
PAM
•Based on extrapolation of a small evol. period
•Track evolutionary origins
•Homologous seq.s during evolution
BLOSUM
•Based on a range of evol. periods
•Conserved blocks
•Find conserved domains
Choice of Scoring Matrix
Global Alignment with Affine Gaps
Complex Dynamic Programming
Problem w/ Independent Gap Penalties
The occurrence of x consecutive deletions/insertions is more likely than the occurrence of x isolated mutations.
We should penalize a gap of length x less than x times the penalty for a gap of length one.
Affine Gap Penalty
w2 is the penalty for each gap position
w1 is the _extra_ penalty for the 1st gap position (opening a gap)
A gap of length x therefore costs w1 + x w2.
Scoring Rule not Additive!
We need to know if the current gap is a new gap or the continuation of an existing gap.
Use three Dynamic Programming matrices to keep track of the previous step
S1 is the vertical sequence; S2 is the horizontal sequence
(From Diagonal) a(i,j): current position is a match
(From Left) b(i,j): current position is a gap in S1
(From Above) c(i,j): current position is a gap in S2
Filling the next element in each matrix depends on the previous step, which is stored in the three matrices.
Filling a(i,j), the last step was: a match, a gap in S2, or a gap in S1
Filling c(i,j), the current step is: a new gap in S2, a continued gap in S2, or a gap in S2 following a gap in S1
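The three-matrix scheme can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the slides' own code: it reuses the example rule of 5 per match and -4 per mismatch, takes w1 = 5 and w2 = 2 so that a gap of length x costs w1 + x w2, treats "match" in a(i,j) as "two residues aligned" (match or mismatch), and returns only the optimal score without a traceback.

```python
NEG_INF = float("-inf")


def affine_global_score(s1, s2, match=5, mismatch=-4, w1=5, w2=2):
    """Optimal global alignment score with affine gaps.

    s1 is the vertical sequence (rows i), s2 the horizontal one (columns j).
    a[i][j]: best score with s1[i-1] aligned to s2[j-1]   (from diagonal)
    b[i][j]: best score ending with a gap in s1           (from left)
    c[i][j]: best score ending with a gap in s2           (from above)
    """
    n, m = len(s1), len(s2)
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    a[0][0] = 0
    for j in range(1, m + 1):      # leading gap in s1 of length j
        b[0][j] = -(w1 + w2 * j)
    for i in range(1, n + 1):      # leading gap in s2 of length i
        c[i][0] = -(w1 + w2 * i)

    open_cost = w1 + w2            # first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            # Previous step was a match, a gap in s1, or a gap in s2.
            a[i][j] = sub + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # New gap in s1, continued gap in s1, or gap in s1 after a gap in s2.
            b[i][j] = max(a[i][j - 1] - open_cost,
                          b[i][j - 1] - w2,
                          c[i][j - 1] - open_cost)
            # New gap in s2, continued gap in s2, or gap in s2 after a gap in s1.
            c[i][j] = max(a[i - 1][j] - open_cost,
                          c[i - 1][j] - w2,
                          b[i - 1][j] - open_cost)

    return max(a[n][m], b[n][m], c[n][m])


print(affine_global_score("ACGT", "AGT"))  # 3 matches, one gap of length 1: 15 - 7 = 8
print(affine_global_score("AAAA", "AA"))   # 2 matches, one gap of length 2: 10 - 9 = 1
```

In the second example a single gap of length 2 (cost 9) beats two separate gaps of length 1 (cost 14), which is exactly the behavior the affine penalty is meant to produce.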
Decisions in Seq. Alignment
•Local or global alignment?
•Which program to use
•Type of scoring matrix
•Value of gap penalty
Aij*10
PAM-k log-likelihood matrix