Introduction to Probabilistic Sequence Models: Theory and Applications
David H. Ardell, Forskarassistent
Lecture Outline: Intro. to Probabilistic Sequence Models
Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions
Probabilistic Sequence Models: profiles, HMMs, SCFG
Consensus sequences revisited
Consensus sequences make poor summaries
A T C G
A motif is a short stretch of protein sequence associated with a particular function (R. Doolittle, 1981)
The first described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins
[GA]x(4)GK[ST]
Several databases of such motifs exist, such as BLOCKS, PROSITE, and PRINTS, and there are many tools for searching proteins for matches to them.
Introduction to Regular Expressions (Regexes)
Regular Expressions specify sets of sequences that match a pattern.
Ex: a[bc]a matches "aba" and "aca"
In addition to literals like a and b in the last example, regular expressions provide quantifiers like * (0 or more), + (1 or more), ? (0 or 1) and {N,M} (between N and M):
Ex: a[bc]*a matches "aa", "aba", "acca", "acbcba" etc
As well as grouping constructions like character classes [xy], compound literals like (this)+, and logical relations, like | which means "or" in (this|that)
Anchors match the beginning ^ and end $ of strings
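The constructs listed above can be tried directly in Python's standard `re` module. This is a small sketch; the helper name `matches` and the test strings other than those on the slides are illustrative.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Character class: exactly one b or c between the two a's.
print(matches(r"a[bc]a", "aba"), matches(r"a[bc]a", "aca"))

# * allows zero or more repeats of the class; "ada" is rejected.
candidates = ["aa", "aba", "acca", "acbcba", "ada"]
print([s for s in candidates if matches(r"a[bc]*a", s)])

# + requires at least one repeat, so "aa" no longer matches.
print(matches(r"a[bc]+a", "aa"))

# Grouping with alternation: one or more copies of "this" or "that".
print(matches(r"(this|that)+", "thisthatthis"))

# Anchors: re.search scans the string, ^ and $ pin the match to its ends.
print(re.search(r"^a[bc]a", "xaba") is None)
print(re.search(r"a[bc]a$", "xaba") is not None)
```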
IUPAC DNA ambiguity codes as reg-ex classes
Pyrimidines Y = [CT]
PuRines R = [AG]
Strong S = [CG]
Weak W = [AT]
Keto K = [GT]
aMino M = [AC]
not-A B = [CGT] (B is the letter after A)
not-C D = [AGT] (D is the letter after C)
not-G H = [ACT] (H is the letter after G)
not-T V = [ACG] (V is the letter after T and U)
Any base N = [ACGT]
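Because each ambiguity code is just a character class, an IUPAC string can be expanded mechanically into an ordinary regular expression. The sketch below assumes uppercase DNA input; the function name `iupac_to_regex` and the test sequence are illustrative.

```python
import re

# One regex fragment per IUPAC DNA code, as in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]", "R": "[AG]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Expand every IUPAC code in motif into its character class."""
    return "".join(IUPAC[code] for code in motif.upper())

pattern = iupac_to_regex("TRWWAT")
print(pattern)
print(re.search(pattern, "GGTATAATGG").group())
```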
Regular Expressions are like machines that eat sequences one letter at a time
Ex: a[bc]+a matching "ghstuacbah"
[Figure, animated over several slides: a state machine Begin, a, [bc], a, End, with a [bc] loop and fall-back transitions labeled [^bc] and [^a]. The machine stays at Begin while it eats g, h, s, t, u, then advances on a, c, b, a and reports MATCH!, leaving the final h uneaten.]
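The letter-by-letter behaviour of the machine can be imitated with a few lines of Python. This is a hand-written sketch for the single pattern a[bc]+a, not how a regex engine is actually implemented; it simply restarts from each position of the input, and the state names follow the figure.

```python
import re

def find_match(text):
    """Scan text for a[bc]+a; return (start, end) of the first match or None."""
    for start in range(len(text)):
        state = "Begin"
        for pos in range(start, len(text)):
            ch = text[pos]
            if state == "Begin":
                if ch != "a":
                    break              # stay at Begin, try the next start
                state = "a"
            elif state == "a":
                if ch not in "bc":
                    break              # [^bc]: abandon this attempt
                state = "[bc]"
            elif state == "[bc]":
                if ch == "a":
                    return start, pos + 1   # closing a eaten: MATCH!
                if ch not in "bc":
                    break              # neither [bc] nor a: abandon
    return None

span = find_match("ghstuacbah")
print(span, "ghstuacbah"[span[0]:span[1]])
```

The result agrees with `re.search(r"a[bc]+a", "ghstuacbah").span()`.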
Motifs are almost always either too permissive or too selective
The first described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins
[GA]x(4)GK[ST]
Prob. of this motif ≈ (1/10)(1/20)(1/20)(1/10) = 0.000025
Expected number of matches in a database with 3.2 × 10^8 residues: about 8000!
About half of the proteins that match this motif are not NTPases of the P-loop class. (Lack of specificity)
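The arithmetic above can be checked in a few lines. The sketch assumes, as the slide implicitly does, that all 20 amino acids are equally frequent and that every residue in the database is a possible start position; the function name `class_probability` is illustrative.

```python
from fractions import Fraction

def class_probability(n_allowed, alphabet_size=20):
    """Chance that a random residue falls in a class of n_allowed residues."""
    return Fraction(n_allowed, alphabet_size)

# [GA] x(4) G K [ST]: the four x positions accept anything (probability 1).
motif_probability = (
    class_probability(2)          # [GA]
    * class_probability(20) ** 4  # x(4)
    * class_probability(1)        # G
    * class_probability(1)        # K
    * class_probability(2)        # [ST]
)

database_residues = 320_000_000   # 3.2 x 10^8
expected_matches = motif_probability * database_residues

print(float(motif_probability), int(expected_matches))
```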
Larger and larger alignments of true members of the class give more and more exceptions to the rule (lack of sensitivity)
Extending the rule ([GAT]x(4)[GAF][KTL][STG]) leads to loss of specificity
A better way to model motifs
REGULAR EXPRESSIONS “(TTR[ATC]WT) N{15,22} (TRWWAT)”
Can find alternative members of a class
Treat alternative character states as equally likely.
Treat all spacer lengths as equally likely.
PROFILES (Position-Specific Score Matrices)
Profiles turn alignments into probabilistic models
A graphical view of the same profile:
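CCGTL…
CGHSV…
GCGSL…
CGGTL…
CCGSS…
[Figure: the profile drawn as one column per alignment position, each column listing the residues observed at that position.]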
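A profile of this kind can be sketched by counting residues column by column. The code below uses only the five visible columns of the alignment above (the elided remainder is left out), and the function name `build_profile` is illustrative.

```python
from collections import Counter

# The visible part of the five aligned sequences.
ALIGNMENT = ["CCGTL", "CGHSV", "GCGSL", "CGGTL", "CCGSS"]

def build_profile(rows):
    """Return one {residue: frequency} dict per alignment column."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

for position, column in enumerate(build_profile(ALIGNMENT), start=1):
    print(position, column)
```

Residues that never occur in a column are simply absent here, which is the same as giving them probability zero.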
You can also allow for unobserved residues or bases in a profile by giving them small probabilities:
[Figure: a DNA profile in which every column lists all four bases, with the unobserved ones drawn small.]
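One common way to give unobserved bases a small probability is to add a pseudocount to every count before normalizing. The sketch below assumes a uniform pseudocount of 1 (the slides do not say which scheme was used), and the small DNA alignment is made up for illustration.

```python
def profile_with_pseudocounts(rows, alphabet="ACGT", pseudocount=1.0):
    """Column frequencies with a pseudocount added to every symbol."""
    profile = []
    for column in zip(*rows):
        total = len(column) + pseudocount * len(alphabet)
        profile.append(
            {sym: (column.count(sym) + pseudocount) / total for sym in alphabet}
        )
    return profile

# Hypothetical alignment, not taken from the slides.
rows = ["AAGCT", "AAGCT", "GAGCT", "ATGTT", "AACCA"]

for position, column in enumerate(profile_with_pseudocounts(rows), start=1):
    print(position, {sym: round(p, 2) for sym, p in column.items()})
```

With pseudocounts, no sequence of the right length gets probability exactly zero, so a single unexpected base lowers the score instead of ruling out a match.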
The probability that a sequence matches a profile P is the product of its parts:
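[Figure: profile P drawn as columns of bases labeled with their probabilities. The entries used in the example are A 0.8, A 0.7, G 0.8, C 0.7 and T 0.6 at positions 1 to 5.]
Ex: p(AAGCT | P) = p(A) × p(A) × p(G) × p(C) × p(T) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.188 (truncated to 0.18 on the following slides)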
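The product in the example above can be written as a short function. Only the five profile entries that the example uses are filled in, since the rest of the figure is not recoverable; in this sketch a base missing from a column gets probability 0, and the names are illustrative.

```python
import math

# Entries of profile P used by the AAGCT example, one dict per position.
PROFILE = [{"A": 0.8}, {"A": 0.7}, {"G": 0.8}, {"C": 0.7}, {"T": 0.6}]

def sequence_probability(seq, profile):
    """Multiply the per-position probabilities of seq under profile."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return math.prod(col.get(base, 0.0) for base, col in zip(seq, profile))

print(sequence_probability("AAGCT", PROFILE))
```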
In practice, we compare this probability to that of matching a null model
[Figure: the profile P shown beside a null model in which every position emits G, A, C or T.]
The null model is usually based on overall sequence composition.
Null model: G 0.25, A 0.25, C 0.25, T 0.25 at every position
No positional information need be taken into account.
Example: probabilities of AAGCT with the two models
Profile P: p = 0.18
Null model: p = 0.25^5 = 0.00098
Example: odds ratio of AAGCT with the two models
Profile P: p = 0.18
Null model: p = 0.25^5 = 0.00098
The odds ratio is 0.18 / 0.00098 ≈ 184. AAGCT is about 184 times more probable under the profile than under the null model!
Like with substitution scoring matrices, we prefer the log-odds as a profile score
$$\log_2 \frac{\Pr(\mathrm{AAGCT} \mid P)}{\Pr(\mathrm{AAGCT} \mid \mathrm{null})} = \log_2 \frac{0.18}{0.00098} = \log_2(184) \approx 7.5$$
A positive log-odds (score) indicates a match.
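The log-odds score can also be computed position by position and summed, which is how a position-specific score matrix is normally used. The sketch below reuses the five profile entries from the AAGCT example and a uniform null model; because it does not round the intermediate probability to 0.18, it gives an odds ratio of about 193 and a score of about 7.6 bits instead of 184 and 7.5.

```python
import math

PROFILE = [{"A": 0.8}, {"A": 0.7}, {"G": 0.8}, {"C": 0.7}, {"T": 0.6}]
NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_score(seq, profile, null):
    """Sum of log2(profile probability / null probability) over positions."""
    return sum(
        math.log2(col[base] / null[base]) for base, col in zip(seq, profile)
    )

score = log_odds_score("AAGCT", PROFILE, NULL)
odds = 2 ** score

print(round(odds, 1), round(score, 2))
```

Summing logs avoids multiplying many small probabilities, which matters for long profiles.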
Digression: interpreting BLAST results
The bit score is a scaled log-odds of homology versus chance
The E-value is the expected number of chance hits with scores of at least S
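The two quantities are linked by the standard Karlin-Altschul relations: a raw score S is converted to bits as S' = (λS − ln K) / ln 2, and the expected number of chance hits is E = m · n · 2^(−S'), where m and n are the query and database lengths. The sketch below assumes these textbook formulas without BLAST's length corrections; the function names and the example numbers are illustrative.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits using the parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, query_length, database_length):
    """Expected number of chance hits scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)

# Illustrative numbers only: a 250-residue query against 3.2 x 10^8 residues.
for bits in (20, 30, 40, 50):
    print(bits, expect_value(bits, 250, 320_000_000))
```

Each additional bit of score halves the E-value, which is why the bit score reads as a log-odds.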
A better way to model motifs
REGULAR EXPRESSIONS “(TTR[ATC]WT) N{15,22} (TRWWAT)”
Can find alternative members of a class
Treat alternative character states as equally likely.
Treat all spacer lengths as equally likely.
PROFILES (Position-Specific Score Matrices)
Turn a multiple sequence alignment into a multidimensional (by position) multinomial distribution.
Explicit accounting of observed character states
Cannot handle gaps (separate models must be made for different spacer lengths; O’Neill and Chiafari 1989)
Can't be used to make alignments
Hidden Markov Models
A Hidden Markov Model is a machine that can either parse or emit a family of sequences according to a Markov model
The same symbols can put the machine in different states, (A,C,T,G can be in a promoter, a codon, a terminator, etc.) so we say the states are “hidden”
Example: The Dice Factory
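FAIR die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
BIASED die: P(1) = 3/6; P(2) = P(3) = P(4) = P(5) = P(6) = 1/10
Transitions: FAIR to FAIR 0.99, FAIR to BIASED 0.01, BIASED to BIASED 0.70, BIASED to FAIR 0.30
Rolls: ...11452161621233453261432152211121611112211...
[Figure: the GENERATED and PREDICTED die (hidden state) for each roll, aligned beneath the rolls.]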
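The dice factory can be simulated and decoded in a few dozen lines. The sketch below uses the emission and transition probabilities given above, generates rolls with their hidden states, and predicts the states with the Viterbi algorithm in log space. The 50/50 start distribution is an assumption (the slide gives none), and the slide does not say which decoding algorithm produced its PREDICTED row.

```python
import math
import random

STATES = ("F", "B")                      # F = fair die, B = biased die
START = {"F": 0.5, "B": 0.5}             # assumption: not given on the slide
TRANS = {"F": {"F": 0.99, "B": 0.01}, "B": {"F": 0.30, "B": 0.70}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "B": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
}

def generate(n, seed=0):
    """Emit n rolls and the hidden state path that produced them."""
    rng = random.Random(seed)
    state = rng.choices(STATES, weights=[START[s] for s in STATES])[0]
    rolls, path = [], []
    for _ in range(n):
        faces = list(EMIT[state])
        weights = [EMIT[state][f] for f in faces]
        rolls.append(rng.choices(faces, weights=weights)[0])
        path.append(state)
        state = rng.choices(STATES, weights=[TRANS[state][s] for s in STATES])[0]
    return rolls, "".join(path)

def viterbi(rolls):
    """Most probable hidden state path for the observed rolls."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):   # trace back from the last roll
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

rolls, generated = generate(60, seed=1)
print("".join(map(str, rolls)))
print(generated)          # GENERATED
print(viterbi(rolls))     # PREDICTED
```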
A Profile HMM is a profile with gaps
[Figure, built up over four slides: the profile columns alone, then the same profile with insertions, with deletions, and finally with both insertions and deletions.]
The HMMer Null Model (composition of insertions may be set by user, e.g. to match genome)
G 0.25, A 0.25, C 0.25, T 0.25
The Plan 7 architecture in HMMer
Permit local matches to sequence
Permit repeated matches to sequence
Permit local matches to model
HMMer2 (pronounced 'hammer', as in, “Why BLAST if you can hammer?”)
The HMMer2 design separates models from algorithms
With the same alignment or model design, you can easily change the search algorithm (encoded in the HMM) to do:
Multihit Global alignments of model to sequence
Multihit Smith-Waterman (local with respect to both model and sequence, multiple non-overlapping hits to sequence allowed)
Single (best) hit variants of both of the above.
This separation of model from algorithm provides a ready framework for sequence analysis (programs provided in HMMer)
hmmalign Align sequences to an existing model.
hmmbuild Build a model from a multiple sequence alignment.
hmmcalibrate Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).
hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.
hmmemit Emit sequences probabilistically from a profile HMM.
hmmfetch Get a single model from an HMM database.
hmmindex Index an HMM database.
hmmpfam Search an HMM database for matches to a query sequence.
hmmsearch Search a sequence database for matches to an HMM.
HMMer2 format can be automatically converted for use with SAM