Welcome to Introduction to Computational Genomics for Infectious Disease
![Page 1: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/1.jpg)
Welcome to Introduction to Computational Genomics for Infectious Disease
![Page 2: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/2.jpg)
Course Instructors
• Instructor
James Galagan
• Teaching Assistants
• Lab Instructors
Brian Weiner Desmond Lun
Antonis Rokas Mark Borowsky Jeremy Zucker
Reinhard Engels Aaron Brandes Caroline Colijn
Other members of Broad Microbial Analysis Group
![Page 3: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/3.jpg)
Schedule and Logistics
• Lectures
– Tues/Thurs 11-12:30
– Harvard School of Public Health: FXB-301
– The François-Xavier Bagnoud Center, Room 301
• Labs
– Wed/Fri 1-3
– Broad Institute: Olympus Room
– First floor of Broad Main Lobby
– See front desk attendant near entrance
– Individual computers and software provided
– No programming experience required
![Page 4: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/4.jpg)
Website
www.broad.mit.edu/annotation/winter_course_2006/
• Contact information
• Directions to Broad
• Lecture slides
• Lab handouts
• Resources
![Page 5: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/5.jpg)
Goals of Course
• Introduce the concepts behind commonly used computational tools
• Recognize connections between different concepts and applications
• Gain hands-on experience with computational analysis
![Page 6: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/6.jpg)
Concepts and Applications
• Lectures will cover concepts
– Computationally oriented
• Labs will provide opportunity for hands-on application of tools
– Nuts and bolts of running tools
– Application of tools not covered in lectures
![Page 7: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/7.jpg)
Computational Genomics Overview
Slide Credit: Manolis Kellis
![Page 8: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/8.jpg)
Topics
1. Probabilistic Sequence Modeling
2. Clustering and Classification
3. Motifs
4. Steady State Metabolic Modeling
![Page 9: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/9.jpg)
Topics Not Covered
• Sequence Alignment
• Phylogeny (maybe in labs)
• Molecular Evolution
• Population Genetics
• Advanced Machine Learning
– Bayesian Networks
– Conditional Random Fields
![Page 10: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/10.jpg)
Applications to Infectious Disease
• Examples and labs will focus on the analysis of microbial genomics data
– Pathogenicity islands
– TB expression analysis
– Antigen prediction
– Mycolic acid metabolism
• But approaches are applicable to any organism and to many different questions
![Page 11: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/11.jpg)
Probabilistic Modeling of Biological Sequences
Concepts
• Statistical Modeling of Sequences
• Hidden Markov Models
Applications
• Predicting pathogenicity islands
• Modeling protein families
Lab Practical
• Basic sequence annotation
![Page 12: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/12.jpg)
Probabilistic Sequence Modeling
• Treat objects of interest as random variables– nucleotides, amino acids, genes, etc.
• Model probability distributions for these variables
• Use probability calculus to make inferences
![Page 13: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/13.jpg)
Why Probabilistic Sequence Modeling?
• Biological data is noisy
• Probability provides a calculus for manipulating models
• Not limited to yes/no answers – can provide “degrees of belief”
• Many common computational tools based on probabilistic models
![Page 14: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/14.jpg)
Sequence AnnotationGCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGTTCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTTGCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCCCCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACCTGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCCGCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACCGGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCGACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTGTACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCGTATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTGGTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTCATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAATGATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTGGCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTCGCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGATATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAGGTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATCGAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGATCCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
![Page 15: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/15.jpg)
Sequence Annotation
Gene
![Page 16: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/16.jpg)
Sequence Annotation
Gene
Promoter Motif
Kinase Domain
![Page 17: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/17.jpg)
Probabilistic Sequence Modeling
• Hidden Markov Models (HMM)
– A general framework for sequences of symbols (e.g. nucleotides, amino acids)
– Widely used in computational genomics
1. Hmmer – HMMs for protein families
2. Pathogenicity Islands
![Page 18: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/18.jpg)
Pathogenicity Islands
• Clusters of genes acquired by horizontal transfer
– Present in pathogenic species but not others
• Frequently encode virulence factors
– Toxins, secondary metabolites, adhesins
• (Flanked by repeats, gene content, phylogeny, regulation, codon usage)
• Different GC content than rest of genome
[Figure: GC content along the genome of Neisseria meningitidis, 52% G+C (from Tettelin et al. 2000. Science)]
![Page 19: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/19.jpg)
Application: Bacillus subtilis
![Page 20: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/20.jpg)
Modeling Sequence Composition
• Calculate sequence distribution from known islands
– Count occurrences of A, T, G, C
• Model islands as nucleotides drawn independently from this distribution
Island model, $P(S_i \mid MP)$, applied at every position: A: 0.15, T: 0.13, G: 0.30, C: 0.42
... C C T A A G T T A G A G G A T T G A G A ...
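The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `nucleotide_frequencies` and the two toy "known island" sequences are hypothetical and are not taken from the course data.

```python
from collections import Counter

def nucleotide_frequencies(island_sequences):
    """Estimate P(A), P(T), P(G), P(C) by counting over all known islands."""
    counts = Counter()
    for seq in island_sequences:
        counts.update(seq)                      # count every nucleotide
    total = sum(counts[b] for b in "ATGC")
    return {b: counts[b] / total for b in "ATGC"}

# Hypothetical toy islands, used only to exercise the function.
freqs = nucleotide_frequencies(["GCCGCA", "CCGTGC"])
print(freqs)
```

With real data the input would be the annotated island sequences, and the resulting dictionary plays the role of the island model MP.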
![Page 21: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/21.jpg)
The Probability of a Sequence
• Can calculate the probability of a particular sequence (S) according to the pathogenicity island model (MP)
$$P(S \mid MP) = P(S_1, S_2, \ldots, S_N \mid MP) = \prod_{i=1}^{N} P(S_i \mid MP)$$
Example
S = AAATGCGCATTTCGAA (6 A, 4 T, 3 G, 3 C)
$$P(S \mid MP) = P(A)^6 \, P(T)^4 \, P(G)^3 \, P(C)^3 = (0.15)^6 (0.13)^4 (0.30)^3 (0.42)^3 \approx 6.5 \times 10^{-12}$$
Island model: A: 0.15, T: 0.13, G: 0.30, C: 0.42
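As a quick check of the product formula, here is a minimal Python sketch that multiplies the per-nucleotide island probabilities for the example sequence. The nucleotide counts are taken directly from the string S (6 A, 4 T, 3 G, 3 C); the function name is an illustrative choice.

```python
from collections import Counter
from math import prod

# Island nucleotide distribution (model MP) from the slide.
ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}

def sequence_probability(seq, model):
    """P(S | model) when every position is drawn independently from `model`."""
    return prod(model[base] for base in seq)

S = "AAATGCGCATTTCGAA"
counts = Counter(S)                      # nucleotide counts read off the string
p = sequence_probability(S, ISLAND)
print(dict(counts), p)
```

Raw probabilities shrink quickly with sequence length, which is one practical reason to work with log scores instead.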
![Page 22: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/22.jpg)
Sequence Classification
PROBLEM: Given a sequence, is it an island?
– We can calculate P(S|MP), but what value of P is sufficient?
SOLUTION: compare to a null model and calculate the log-likelihood ratio
– e.g. background DNA distribution model, B
Pathogenicity Islands (MP): A: 0.15, T: 0.13, G: 0.30, C: 0.42
Background DNA (B): A: 0.25, T: 0.25, G: 0.25, C: 0.25
$$\text{Score} = \log\frac{P(S \mid MP)}{P(S \mid B)} = \log\prod_{i=1}^{N}\frac{P(S_i \mid MP)}{P(S_i \mid B)} = \sum_{i=1}^{N}\log\frac{P(S_i \mid MP)}{P(S_i \mid B)}$$
Score Matrix (values correspond to base-2 logarithms): A: -0.73, T: -0.94, G: 0.26, C: 0.74
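The score matrix and the log-likelihood ratio can be reproduced with a short Python sketch. It assumes base-2 logarithms, which is what the matrix values on the slide are consistent with; the function name `log_odds` is illustrative.

```python
from math import log2

ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
BACKGROUND = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# Per-nucleotide score: log2 of island probability over background probability.
SCORE = {b: log2(ISLAND[b] / BACKGROUND[b]) for b in "ATGC"}

def log_odds(seq):
    """Log-likelihood ratio of the island model versus the background model."""
    return sum(SCORE[b] for b in seq)

score = log_odds("AAATGCGCATTTCGAA")
print({b: round(s, 3) for b, s in SCORE.items()}, round(score, 2))
```

A positive score favors the island model and a negative score favors background, so the threshold question becomes a comparison with zero (or with a cutoff chosen on known examples).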
![Page 23: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/23.jpg)
Finding Islands in Sequences
• Could use the log-likelihood ratio on windows of fixed size
– What if islands have variable length?
• We prefer a model for entire sequence
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
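The fixed-window approach can be sketched as follows. The window width of 10 is an arbitrary, hypothetical choice, and base-2 log scores are assumed; the sketch only illustrates why a single fixed width is limiting.

```python
from math import log2

ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
SCORE = {b: log2(p / 0.25) for b, p in ISLAND.items()}   # uniform background assumed

def window_scores(seq, width):
    """Log-likelihood ratio for every window of `width` nucleotides."""
    per_base = [SCORE[b] for b in seq]
    total = sum(per_base[:width])
    scores = [total]
    for i in range(width, len(seq)):
        total += per_base[i] - per_base[i - width]   # slide the window by one position
        scores.append(total)
    return scores

seq = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCAGACGATTGTT"
       "CGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC")
scores = window_scores(seq, width=10)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, seq[best:best + 10], round(scores[best], 2))
```

Every window is scored in isolation, so an island longer or shorter than the chosen width is either split across windows or diluted by flanking background, which motivates a model of the entire sequence.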
![Page 24: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/24.jpg)
A More Complex Model
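Two states, Background and Island, with transition probabilities:
– Background → Background: 0.85
– Background → Island: 0.15
– Island → Background: 0.25
– Island → Island: 0.75
Background emissions: A: 0.25, T: 0.25, G: 0.25, C: 0.25
Island emissions: A: 0.15, T: 0.13, G: 0.30, C: 0.42
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC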
![Page 25: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/25.jpg)
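A Generative Model
[Figure: lattice of hidden labels, B or P at each position, generating the sequence S: C A A A T G C G]
P(S|P): A: 0.42, T: 0.30, G: 0.13, C: 0.15
P(S|B): A: 0.25, T: 0.25, G: 0.25, C: 0.25
$P(L_{i+1} \mid L_i)$:
– $B_i \to B_{i+1}$: 0.85, $B_i \to P_{i+1}$: 0.15
– $P_i \to B_{i+1}$: 0.25, $P_i \to P_{i+1}$: 0.75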
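To make the generative reading concrete, the sketch below samples a label path and a sequence from the two-state model using the emission and transition tables on this slide. Starting in the background state is an assumption, as are the function names and the random seed.

```python
import random

TRANSITION = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMISSION = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def draw(dist, rng):
    """Draw one key from a {outcome: probability} table."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(length, rng, start="B"):
    """Generate (labels, sequence): emit from the current state, then move."""
    labels, seq = [], []
    state = start                      # assumed: the chain starts in background
    for _ in range(length):
        labels.append(state)
        seq.append(draw(EMISSION[state], rng))
        state = draw(TRANSITION[state], rng)
    return "".join(labels), "".join(seq)

rng = random.Random(0)
labels, seq = generate(40, rng)
print(labels)
print(seq)
```

In real data only the second printed line is available; the first line is what decoding tries to recover.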
![Page 26: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/26.jpg)
A Hidden Markov Model
Hidden States: L = { 1, ..., K }
Transition probabilities: $a_{ij}$ = transition probability from state i to state j
Emission probabilities: $e_i(b)$ = P(emitting b | state = i)
Initial state probability: $\pi(b)$ = P(first state = b)
[Figure: State i and State j linked by transition probabilities, with emission probabilities $e_i(b)$ and $e_j(b)$]
![Page 27: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/27.jpg)
What can we do with this model?
The model defines a joint probability over labels and sequences, P(L,S)
Implicit in model is what labels “tend to go” with what sequences (and vice versa)
Rules of probability allow us to use this model to analyze existing sequences
![Page 28: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/28.jpg)
Fundamental HMM Operations
Decoding
• Computation: given an HMM and sequence S, find a corresponding sequence of labels, L
• Biology: annotate pathogenicity islands on a new sequence
Evaluation
• Computation: given an HMM and sequence S, find P(S|HMM)
• Biology: score a particular sequence (not as useful for this model; will come back to this later)
Training
• Computation: given an HMM w/o parameters and set of sequences S, find transition and emission probabilities that maximize P(S | params, HMM)
• Biology: learn a model for sequence composed of background DNA and pathogenicity islands
![Page 29: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/29.jpg)
The Hidden in HMM
• DNA does not come conveniently labeled (e.g. Island, Gene, Promoter)
• We observe nucleotide sequences
• The "hidden" in HMM refers to the fact that state labels, L, are not observed
– Only observe emissions (e.g. nucleotide sequence in our example)
[Figure: State i and State j emitting …A A G T T A G A G…]
![Page 30: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/30.jpg)
“Decoding” With HMM
Pathogenicity Island Example
Given a nucleotide sequence, we want a labeling of each nucleotide as either “pathogenicity island” or “background DNA”
Given observables, we would like to predict a sequence of hidden states that is most likely to have generated that sequence
![Page 31: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/31.jpg)
The Most Likely Path
• Given a sequence, one reasonable choice for a labeling is:
$$L^* = \arg\max_{\text{labels}} P(\text{Labels}, \text{Sequence} \mid \text{Model})$$
The sequence of labels, L*, (or path) that makes the labels and sequence most likely given the model
![Page 32: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/32.jpg)
Probability of a Path, Seq
S: G C A A A T G C
L: B B B B B B B B
$$P(L, S) = P(L_1 \mid L_0)\,P(G \mid B)\,P(B \mid B)\,P(C \mid B)\,P(B \mid B)\,P(A \mid B)\,P(B \mid B) \cdots P(C \mid B) = (0.85)^7 (0.25)^8 \approx 4.9 \times 10^{-6}$$
Every transition is B → B (0.85) and every emission comes from background (0.25); the path is taken to start in B, so $P(L_1 \mid L_0) = 1$.
![Page 33: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/33.jpg)
Probability of a Path, Seq
S: G C A A A T G C
L: B B B P P P B B
$$P(L, S) = P(L_1 \mid L_0)\,P(G \mid B)\,P(B \mid B)\,P(C \mid B)\,P(B \mid B)\,P(A \mid B)\,P(P \mid B) \cdots P(C \mid B) = (0.85)^3 (0.25)^6 (0.75)^2 (0.42)^2 (0.30)(0.15) \approx 6.7 \times 10^{-7}$$
We could try to calculate the probability of every path, but….
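Both path probabilities can be checked with a small Python sketch. It assumes the path always starts in the background state, so the first term contributes a factor of 1, and it uses the island emissions A: 0.42, T: 0.30, G: 0.13, C: 0.15 that the worked numbers rely on. The brute-force search at the end is only feasible because the sequence is eight nucleotides long.

```python
from itertools import product

TRANSITION = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMISSION = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def joint_probability(labels, seq, start="B"):
    """P(labels, seq): product of transition and emission terms along the path."""
    prob = 1.0 if labels[0] == start else 0.0    # assumed: chain starts in `start`
    prev = None
    for state, base in zip(labels, seq):
        if prev is not None:
            prob *= TRANSITION[prev][state]
        prob *= EMISSION[state][base]
        prev = state
    return prob

seq = "GCAAATGC"
all_background = joint_probability("BBBBBBBB", seq)
with_island = joint_probability("BBBPPPBB", seq)

# Exhaustive search over all 2**8 = 256 label paths.
paths = ["".join(p) for p in product("BP", repeat=len(seq))]
best = max(paths, key=lambda labels: joint_probability(labels, seq))
print(all_background, with_island, best, len(paths))
```

The number of paths doubles with every additional nucleotide, so enumeration is hopeless for genome-scale sequences.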
![Page 34: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/34.jpg)
Decoding
• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given sequence and model
$$L^* = \arg\max_{\text{labels}} P(\text{Labels}, \text{Sequence} \mid \text{Model})$$
– Uses dynamic programming (same technique used in sequence alignment)
– Much more efficient than searching every path
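A compact sketch of Viterbi decoding for the two-state island model follows. It works in log space to avoid underflow and assumes the path starts in the background state; the function names and the second test sequence (a run of twelve A's between GC-rich flanks) are illustrative choices.

```python
from math import log

STATES = "BP"
START = {"B": 1.0, "P": 0.0}       # assumed: the path starts in background
TRANSITION = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMISSION = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def safe_log(p):
    return log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Most likely label path for `seq` by dynamic programming."""
    # v[s]: log-probability of the best path that ends in state s at this position
    v = {s: safe_log(START[s]) + log(EMISSION[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda k: v[k] + log(TRANSITION[k][s]))
            ptr[s] = prev
            new_v[s] = v[prev] + log(TRANSITION[prev][s]) + log(EMISSION[s][base])
        backpointers.append(ptr)
        v = new_v
    state = max(STATES, key=v.get)     # best final state
    path = [state]
    for ptr in reversed(backpointers): # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("GCAAATGC"))
print(viterbi("GCGC" + "A" * 12 + "GCGC"))
```

The work grows linearly with sequence length (times the square of the number of states), compared with the exponential number of paths an exhaustive search would visit.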
![Page 35: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/35.jpg)
Probability of a Single Label
• Calculate most probable label, $L^*_i$, at each position i
• Doing this for all N positions gives us $\{L^*_1, L^*_2, L^*_3, \ldots, L^*_N\}$
• e.g. P(Label5 = B | S): sum over all paths that pass through B at position 5
• Forward algorithm (dynamic programming)
[Figure: lattice of B/P labels over S: G C A A A T G C]
![Page 36: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/36.jpg)
Two Decoding Options
• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given sequence and model
$$L^* = \arg\max_{\text{labels}} P(\text{Labels} \mid \text{Sequence}, \text{Model})$$
• Posterior Decoding
– Finds most likely label at each position for all positions, given sequence and model
$\{L^*_1, L^*_2, L^*_3, \ldots, L^*_N\}$
– Forward and Backward equations
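Posterior decoding can be sketched with the forward and backward recursions. The version below works with raw probabilities, which is adequate for short sequences (long sequences would need scaling or log space). It assumes the path starts in background and has no explicit end state; the names and the toy sequence are illustrative.

```python
STATES = "BP"
START = {"B": 1.0, "P": 0.0}       # assumed: the path starts in background
TRANSITION = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMISSION = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def forward(seq):
    """f[i][s] = P(x_1..x_i, state_i = s)."""
    f = [{s: START[s] * EMISSION[s][seq[0]] for s in STATES}]
    for base in seq[1:]:
        prev = f[-1]
        f.append({s: EMISSION[s][base] * sum(prev[k] * TRANSITION[k][s] for k in STATES)
                  for s in STATES})
    return f

def backward(seq):
    """b[i][s] = P(x_{i+1}..x_N | state_i = s)."""
    b = [{s: 1.0 for s in STATES}]
    for base in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANSITION[s][k] * EMISSION[k][base] * nxt[k] for k in STATES)
                     for s in STATES})
    return b

def posterior(seq):
    """P(label_i = s | S) for every position i and state s."""
    f, b = forward(seq), backward(seq)
    total = sum(f[-1].values())        # P(S | model), summed over all paths
    return [{s: f[i][s] * b[i][s] / total for s in STATES} for i in range(len(seq))]

seq = "GCGC" + "A" * 12 + "GCGC"
post = posterior(seq)
labels = "".join(max(STATES, key=p.get) for p in post)
print(labels)
```

Unlike Viterbi, the result comes with a per-position probability, which is what gets plotted when each label's posterior is drawn along the genome.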
![Page 37: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/37.jpg)
Application: Bacillus subtilis
![Page 38: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/38.jpg)
Method
Nicolas et al (2002) NAR
Three State Model
• Gene+
• Gene-
• AT Rich
Second Order Emissions
• $P(S_i) = P(S_i \mid \text{State}, S_{i-1}, S_{i-2})$ (capturing trinucleotide frequencies)
Train using EM
Predict w/ Posterior Decoding
![Page 39: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/39.jpg)
Results
Nicolas et al (2002) NAR
Each line is P(label|S,model), color coded by label
• Gene on positive strand
• Gene on negative strand
• A/T Rich
– Intergenic regions
– Islands
![Page 40: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/40.jpg)
Fundamental HMM Operations
Decoding
• Computation: given an HMM and sequence S, find a corresponding sequence of labels, L
• Biology: annotate pathogenicity islands on a new sequence
Evaluation
• Computation: given an HMM and sequence S, find P(S|HMM)
• Biology: score a particular sequence (not as useful for this model; will come back to this later)
Training
• Computation: given an HMM w/o parameters and set of sequences S, find transition and emission probabilities that maximize P(S | params, HMM)
• Biology: learn a model for sequence composed of background DNA and pathogenicity islands
![Page 41: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/41.jpg)
Training an HMM
Transition probabilities, $P(L_{i+1} \mid L_i)$
– e.g. $P(P_{i+1} \mid B_i)$: the probability of entering a pathogenicity island from background DNA
Emission probabilities, P(S|B) and P(S|P)
– i.e. the nucleotide frequencies for background DNA and pathogenicity islands
![Page 42: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/42.jpg)
Learning From Labelled Data
Maximum Likelihood Estimation!
If we have a sequence that has islands marked, we can simply count
S: G C A A A T G C
L: Start B B B P P P B B End
P(S|B): A: 1/5, T: 0, G: 2/5, C: 2/5
P(S|P): A: , T: , G: , C: (ETC..)
$P(L_{i+1} \mid L_i)$:
– $B_i \to B_{i+1}$: 3/5, $B_i \to P_{i+1}$: 1/5, $B_i \to$ End: 1/5
– $P_i \to B_{i+1}$: 1/3, $P_i \to P_{i+1}$: 2/3, $P_i \to$ End: 0
– Start → B: 1, Start → P: 0, Start → End: 0
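Counting is simple enough to sketch directly. The label path B B B P P P B B used below is an assumption inferred from the transition counts on the slide (five transitions out of B, three out of P), and the explicit Start and End states follow the slide's table; the function name is illustrative.

```python
from collections import Counter

def estimate(labels, seq):
    """Maximum likelihood estimates by counting, with explicit Start and End states."""
    path = ["Start", *labels, "End"]
    trans = Counter(zip(path, path[1:]))     # observed transitions
    emit = Counter(zip(labels, seq))         # observed (state, nucleotide) pairs
    out_of = Counter(path[:-1])              # transitions leaving each state
    in_state = Counter(labels)               # emissions made by each state
    transition = {(a, b): n / out_of[a] for (a, b), n in trans.items()}
    emission = {(s, x): n / in_state[s] for (s, x), n in emit.items()}
    return transition, emission

# Assumed labeling, consistent with the counts shown on the slide.
transition, emission = estimate("BBBPPPBB", "GCAAATGC")
print(transition)
print(emission)
```

Pairs that never occur are simply absent from the dictionaries, which corresponds to an estimated probability of 0; with real data, pseudocounts are often added to avoid such hard zeros.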
![Page 43: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/43.jpg)
Unlabelled Data
How do we know how to count?
S: G C A A A T G C
L: ? (labels are not observed)
P(S|P), P(S|B) and $P(L_{i+1} \mid L_i)$ are all unknown
![Page 44: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/44.jpg)
Unlabeled Data
An idea:
1. Imagine we start with some parameters
2. We could calculate the most likely path, P*, given those parameters and S
3. We could then use P* to update our parameters by maximum likelihood
4. And iterate (to convergence)
Parameter estimates at iterations k = 0, 1, 2, …, K: $P(S \mid P)_k$, $P(S \mid B)_k$, $P(L_{i+1} \mid L_i)_k$
[Figure: the most likely label path over S: G C A A A T G C changing from one iteration to the next]
![Page 45: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/45.jpg)
Expectation Maximization (EM)
1. Initialize parameters
2. E Step: Estimate probability of hidden labels, Q, given parameters and sequence
$$Q^{t} = P(\text{Labels} \mid S, \text{params}^{t-1})$$
3. M Step: Choose new parameters to maximize expected likelihood of parameters given Q
$$\text{params}^{t} = \arg\max_{\text{params}} E_{Q^{t}}\left[\log P(S, \text{labels} \mid \text{params})\right]$$
4. Iterate
P(S|Model) is guaranteed not to decrease from one iteration to the next
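For an HMM, the E and M steps take the form of the Baum-Welch updates. The sketch below is a minimal, unscaled version for the two-state island model, suitable only for short sequences. It assumes a fixed start in background, no end state, a made-up toy sequence, and the earlier slide values as the initial parameters.

```python
STATES = "BP"
BASES = "ATGC"

def forward(seq, trans, emit):
    f = [{s: (1.0 if s == "B" else 0.0) * emit[s][seq[0]] for s in STATES}]  # start in B (assumed)
    for x in seq[1:]:
        prev = f[-1]
        f.append({s: emit[s][x] * sum(prev[k] * trans[k][s] for k in STATES) for s in STATES})
    return f

def backward(seq, trans, emit):
    b = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(trans[s][k] * emit[k][x] * nxt[k] for k in STATES) for s in STATES})
    return b

def em_step(seq, trans, emit):
    """One EM iteration; returns new parameters and P(S | old parameters)."""
    f, b = forward(seq, trans, emit), backward(seq, trans, emit)
    likelihood = sum(f[-1].values())
    # E step: expected transition and emission counts under Q = P(labels | S, params)
    t_count = {s: {k: 0.0 for k in STATES} for s in STATES}
    e_count = {s: {x: 0.0 for x in BASES} for s in STATES}
    for i, x in enumerate(seq):
        for s in STATES:
            e_count[s][x] += f[i][s] * b[i][s] / likelihood
            if i + 1 < len(seq):
                for k in STATES:
                    t_count[s][k] += (f[i][s] * trans[s][k] * emit[k][seq[i + 1]]
                                      * b[i + 1][k] / likelihood)
    # M step: normalize the expected counts into probabilities
    new_trans = {s: {k: t_count[s][k] / sum(t_count[s].values()) for k in STATES}
                 for s in STATES}
    new_emit = {s: {x: e_count[s][x] / sum(e_count[s].values()) for x in BASES}
                for s in STATES}
    return new_trans, new_emit, likelihood

trans = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
emit = {"B": {x: 0.25 for x in BASES}, "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}
seq = "GCGCGGCCAATTAATATTAAGCGCCGGC"      # hypothetical toy sequence
history = []
for _ in range(20):
    trans, emit, likelihood = em_step(seq, trans, emit)
    history.append(likelihood)
print(history[0], history[-1])
```

The recorded likelihoods never go down, but EM only finds a local optimum, so the result depends on the starting parameters.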
![Page 46: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/46.jpg)
Expectation Maximization (EM)
EM is a general approach for learning models (ML estimation) when there is “missing data”
Widely used in computational biology
Remember the basic idea!
1. Use model to estimate (distribution of) missing data
2. Use estimate to update model
3. Repeat until convergence
EM frequently used in motif discovery (Lecture 3)
![Page 47: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/47.jpg)
A More Sophisticated Application
Modeling Protein Families
• Given amino acid sequences from a protein family, how can we find other members?
– Can search databases with each known member, but this is not sensitive
– More information is contained in full set
• The HMM Profile Approach
– Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM
We will learn features from multiple alignments
![Page 48: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/48.jpg)
Human Ubiquitin Conjugating Enzymes
UBE2D2   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1   FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1   FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9    FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N    LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10   FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34    FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1   FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B    FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I    FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5   LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1   FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6   FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H    LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12    VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
![Page 49: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/49.jpg)
Profile HMM
[Figure: profile HMM architecture. Match states Start, M1, …, Mj, …, MN, End; insert states I, I1, …, Ij, …, IN; delete states D1, …, Dj, …, DN. Match and insert states each carry an emission distribution over amino acids. Shown alongside rows E2EPF5, UBE2L1, UBE2L6 and UBE2H of the ubiquitin conjugating enzyme alignment.]
![Page 50: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/50.jpg)
Using Profile HMMs
Decoding
• Computation: find sequence of labels, L, that maximizes P(L|S, HMM)
• Biology: align a new sequence to a protein family
Evaluation
• Computation: find P(S|HMM)
• Biology: score a sequence for membership in family
Training
• Computation: find transition and emission probabilities that maximize P(S | params, HMM)
• Biology: discover and model family structure
![Page 51: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/51.jpg)
Example: Modeling Globins
• Profile HMM from 300 randomly selected globin genes
• Score database of 60,000 proteins
![Page 52: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/52.jpg)
PFAM Collection of Profile HMMs
http://www.sanger.ac.uk/Software/Pfam/
![Page 53: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/53.jpg)
PFAM Resources
• 8957 curated protein families and domains
• Each with HMM profile(s)
• Coverage
– 73% of proteins in Swissprot and SP-TrEMBL
– 53% of “typical” genome sequence
![Page 54: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/54.jpg)
Example PFAM Entry
• Literature Links
• Protein Structure
• Domain Architectures
• GO Functional Categories
Lab 1
![Page 55: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/55.jpg)
HMMER
• Implementation of Profile HMM methods
• Given a multiple alignment, HMMER can build a Profile HMM
• Given a Profile HMM (e.g. from PFAM), HMMER can score sequences for membership in the family or domain
![Page 56: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/56.jpg)
HMMs in Context
• HMMs
– Sequence alignment
– Gene Prediction
• Generalized HMMs
– Variable length states
– Complex emissions models
– e.g. Genscan
• Bayesian Networks
– General graphical model
– Arbitrary graph structure
– e.g. Regulatory network analysis
![Page 57: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/57.jpg)
References
• Sean R. Eddy, “Hidden Markov models,” Current Opinion in Structural Biology, 6:361-365, 1996.
• Sean R. Eddy, “Profile hidden Markov models,” Bioinformatics, 14(9):755-763, 1998.
• Anders Krogh, “An introduction to hidden Markov models for biological sequences,” in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pp. 45-63, Elsevier, 1998.
• HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/
• Erik L. L. Sonnhammer et al., “Pfam: multiple sequence alignments and HMM-profiles of protein domains,” Nucleic Acids Research, 26(1):320-322, 1998.
• R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.
![Page 58: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/58.jpg)
Tomorrow’s Lab
• Basic Sequence Analysis Tools
– Argo Genome Browser
– Blast
– Gene prediction using Glimmer
– Protein families with Hmmer and PFAM
– Comparative synteny analysis
• Identify virulence factors by annotating and comparing virulent and avirulent bacterial sequences
![Page 59: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/59.jpg)
![Page 60: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/60.jpg)
The Hidden in HMM
• DNA does not come conveniently labeled (e.g. Pathogenicity Island, Gene, Promoter)
• All we observe are the nucleotide sequences
• The "hidden" in HMM refers to the fact that the state labels, L, are not observed
– Only observe emissions (e.g. nucleotide sequence in our example)
![Page 61: Welcome to Introduction to Computational Genomics for Infectious Disease](https://reader030.fdocuments.net/reader030/viewer/2022032604/56812b1f550346895d8f1b20/html5/thumbnails/61.jpg)
Relation between Viterbi and Forward
VITERBI
$V_j(i)$ = P(most probable path ending in state j with observation i)
Initialization: $V_0(0) = 1$; $V_k(0) = 0$, for all k > 0
Iteration: $V_j(i) = e_j(x_i) \max_k V_k(i-1)\, a_{kj}$
Termination: $P(x, \pi^*) = \max_k V_k(N)$
FORWARD
$f_l(i) = P(x_1 \ldots x_i, \text{state}_i = l)$
Initialization: $f_0(0) = 1$; $f_k(0) = 0$, for all k > 0
Iteration: $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
Termination: $P(x) = \sum_k f_k(N)\, a_{k0}$
Slide Credit: Serafim Batzoglou
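The two recursions differ only in whether predecessors are combined with a maximum or a sum, which a short Python sketch can make explicit. It uses the two-state island model with an assumed start in background and no explicit end state (equivalent to taking every $a_{k0}$ as 1); the function name `recurse` is illustrative.

```python
STATES = "BP"
START = {"B": 1.0, "P": 0.0}       # assumed: the path starts in background
TRANSITION = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMISSION = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def recurse(seq, combine):
    """Shared recursion: combine=max gives Viterbi, combine=sum gives Forward."""
    v = {s: START[s] * EMISSION[s][seq[0]] for s in STATES}
    for base in seq[1:]:
        v = {s: EMISSION[s][base] * combine(v[k] * TRANSITION[k][s] for k in STATES)
             for s in STATES}
    return combine(v.values())

best_path_prob = recurse("GCAAATGC", max)   # P(x, pi*): probability of the single best path
total_prob = recurse("GCAAATGC", sum)       # P(x): probability summed over all paths
print(best_path_prob, total_prob)
```

Because the sum includes the best path as one of its terms, the Forward value is always at least as large as the Viterbi value.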