Theory and Application of Multiple Sequence Alignments
Brett Pickett, PhD
a.k.a. What is a Multiple Sequence Alignment,
How to Make One, and What to Do With It
History
• Structure of DNA discovered (1953)
• First (phage) genome determined in 1977
• Human genome project begun in 1990
• First free-living organism (Haemophilus influenzae) sequenced in 1995
• Human “Rough draft” completed in 2000
– NHGRI (public) vs. J. Craig Venter (private)
• Used “super” computer to put human genome together in right order
What is a Genome?
• Genetic material required for organism to replicate
– Eukaryotes (Humans): # chromosomes
– Prokaryotes (Bacteria): 1 chromosome
– Viruses: “what’s a chromosome?”
– Human (3.2 Gb): 10 trillion cells in human body × 2 m of DNA per cell
• 780,000 times around Earth
• 67.8 roundtrips to the sun
– Bacteria (580 kb- 10 Mb)
– Virus (3.5 kb – 1.3 Mb)
http://www.rsc.org/chemsoc/timeline/pages/2001.html
Why are Genomes so Important?
• Encode all organismal functions
– DNA -> RNA -> protein
• Unique to each organism
– Find differences (mutations) only by comparing genomes with each other
www.thednastore.com/images/cells/mrdna1.jpg
How are Sequences Made?
1. Make lots of copies of original sequence (PCR)
2. Put the copies into a machine to make even more copies
3. Fluorescent (glow-in-the-dark) bases get incorporated randomly into new DNA molecule
4. Laser detects glowing bases and tells the computer the order of bases = sequence
http://bjpsbiotech.edublogs.org/files/2007/12/electropherogram.jpg
What’s the Next Step?
• After sequence is determined, then what?
• Make sense of it by comparing with other related (homologous) sequences
– Multiple Sequence Alignment
What is an Alignment?
• Lining up related (homologous) positions
– Allows comparison
Unaligned
Aligned
Comparing Sequences (Genomes)
• All DNA contains a unique genetic “fingerprint”
• Similarity reveals
– Related function
– Shared evolutionary history
education.vetmed.vt.edu/.../FINGERPRINT.jpg
Aligning with Computational Methods
• Computers can’t “see” patterns
– Use math to find best alignment by assigning scores
– Match
– Mismatch
– Gap
• Internal – Insertion / deletion (indel)
• Terminal – Missing information?
What is a Gap?
• Allows bases to be lined up even if sequences are different lengths
– Insertions / deletions (indels)
• Impossible to tell which sequence has lost (gained) information
– Terminal gaps
• Sequence is either naturally shorter or artificially cutoff
Mismatches Gaps
Nucleotide Alignment
• Custom Scores
– Match
– Mismatch
– Gap-opening penalty
• Penalized for not having letter (begin a gap)
• Why?
– Gap-extension penalty
• Little or no penalty for lengthening a gap
• Why?
– Scores balance between mismatch & gap
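As a rough illustration of how these four scores combine, the sketch below scores two rows that are already aligned. The function name is hypothetical, and the default values are borrowed from the Manual Alignment slide (Match = 5, Mismatch = -2, Gap Opening = -4, Gap Extension = 0). It assumes "-" marks a gap and, for simplicity, treats any run of consecutive gap columns as a single gap.

```python
def score_alignment(row_a, row_b, match=5, mismatch=-2, gap_open=-4, gap_extend=0):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            # The first gap column pays the opening penalty, later ones the extension penalty.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if a == b else mismatch
            in_gap = False
    return total


print(score_alignment("AATC", "A-TC"))    # one gap column
print(score_alignment("AAATC", "A--TC"))  # the same gap, one column longer
print(score_alignment("AATC", "AGTC"))    # a mismatch instead of a gap
```

With an extension penalty of 0, the one-column and two-column gaps score the same, which is one way to see why lengthening a gap costs little compared with opening it.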
Dynamic Programming
• Used to calculate alignment
– Breaks a very complicated process into smaller steps
– Helps computers to solve the problem faster
Sequence 1
Sequence 2
Math
Read
http://www.myspacepimper.com/images/232763/Disney-s-Goofy-Baking-a-Cake.htm
Manual Alignment
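Each cell: score from left, diagonal, from above -> best score kept
Sequence        A                 A                 T                 C
        0       0                 0                 0                 0
A       0       -4 5 -4 -> 5      1 5 -4 -> 5       1 -2 -4 -> 1      -3 -2 -4 -> -2
T       0       -4 -2 1 -> 1      -3 3 1 -> 3       -1 10 -3 -> 10    6 -1 -6 -> 6
C       0       -4 -2 -3 -> -2    -6 -1 -1 -> -1    -5 1 6 -> 6       2 15 2 -> 15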
Match = 5 Mismatch = -2 Gap Opening = -4 Gap Extension = 0
Traceback: Follow the highest scores back to the beginning
Up or sideways = gap, diagonal = homology (line up)
A A T C
A - T C
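The matrix fill on this slide can be sketched in a few lines. This is a minimal illustration, not a full aligner: the function name is hypothetical, the first row and column are held at 0 as on the slide, and a single penalty of -4 is charged for every gap step, because that is how the candidate scores in the worked table appear to be computed. Traceback is left out.

```python
def fill_matrix(seq_top, seq_side, match=5, mismatch=-2, gap=-4):
    """Fill a dynamic programming score matrix for two sequences."""
    rows, cols = len(seq_side) + 1, len(seq_top) + 1
    # First row and first column stay at 0, as in the worked table.
    scores = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_side[i - 1] == seq_top[j - 1] else mismatch
            from_left = scores[i][j - 1] + gap       # sideways = gap
            diagonal = scores[i - 1][j - 1] + pair   # diagonal = line up two bases
            from_above = scores[i - 1][j] + gap      # up = gap
            scores[i][j] = max(from_left, diagonal, from_above)
    return scores


matrix = fill_matrix("AATC", "ATC")
for label, row in zip(" ATC", matrix):
    print(label, row)
```

Each cell depends only on its three neighbors, which is what lets the large alignment problem be solved as many small steps.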
Computer-Generated Alignment
• Much faster than we are
– 2 GHz = about 2 billion clock cycles per second
– Don’t get tired, make mistakes, or get hand cramps
Alignment Process
Types of Alignment
• Global
– Aligns entire sequence
– Permits gaps
– Forced even if sequences not homologous
• Local
– Aligns longest region possible with minimal (no) gaps
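One common way to compute a local alignment score is a Smith-Waterman-style matrix, in which no cell is allowed to drop below 0 and the best score may sit anywhere in the matrix. The sketch below assumes that approach; the function name and the score values (reused from the Manual Alignment slide, with one penalty per gap step) are illustrative choices.

```python
def best_local_score(seq_a, seq_b, match=5, mismatch=-2, gap=-4):
    """Return the best local alignment score between two sequences."""
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            pair = match if a == b else mismatch
            # The 0 option lets a new local alignment start at any cell.
            cell = max(0, previous[j - 1] + pair, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Only the shared "ACG" region contributes; the unrelated flanks are ignored.
print(best_local_score("TTACGTT", "GGACGGG"))
```

A global alignment, by contrast, would have to account for every base in both sequences, including the unrelated flanks.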
Beware!
• The computer is not always right
– Alignments
• Optimal: highest score
• True: evolutionarily correct
– Can be improved
• Hard for computer to accurately place indels (gaps)
– Apply prior knowledge: codons
Nucleotide Sequence vs. Amino Acid Sequence
AAA CCC -> Lys Pro
AA- ACC C -> ??? Thr ? (??? = Asn or Lys)
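The codon example can be checked with a small sketch. The codon table below is deliberately partial (only the codons needed here), and both function names are hypothetical. The second function flags gap runs whose length is not a multiple of three, a simple hint that a gap may be misplaced in a protein-coding region.

```python
import re

# Partial codon table: only the codons needed for this example.
CODONS = {"AAA": "Lys", "AAG": "Lys", "AAT": "Asn", "AAC": "Asn",
          "CCC": "Pro", "ACC": "Thr"}


def translate(aligned):
    """Translate an aligned nucleotide row three columns at a time."""
    residues = []
    for start in range(0, len(aligned), 3):
        codon = aligned[start:start + 3]
        # Codons broken by a gap or cut short cannot be translated.
        residues.append(CODONS.get(codon, "???"))
    return residues


def gaps_keep_frame(aligned):
    """True if every run of gaps has a length divisible by three."""
    return all(len(run) % 3 == 0 for run in re.findall(r"-+", aligned))


print(translate("AAACCC"))          # intact reading frame
print(translate("AA-ACCC"))         # single-base gap shifts the frame
print(gaps_keep_frame("AA-ACCC"))
print(gaps_keep_frame("AAA---CCC"))
```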
BLAST
• Basic Local Alignment Search Tool
– Most frequently used alignment tool
– Local alignment of 1 sequence (query) against all known sequences (subjects) in database
• Uses a “heuristic” to reduce number of sequences it actually has to align
– Like using “Google” to find most homologous sequences
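To give a feel for this kind of shortcut, the sketch below indexes every short word (k-mer) in a toy database and keeps only the subjects that share at least one word with the query. This is a simplified illustration of word-based filtering, not BLAST's actual implementation; the function names, sequence names, and the word size k = 4 are assumptions made for the example.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidates(query, index, k):
    """Rank subjects by how many query words they share; others are skipped."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits, key=lambda name: (-hits[name], name))


# Hypothetical toy database.
database = {
    "subject1": "ACGTACGTGA",
    "subject2": "TTTTTTTTTT",
    "subject3": "GGACGTAC",
}
index = build_index(database, k=4)
print(candidates("ACGTAC", index, k=4))  # subject2 shares no words and is never aligned
```

Only the surviving candidates would then go on to the slower, full local alignment step.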
BLAST Input
BLAST Output
How Does This Impact Me?
• Human Microbiome project
– Sequence all bacteria in intestines
• Millions of bacteria in each gram of excrement
– Which ones make us sick? How different is flora between people?
• Ocean Virus Metagenomics project
– Try to get an idea of virus diversity across the globe
• Boat goes around N.A. collecting samples
– Billions of viruses in each gallon of seawater
How Does This Impact Me (cont’d)?
• Used to take swabs, grow colonies on agar
– Antimicrobial resistance in turkeys
• Sequencing removes middle step
• How to quickly assign genus and species to new sequences?
– BLAST
• Project: New Phage from ponds
Other Uses for Alignments
SNP Detection
• Single Nucleotide Polymorphism
– Genetic changes occurring in at least one sequence
– May have biological significance
• Antibiotic resistance
• Changes could avoid detection by immune system
• Cause of genetic disease, e.g. cystic fibrosis (CF)
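Once sequences are aligned, candidate SNPs can be found by scanning the alignment column by column. The sketch below is a minimal version under simple assumptions: the function name is hypothetical, gaps ("-") are ignored, and any column with more than one base is reported, with no check on how common each variant is.

```python
def variable_columns(alignment):
    """Return (position, bases) for columns where the aligned sequences disagree."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    snps = []
    for pos in range(length):
        bases = {row[pos] for row in alignment} - {"-"}  # ignore gaps
        if len(bases) > 1:
            snps.append((pos + 1, "".join(sorted(bases))))  # 1-based position
    return snps


# Hypothetical aligned sequences.
alignment = ["ACGTA", "ACGTA", "ACCTA", "A-GTT"]
print(variable_columns(alignment))
```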
Phylogenetic Trees
• Computer generated by:
– Examining alignment
– Looking for shared mutations
• Show relationship(s) between sequences
– History of sequences
• Where they came from
• Genetic changes that have occurred
Tree figure (iOS Phylogram App, free): leaves CY065067, CY061195, CY065107, GU562458, CY065059, CY098563, CY098130, CY065011, CY061578; labeled parts: Clade, Node, Leaf, Branch
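One simple quantity a tree-building program can take from an alignment is the fraction of compared positions at which two sequences differ. The sketch below computes that pairwise distance; it is only one possible input to tree building, not necessarily the method behind the tree on this slide, and the function and sequence names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ, skipping columns with a gap."""
    compared = differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differences += x != y
    return differences / compared if compared else 0.0


# Hypothetical aligned sequences.
aligned = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "TCGAACGA"}
for left, right in combinations(aligned, 2):
    print(left, right, p_distance(aligned[left], aligned[right]))
```

Sequences separated by small distances share more of their history and would tend to sit closer together on the tree.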
Recombination
• Can occur in all types of organisms
– Eukaryotes
– Prokaryotes
– Viruses
• May change characteristic of organism
– Make you sick (or not)
– Not recognized by immune system
– Fast way of getting lots of genetic changes
Breakpoint
RdRP
Genome 1
Genome 2
Daughter Sequence
Major Parent
Minor Parent
Reassortment
• Chromosomes (segments) from one organism replace those from another
– May change characteristic of organism
• Make you sick (or not)
• Not recognized by immune system
• Fast way of getting lots of genetic changes
Other Analysis Options
• Align Sequences
• Look for genetic changes (genotype) that are associated with traits (phenotype)
– Host
– How sick it makes you
– Drug resistance
– Inherited disease
• Do any mutations consistently accompany the traits?
– Genome-Wide Association Studies
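The question of whether a mutation accompanies a trait can be sketched as a simple tally: for one alignment column, count how often each base occurs with each trait. The function name, isolate names, and trait labels below are hypothetical, and a real association study would add a statistical test on top of counts like these.

```python
from collections import defaultdict


def trait_by_base(alignment, traits, position):
    """Count trait labels for each base seen at one 1-based alignment column."""
    counts = defaultdict(lambda: defaultdict(int))
    for name, seq in alignment.items():
        counts[seq[position - 1]][traits[name]] += 1
    return {base: dict(by_trait) for base, by_trait in counts.items()}


# Hypothetical isolates and phenotypes.
alignment = {"iso1": "ACGT", "iso2": "ACGT", "iso3": "ATGT", "iso4": "ATGT"}
traits = {"iso1": "susceptible", "iso2": "susceptible",
          "iso3": "resistant", "iso4": "resistant"}
print(trait_by_base(alignment, traits, position=2))
```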
http://lovestats.wordpress.com/dman/
How Does an Alignment Get a Score?
• Amino acids
– Identical >> Similar >> Dissimilar
Score Lookup Table (Matrix)
Symmetrical Positive Scores on Diagonal (Matches)
Some Mismatches get Negative Scores
Some Mismatches don’t
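A score lookup table can be sketched as a dictionary keyed by amino acid pairs. The values below are illustrative only and should not be read as a published matrix; the function names are hypothetical. The lookup is symmetric, matches on the diagonal are positive, and one mismatch between similar residues (I and L) still scores above zero.

```python
# Illustrative scores only; each unordered pair is stored once.
PAIR_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,   # identical residues
    ("I", "L"): 2,                                 # similar residues
    ("D", "L"): -4, ("D", "I"): -3,                # dissimilar residues
}


def lookup(a, b):
    """Symmetric lookup: the score for (a, b) equals the score for (b, a)."""
    key = (a, b) if (a, b) in PAIR_SCORES else (b, a)
    return PAIR_SCORES[key]


def score_columns(row_a, row_b):
    """Sum the lookup score of every aligned pair (no gaps in this sketch)."""
    return sum(lookup(a, b) for a, b in zip(row_a, row_b))


print(score_columns("LID", "IID"))  # similar first residue
print(score_columns("LID", "DID"))  # dissimilar first residue
```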