Algorithms for Alignment of Genomic Sequences
Michael Brudno
Department of Computer Science, Stanford University
PGA Workshop 07/16/2004
Conservation Implies Function
Exon
Gene
CNS: Other Conserved
Edit Distance Model (1)
Weighted sum of insertions, deletions & mutations to transform one string into another
AGGCACA--CA      AGGCACACA
|  ||||  ||  or  |  ||  ||
A--CACATTCA      ACACATTCA
Edit Distance Model (2)
Given: x, y
Define: F(i,j) = Score of best alignment of x1…xi to y1…yj
Recurrence: F(i,j) = max( F(i-1,j) - GAP_PENALTY,
                          F(i,j-1) - GAP_PENALTY,
                          F(i-1,j-1) + SCORE(xi, yj) )
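A minimal Python sketch of this recurrence follows. The boundary conditions (a prefix aligned against nothing costs one gap per character) and the match/mismatch values standing in for SCORE are assumptions for illustration; the slide does not specify them.

```python
def global_alignment_score(x, y, gap_penalty=2, match=1, mismatch=-1):
    """Fill the table F from the recurrence and return F(len(x), len(y))."""
    def score(a, b):
        # Assumed SCORE function: reward identity, penalize substitution.
        return match if a == b else mismatch

    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundaries: aligning a prefix to the empty string is all gaps.
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,                   # gap in y
                F[i][j - 1] - gap_penalty,                   # gap in x
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # (mis)match
            )
    return F[n][m]
```

For example, global_alignment_score("ACGT", "AGT") aligns three matching letters around one gap.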
Edit Distance Model (3)
F(i,j) = Score of best alignment ending at i,j
Time O(n^2) for two seqs, O(n^k) for k seqs
[Figure: DP matrix cell F(i,j) and its neighbors F(i,j-1), F(i-1,j-1), F(i-1,j)]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Overview
• Local Alignment (CHAOS)
• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment
• Glocal Alignment (Shuffle-LAGAN)
• Biological Story
Local Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
F(i,j) = max (F(i,j), 0)
Return all paths with a position (i,j) where F(i,j) > C
Time O(n^2) for two seqs, O(n^k) for k seqs
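The local variant can be sketched in Python as below. The scoring values are assumed for illustration, and instead of tracing full paths the sketch only reports the cells (i,j) whose score exceeds the threshold C.

```python
def local_hits(x, y, C, gap_penalty=2, match=1, mismatch=-1):
    """Return (best_score, cells), where cells lists every (i, j) with F(i,j) > C."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, cells = 0, []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch  # assumed scores
            F[i][j] = max(
                0,                        # restart: never carry a negative score
                F[i - 1][j] - gap_penalty,
                F[i][j - 1] - gap_penalty,
                F[i - 1][j - 1] + s,
            )
            best = max(best, F[i][j])
            if F[i][j] > C:
                cells.append((i, j))
    return best, cells
```

Each reported cell is the end point of a local alignment; a traceback from that cell until the score reaches 0 would recover the path.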
Heuristic Local Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
[Figure panels: BLAST, FASTA]
CHAOS: CHAins Of Seeds
1. Find short matching words (seeds)
2. Chain them
3. Rescore chain
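Step 1 can be illustrated with a short Python sketch. It assumes seeds are exact matches of a fixed word length k found through a hash index of seq2; the function name and the default k are hypothetical, and the actual seed definition used by CHAOS may be more permissive.

```python
from collections import defaultdict

def find_seeds(seq1, seq2, k=4):
    """Return (i, j) pairs such that seq1[i:i+k] == seq2[j:j+k]."""
    index = defaultdict(list)
    # Index every k-long word of seq2 by its start position.
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    seeds = []
    # Scan seq1 left to right and look each word up in the index.
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds
```

Seeds come out ordered by their position in seq1, which is the order in which a chaining step would consume them.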
CHAOS: Chaining the Seeds
• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal
• Pick a previous seed that maximizes the score of chain
[Figure: seq1 vs. seq2 plot showing the seed at the current location in seq1, the search box bounded by the distance cutoff and the gap cutoff, and the range of search]
Time O(n log n), where n is number of seeds.
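The chaining loop can be sketched in Python as follows. This is an illustration under assumptions, not the CHAOS implementation: seeds are (i, j) start positions of k-long matches, a chain's score is the matched length minus the diagonal shift between consecutive seeds, and a plain sorted list queried with bisect stands in for the diagonal-indexed structure (its insertions are linear, so it does not reach the O(n log n) bound, and old seeds are never evicted).

```python
import bisect

def chain_seeds(seeds, k, distance_cutoff=20, gap_cutoff=5):
    """Return (best_score, chain) for a list of (i, j) seeds of length k."""
    seeds = sorted(seeds)                 # process in order of location in seq1
    score = [k] * len(seeds)              # a seed alone scores its own length
    prev = [None] * len(seeds)
    by_diagonal = []                      # sorted (diagonal, seed index) pairs
    for n, (i, j) in enumerate(seeds):
        d = i - j
        # Range query: earlier seeds whose diagonal is within the gap cutoff.
        lo = bisect.bisect_left(by_diagonal, (d - gap_cutoff, -1))
        hi = bisect.bisect_right(by_diagonal, (d + gap_cutoff, len(seeds)))
        for pd, m in by_diagonal[lo:hi]:
            pi, pj = seeds[m]
            gap_i, gap_j = i - (pi + k), j - (pj + k)
            # Search box: predecessor ends before this seed, not too far back.
            if 0 <= gap_i <= distance_cutoff and gap_j >= 0:
                candidate = score[m] + k - abs(d - pd)
                if candidate > score[n]:
                    score[n], prev[n] = candidate, m
        bisect.insort(by_diagonal, (d, n))
    if not seeds:
        return 0, []
    end = max(range(len(seeds)), key=score.__getitem__)
    best, chain = score[end], []
    while end is not None:                # follow back-pointers to the start
        chain.append(seeds[end])
        end = prev[end]
    return best, chain[::-1]
```

Replacing the sorted list with a balanced tree or skip list keyed by diagonal, and dropping seeds that fall behind the distance cutoff, would be the natural way to approach the stated running time.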
CHAOS Scoring
• Initial score = # matching bp - gaps
• Rapid rescoring: extend all seeds to find optimal location for gaps
Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
[Figure labels: x, y, z]
LAGAN
1. Find Local Alignments
2. Chain Local Alignments
3. Restricted DP
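Step 2 can be sketched in Python as a search for the highest-scoring chain of local alignments that is increasing in both sequences. The tuple layout and the quadratic loop are assumptions chosen for brevity; the slides do not say how LAGAN implements this step. Step 3 would then run the dynamic program only in the area around the returned chain.

```python
def chain_anchors(anchors):
    """anchors: list of (x_start, y_start, x_end, y_end, score) tuples.

    Returns the highest-scoring chain in which each anchor starts
    at or after the end of the previous one in both sequences.
    """
    if not anchors:
        return []
    anchors = sorted(anchors)
    best = [a[4] for a in anchors]        # best chain score ending at each anchor
    back = [None] * len(anchors)
    for n, (xs, ys, xe, ye, s) in enumerate(anchors):
        for m in range(n):
            pxe, pye = anchors[m][2], anchors[m][3]
            # The predecessor must end before this anchor begins in x and y.
            if pxe <= xs and pye <= ys and best[m] + s > best[n]:
                best[n], back[n] = best[m] + s, m
    end = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(anchors[end])
        end = back[end]
    return chain[::-1]
```

An anchor that conflicts with the chain (for example one that jumps ahead in y and then blocks later anchors) is simply left out.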
MLAGAN: 1. Progressive Alignment
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
[Figure: phylogenetic tree over Human, Baboon, Mouse, Rat]
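The order of pairwise alignments implied by a tree can be sketched in Python with a post-order traversal. The tree shape ((Human, Baboon), (Mouse, Rat)) used in the usage note is an assumption, since the extracted slide text lists the species but not the topology, and the actual LAGAN call is replaced by a label for the merged alignment.

```python
def progressive_order(tree):
    """tree: a species name (str) or a pair of subtrees.

    Returns the list of pairwise alignment steps, children before parents.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node
        left, right = (visit(child) for child in node)
        steps.append((left, right))       # here a real pipeline would align
        return f"({left}/{right})"        # label for the resulting alignment

    visit(tree)
    return steps
```

Calling progressive_order((("Human", "Baboon"), ("Mouse", "Rat"))) yields Human with Baboon, then Mouse with Rat, then the two resulting alignments with each other.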
MLAGAN: 2. Multi-anchoring
To anchor the (X/Y) and (Z) alignments:
[Figure labels: XZ, YZ, X/Y, Z]
Cystic Fibrosis (CFTR), 12 species
• Human sequence length: 1.8 Mb
• Total genomic sequence: 13 Mb
Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish
CFTR (cont’d)
Method   Species            % Exons Aligned   TIME (sec)   MAX MEMORY (Mb)
LAGAN    Mammals            99.7%             550          90
LAGAN    Chicken & Fishes   96%               862          90
MLAGAN   Mammals            99.8%             4547         670
MLAGAN   Chicken & Fishes   98%
Automatic computational system for comparative analysis of pairs of genomes
http://pipeline.lbl.gov
Alignments (all pair combinations):
Human Genome (Golden Path Assembly)
Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)
Rat assemblies: January 2003, February 2003
D. melanogaster vs D. pseudoobscura, February 2003
Tandem Local/Global Approach
• Finding a likely mapping for a contig (BLAT)
Progressive Alignment Scheme
[Flowchart; boxes in reading order, yes/no branch labels omitted]
• Human, Mouse and Rat genomes
• Pairwise M/R mapping
• Aligned M&R fragments / Unaligned M&R sequences
• Map to Human Genome: mapping aligned fragments by union of M&R local BLAT hits on the human genome
• H/M/R MLAGAN alignment
• M/R pairwise alignment
• M/H and R/H pairwise alignment
• Unassigned M&R DNA fragments
Computational Time
PC cluster of 23 dual 2.2 GHz Intel Xeon nodes.
Pair-wise rat/mouse – 4 hours
Pair-wise rat/human and mouse/human – 2 hours
Multiple human/mouse/rat – 9 hours
Total wall time: ~ 15 hours
Distribution of Large Indels
[Figure: histogram of large indels; x-axis: indel length (100 to 550), y-axis: count (0 to 200)]
Evolution Over a Chromosome
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
Local & Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
[Figure panels: Local, Global]
Glocal Alignment Problem
Find least cost transformation of one sequence into another using new operations
• Sequence edits
• Inversions
• Translocations
• Duplications
• Combinations of above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Shuffle-LAGAN
A glocal aligner for long DNA sequences
S-LAGAN
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
Building the Homology Map
Chain (using Eppstein-Galil); each alignment gets a score which is the MAX over 4 possible chains.
Penalties are affine (event and distance components)
Penalties:
a) regular
b) translocation
c) inversion
d) inverted translocation
[Figure: local alignments labeled a, b, c, d, matching the four penalty types]
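The idea of chaining with four join types and affine penalties can be sketched in Python as below. Everything specific here is an assumption for illustration: the Hit record, the event and distance costs, the rule used to classify a join, and the quadratic loop used in place of the Eppstein-Galil sparse dynamic programming named on the slide. The chain is required to be ordered in the first sequence only, which is what lets rearrangements in the second sequence into the map.

```python
from collections import namedtuple

# Hypothetical record: coordinates in both sequences, strand, alignment score.
Hit = namedtuple("Hit", "x0 x1 y0 y1 strand score")

# Assumed affine penalty: a fixed event cost plus a cost per unit of distance.
EVENT_COST = {"regular": 0, "translocation": 30,
              "inversion": 20, "inverted translocation": 40}
DISTANCE_COST = 0.01

def join_type(p, c):
    """Classify the join from hit p to hit c (simplified rule)."""
    in_order = c.y0 >= p.y1
    if p.strand == c.strand:
        return "regular" if in_order else "translocation"
    return "inversion" if in_order else "inverted translocation"

def glocal_chain(hits):
    """Return the best-scoring chain of hits, ordered in the first sequence."""
    if not hits:
        return []
    hits = sorted(hits, key=lambda h: h.x0)
    best = [h.score for h in hits]
    back = [None] * len(hits)
    for n, c in enumerate(hits):
        for m in range(n):
            p = hits[m]
            if p.x1 > c.x0:
                continue                  # hits must not overlap in sequence 1
            distance = (c.x0 - p.x1) + abs(c.y0 - p.y1)
            penalty = EVENT_COST[join_type(p, c)] + DISTANCE_COST * distance
            candidate = best[m] + c.score - penalty
            if candidate > best[n]:
                best[n], back[n] = candidate, m
    end = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(hits[end])
        end = back[end]
    return chain[::-1]
```

With these costs, a strong reverse-strand hit lying between two forward hits is kept in the chain as an inversion rather than discarded, as a strictly monotone global chain would do.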
S-LAGAN Results (CFTR)
[Figure panels: Local, Glocal]
S-LAGAN Results (CFTR)
[Figure panels: Hum/Mus, Hum/Rat]
S-LAGAN Results (IGF cluster)
S-LAGAN Results (HOX)
• 12 paralogous genes
• Conserved order in mammals
S-LAGAN Results (Chr 20)
• Human Chr 20 v. homologous Mouse Chr 2.
• 270 Segments of conserved synteny
• 70 Inversions
S-LAGAN Results (Whole Genome)
            LAGAN     S-LAGAN
Total       37%       38%
Exon        93%       96%
Ups200      78%       81%
CPU Time    350 Hrs   450 Hrs
• Used Berkeley Genome Pipeline
• % Human genome aligned with mouse sequence
• Evaluation criteria from Waterston et al. (Nature 2002)
Rearrangements in Human v. Mouse
Preliminary conclusions:
• Rearrangements come in all sizes
• Duplications are less well conserved than other rearranged regions
• Simple inversions tend to be most common and most conserved
What is next? (Shuffle)
• Better algorithm and scoring
• Whole genome synteny mapping
• Multiple Glocal Alignment(!?)
Biological Story
• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development
Align Human, Mouse, Rat & Fugu
Detailed Alignment
hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001
mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001
rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938
fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001
mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001
rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938
fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174
Can we align human & fly???
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
 GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
Putting it all together
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
 GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
Acknowledgments
Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott
Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan
Berkeley: Inna Dubchak, Alexander Poliakov
Göttingen: Burkhard Morgenstern
Rat Genome Sequencing Consortium
http://lagan.stanford.edu/