Comparative genomics - Fordham Universitycschweikert/cisc4020/Lecture 8.pdf · Comparative genomics...
Comparative genomics
Chapter 15
Features that are investigated include
1. Genome size variation
2. Base composition biases
3. Gene order
   1. Colinearity
   2. Identification of orthologs
   3. Functional predictions
4. Whole genome alignments
   1. Identify genes
   2. Improve gene annotation
   3. Identify regulatory regions (phylogenetic footprinting)
5. Evolution of genomes
Comparative genomics Genome Size
• There are huge differences in genome size among related organisms
• More complex organisms do not necessarily have larger genomes
• “C-value paradox”
• Repetitive elements are responsible for a large proportion of the differences
• The amount of DNA affects cell size, speed of the cell cycle, and other traits
Lungfish 140 Gb!
Comparative genomics Aligning Genomic sequences
Dotplots to compare orthologous BACs: Barley (horizontal) vs. rice (vertical)
[Figure: dotplot of Barley BAC 635P2 (horizontal) vs. Rice BAC 36I05 (vertical). Genes labeled G 1, G 2, G 3, G 4a, G 4b, and G 5; other elements labeled Ta to Ti. Dubcovsky et al. Plant Physiology. 125:1342-1353]
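A dotplot like the one above can be sketched in a few lines. This is a minimal illustration, not the program used for the slide: the function names `dotplot` and `render`, the word size, and the toy sequences are all assumptions. A dot is placed wherever a short word from the horizontal sequence matches a word in the vertical sequence exactly, so conserved segments show up as diagonals.

```python
def dotplot(seq_x, seq_y, word=4):
    """Return the set of (x, y) positions where a word of length `word`
    starting at x in seq_x is identical to the word starting at y in seq_y."""
    # Index every word of the horizontal sequence by its start positions.
    index = {}
    for x in range(len(seq_x) - word + 1):
        index.setdefault(seq_x[x:x + word], []).append(x)
    # Look up every word of the vertical sequence in that index.
    dots = set()
    for y in range(len(seq_y) - word + 1):
        for x in index.get(seq_y[y:y + word], []):
            dots.add((x, y))
    return dots


def render(dots, len_x, len_y):
    """Draw the dots as text: one row per position in the vertical sequence."""
    rows = []
    for y in range(len_y):
        rows.append("".join("*" if (x, y) in dots else "." for x in range(len_x)))
    return "\n".join(rows)


if __name__ == "__main__":
    horizontal = "ACGTACGT"   # toy stand-in for the barley sequence
    vertical = "TTACGTTT"     # toy stand-in for the rice sequence
    print(render(dotplot(horizontal, vertical), len(horizontal), len(vertical)))
```

Real BAC comparisons use longer words and also compare the reverse complement, which this sketch leaves out.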
What have we learned?
Dubcovsky and Dvorak 2007, Science 316: 1862-1866
• Small groups of genes are dispersed within a sea of repetitive elements
• Rapid divergence in intergenic regions: comparison of orthologous A and Am VRN2 regions (A genome: durum; Am genome: T. monococcum)
  • 1.1 MYA divergence
  • 31% similarity
• Fast decay of % similarity is also affecting genes
• Fast rate of gene deletions and inactivation by retro insertions
• Small phenotypic changes in polyploid wheat. Valuable source of diversity
[Figure: Decay of synteny in intergenic regions. C(t) (0.0 to 1.0) vs. million years (0 to 12), with points for A-Am, A-B-D wheat, and wheat-barley. For comparison, human vs. chimp: 7 mya, 3% divergence. Panels A to E; scale bar 20 kb]
Comparative genomics Composition bias
Effect of repetitive elements in GC content
• Grass genes 51% GC, Arabidopsis genes 44% GC
• The overall GC content in wheat (46%) is 2% higher than in Arabidopsis or rice
• The GC content of barley and wheat is very similar to the GC content of the most abundant retroelements.
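GC percentages such as those quoted above are simple base counts. A minimal sketch follows; the function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for illustration.

```python
def gc_content(seq):
    """Percent G+C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no unambiguous bases, nothing to measure
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


if __name__ == "__main__":
    print(gc_content("ATGCGCNNAT"))
```

Running the same function separately on gene sequences and on retroelement sequences is one way to see how much each class contributes to the overall genome composition.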
Comparative genomics Gene order
Different organisms show different rates of chromosome rearrangements
Comparative maps of the wheat genome described in terms of the rice genome (A) and the Aegilops umbellulata genome (B). (A) Data from Kurata et al. (1994), Van Deynze et al. (1995). (B) Data from Zhang et al. (1998).
A. Wheat - rice: 50 million years
B. Wheat - Aegilops: 5 million years
Human – Mouse (80 million years): many rearrangements (180 blocks) but some large conserved regions!
Human – Chimpanzee (5 million years): almost completely colinear.
• Allows transfer of genetic mapping data between organisms.
• Allows functional analysis of human genes via precise deletion of their colinear regions in the mouse genome
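Counting colinear blocks, as in the human-mouse comparison above, can be illustrated with a toy model. The sketch below assumes each gene in genome A has exactly one ortholog in genome B and that `order_b[i]` gives the rank of that ortholog along genome B; the function name and this input format are hypothetical. A block is a maximal run of genes whose orthologs are adjacent in B, in the same (+1) or inverted (-1) order.

```python
def colinear_blocks(order_b):
    """Split genes 0..n-1 of genome A into colinear blocks.

    order_b[i] is the position in genome B of the ortholog of gene i.
    Returns a list of (first_gene, last_gene, orientation) tuples, where
    orientation is +1 for same order and -1 for an inverted block.
    """
    blocks = []
    if not order_b:
        return blocks
    start, direction = 0, 0  # direction 0 means "not yet known"
    for i in range(1, len(order_b)):
        step = order_b[i] - order_b[i - 1]
        if step in (1, -1) and direction in (0, step):
            direction = step  # still adjacent and in a consistent direction
        else:
            blocks.append((start, i - 1, direction or 1))
            start, direction = i, 0
    blocks.append((start, len(order_b) - 1, direction or 1))
    return blocks


if __name__ == "__main__":
    # Genes 3..6 of genome A are inverted in genome B.
    for block in colinear_blocks([0, 1, 2, 6, 5, 4, 3, 7, 8]):
        print(block)
```

Real comparisons must also cope with missing orthologs and duplicated genes, which is why published block counts depend on the method used.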
Homology search for the mouse genome
• Homology search of all genes in the mouse genome
• 27% in other metazoans
• 29% in other eukaryotes
• 6% in other chordates
• 14% in other mammals
• Less than 1% rodent specific
[Figure: pie chart with segments Eukaryote, Metazoan, Chordate specific, Mammal specific, Rodent specific, Other]
Comparative genomics Aligning Genomic sequences
Dotplots to discover duplications in rice
• A large genome duplication occurred approximately 70 million years ago, before the divergence of most grass subfamilies.
• Diploid grasses are actually ancient polyploids
• Figure shows the arrangement of duplicated protein-encoding genes in rice in the order found in the current sequence assembly. Both X and Y axes represent 45,174 genes in their chromosomal order.
• Colors indicate same (red) or opposite (green) transcriptional orientations.
• Differential Gene Loss Contributes to apparent Incongruities in Comparative Grass Genomics.
•Paterson et al. 2004 PNAS
• A similar result has been described for Arabidopsis suggesting that this species is also an ancient polyploid
Comparative Genomics. Gene Annotation
Though most human and mouse orthologs have similar structure, some differences are sometimes observed:
• Comparison of the human and mouse spermidine synthase genes revealed an additional intron in the human gene that is not found in the mouse homolog
[Figure: gene structure of human vs. mouse spermidine synthase; scale 5,500 bp]
Algorithms for Aligning Genomic sequences
Finding orthologous regions between two genomes is nontrivial
Existing methods based on dynamic programming algorithms (e.g. Needleman and Wunsch / Smith and Waterman) or hashing (e.g. BLAST / FASTA) rapidly run out of memory.
Even Megablast and BLAT cannot handle large genome comparisons
Specific programs
• BLASTZ
• LAGAN
• AVID
BLASTZ (http://zpicture.dcode.org/, see zpicture.pdf)
• Local alignment program. Aligns 2 Mb in <1 min.
• 1st masks regions in seq. 1 repeated in seq. 2
• Searches for stretches of 19 bp with 12 matches (1 transition OK)
• After initial match a gap-free extension is performed until a cumulative score reaches a threshold (3000)
• If threshold is met, then realigned, now allowing gaps
• Alignments with scores >5000 move to next phase
• Scores are calculated using refined substitution matrices based on aligned human-mouse non-coding
• Connect individual alignments separated by <50 kb
[Figure: substitution matrices used by BLAST (BLASTN) and BLASTZ. BLASTZ matrix based on observed substitutions in aligned conserved regions of mouse and human. BLASTZ aligned mouse sequences to 40% of the human genome.]
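The gap-free extension step listed for BLASTZ can be illustrated with a toy version. This is a sketch under stated assumptions, not BLASTZ itself: it scores +1 per match and -1 per mismatch instead of the human-mouse substitution matrix, and it stops extending once the running score falls more than `x_drop` below the best score seen (a common stopping rule assumed here, not taken from the slides). The function name and all parameters are hypothetical.

```python
def ungapped_extend(a, b, i, j, seed_len, match=1, mismatch=-1, x_drop=2):
    """Extend the seed a[i:i+seed_len] / b[j:j+seed_len] in both directions
    without gaps. Returns (score, (a_start, a_end), (b_start, b_end)) with
    half-open coordinates of the best-scoring ungapped segment found."""

    def extend(step, ia, jb):
        # Walk along the diagonal; remember the best score and how far it reached.
        score = best = best_len = length = 0
        while 0 <= ia < len(a) and 0 <= jb < len(b):
            score += match if a[ia] == b[jb] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score has dropped too far below the best, give up
            ia += step
            jb += step
        return best, best_len

    seed_score = sum(
        match if a[i + k] == b[j + k] else mismatch for k in range(seed_len)
    )
    right_score, right_len = extend(+1, i + seed_len, j + seed_len)
    left_score, left_len = extend(-1, i - 1, j - 1)
    total = seed_score + left_score + right_score
    return (
        total,
        (i - left_len, i + seed_len + right_len),
        (j - left_len, j + seed_len + right_len),
    )


if __name__ == "__main__":
    score, span_a, span_b = ungapped_extend("TTACGTACGTAA", "GGACGTACGTCC", 4, 4, 4)
    toy_threshold = 6  # stands in for the score threshold on the slide
    print(score, span_a, span_b, score >= toy_threshold)
```

Only segments whose score passes the threshold would go on to the slower gapped realignment, which is what keeps the overall search fast.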
ZPicture http://zpicture.dcode.org/
zPicture is a pair-wise alignment and visualization tool that compares two sequences using the local alignment program BlastZ
Multiple sequences: http://zpicture.dcode.org/multiz.php
[Figure: zPicture alignment Rice (AY013245) vs. Barley (AY013246) [no gene annotation provided so all in red]. Labels: Gene 1 4548-7287; Gene 2 8663-13534 (R); Gene 3 14166-22601; Gene 4 24136-29057; Repeat]
Comparative genomics LAGAN Limited Area Global Alignment of Nucleotides
http://lagan.stanford.edu/lagan_web/index.shtml
LAGAN: Global pair-wise and multiple alignment of finished sequences.
• Detects closely and distantly related sequences
• If some of the sequences are in a draft format your query will be redirected to AVID
• Multiple alignment will be visualized by VISTA
• This is the only alignment program available through the VISTA server that produces true multiple alignments.
• LAGAN performs better than BLASTZ for distantly related organisms
• Shuffle-LAGAN: detects rearrangements and inversions
AVID
AVID: global pair-wise alignment.
• Fast alignment of large seq.
• Detects weak homologies
• One of the sequences should be finished, but all others can be either finished or in a draft format.
• For all finished sequences in the set, AVID generates all-against-all pair-wise alignments
• Draft sequences are aligned to the finished sequence.
• View using VISTA
• Similar to LAGAN in that it uses anchors
[Figure: AVID anchoring. Clean matches (1st) and repeat matches; set of non-overlapping, non-crossing matches (red); recursion: repeat for each inter-anchor region from the previous step]
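Selecting a set of non-overlapping, non-crossing matches to use as anchors can be sketched as a chaining problem. The code below is a simplified stand-in and not AVID's actual anchor-selection procedure: it assumes each match is an exact, equal-length hit given as `(start_a, start_b, length)` and picks the chain with the largest total length by dynamic programming. The function name and input format are hypothetical.

```python
def chain_anchors(matches):
    """Pick the heaviest chain of non-overlapping, non-crossing matches.

    matches: list of (start_a, start_b, length) tuples with length > 0.
    Returns the chosen matches in order along sequence A.
    """
    ms = sorted(matches)
    if not ms:
        return []
    best = [m[2] for m in ms]   # best chain weight ending at each match
    prev = [-1] * len(ms)       # back-pointer to the previous match in the chain
    for k, (a_k, b_k, len_k) in enumerate(ms):
        for p in range(k):
            a_p, b_p, len_p = ms[p]
            # p may precede k only if it ends before k starts in BOTH sequences.
            ends_before = a_p + len_p <= a_k and b_p + len_p <= b_k
            if ends_before and best[p] + len_k > best[k]:
                best[k] = best[p] + len_k
                prev[k] = p
    # Trace back from the match that ends the heaviest chain.
    k = max(range(len(ms)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(ms[k])
        k = prev[k]
    return chain[::-1]


if __name__ == "__main__":
    hits = [(0, 0, 10), (12, 50, 8), (15, 12, 20), (40, 35, 5), (30, 5, 6)]
    print(chain_anchors(hits))
```

In an anchor-based aligner, the same selection would then be repeated inside each region between consecutive anchors, as the recursion step in the figure describes.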
VISTA
• Option to use AVID or LAGAN.
• If one of the sequences is not finished, uses AVID.
• Can align large sequence to genome assemblies
• Can perform multiple alignments
• Can enter GenBank accessions
• Accepts annotation files
• Window size (default 100 bp) and minimum conservation to show (default 50%) can be adjusted.
• Aligns in forward orientation
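The window size and minimum conservation settings can be understood through a sliding-window identity calculation over a pairwise alignment. The sketch below is an illustration only; how VISTA computes its curves in detail is not given in the slides, and the function name, the gap character "-", and the treatment of gap columns as non-identical are assumptions.

```python
def conserved_windows(aln_a, aln_b, window=100, min_conservation=50.0):
    """Slide a window along two aligned sequences (same length, '-' for gaps)
    and return (start, percent_identity) for every window whose identity is
    at least min_conservation percent."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as identical only if both rows carry the same base.
    identical = [x == y and x != "-" for x, y in zip(aln_a, aln_b)]
    hits = []
    for start in range(len(identical) - window + 1):
        pct = 100.0 * sum(identical[start:start + window]) / window
        if pct >= min_conservation:
            hits.append((start, pct))
    return hits


if __name__ == "__main__":
    # Tiny window so the toy example fits on one line.
    print(conserved_windows("ACGTACGTAC", "ACGTTTTTAC", window=4, min_conservation=75.0))
```

A larger window smooths the curve and hides short conserved elements, while a smaller window is noisier, which is why both settings are adjustable.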
Wheat BAC (AF459639) against the rice genome annotation
http://genome.lbl.gov/vista/index.shtml
MULTI-LAGAN http://lagan.stanford.edu/lagan_web/index.shtml
MULTI-LAGAN: similar to ClustalW
It produces dynamic alignments and also PDF of the alignments presented here.
Any of the genomes can be used as reference
In the alignment between rice-barley-wheat it fails to detect the inverted Gene 2 in barley.
It produces an evolutionary tree for the submitted sequences
[Figure: MULTI-LAGAN alignments of rice, wheat, and barley (barley view and rice view). Gene labels: Gene 1 2563-6313 (6 exons); Gene 1 4548-7287 (6 exons); Gene 2 8663-13534 (13 exons, missed barley); Gene 2 30279-34393 (13 exons, missed, reverse); Gene 3 14166-22601 (18 exons); Gene 3 64887-73061 (18 exons); Gene 4 24136-29057 (18 exons); Gene 4a 74857-80916 (18 exons); Gene 4b 84895-88925 (18 exons)]
Shuffle-LAGAN
http://genome.lbl.gov/cgi-bin/VistaInput?align_pgm=slagan&num_seqs=2
[Figure: Shuffle-LAGAN alignment, barley view. Gene 1 (6 exons); Gene 2 (13 exons, reverse); Gene 3 (18 exons); Gene 4a (18 exons); Gene 4b (18 exons)]
Different levels of conservation
Precomputed genomic alignments
Major sources of precomputed whole-genome alignments
• Ensembl http://www.ensembl.org/index.html [human-chimp-mouse-rat-chicken-fugu-zebrafish-Drosophila-C. elegans]
• VISTA browser http://genome.lbl.gov/vista/index.shtml [human-chimp-mouse-rat-chicken] MULTI-LAGAN
• UCSC Genome Browser http://genome.ucsc.edu/ [human-chimp-mouse-rat-chicken-fugu] BLASTZ
[Figure: browser tracks. VISTA: Chimp, Mouse, Rat, Chicken, Conservation. UCSC: Chimp, Mouse, Rat, Chicken, Conservation, Rep. Masker]
Using multiple species for phylogenetic footprinting
• The location of regulatory sequences can also be found by comparing several related sequences
• Multiple alignments performed
• Better able to home in on important regions
• Conservation alone not enough; need to validate importance of elements
Phylogenetic Footprinting
A large portion of the genome contains regulatory information
Cis-regulatory elements:
• Basal transcription machinery binding sites (TATA)
• Enhancers
[Figure: 5’ TATA 3’]
Programs to detect conserved regulatory elements
• FootPrinter
• rVISTA
• Enhancers: http://enhancer.lbl.gov/
Identification of conserved regions: phylogenetic footprint
FootPrinter http://genome.cs.mcgill.ca/cgi-bin/FootPrinter3.0/FootPrinterInput2.pl
• Useful to identify putative transcription factor binding sites
• Fig: Upstream regions of 4 Beta-globin genes, prealigned with ClustalW (Motif size=12, Max # of mutations=0)
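With the maximum number of mutations set to 0, the search for a phylogenetic footprint reduces to finding motifs that occur unchanged in every sequence. The sketch below shows only that special case and is not FootPrinter's algorithm, which also takes the phylogenetic tree into account; the function name and the toy sequences are hypothetical.

```python
def shared_motifs(seqs, k=12):
    """Return motifs of length k that occur exactly in every sequence.

    The result maps each motif to the position of its first occurrence
    in each input sequence, in input order.
    """
    if not seqs:
        return {}
    # Start with all k-mers of the first sequence, then keep only those
    # that are also present in each of the remaining sequences.
    common = {seqs[0][i:i + k] for i in range(len(seqs[0]) - k + 1)}
    for s in seqs[1:]:
        common &= {s[i:i + k] for i in range(len(s) - k + 1)}
    return {motif: [s.find(motif) for s in seqs] for motif in sorted(common)}


if __name__ == "__main__":
    upstream = ["GGTATAAACC", "CTATAAAGGG", "TTTTATAAAT"]  # toy upstream regions
    for motif, positions in shared_motifs(upstream, k=4).items():
        print(motif, positions)
```

Overlapping hits such as these would normally be merged into a single conserved element, and, as noted earlier, conservation alone does not prove that the element is functional.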