Orthology predictions for whole mammalian genomes
![Page 1: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/1.jpg)
Orthology predictions for whole mammalian genomes
Leo Goodstadt
MRC Functional Genomics Unit
Oxford University
![Page 2: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/2.jpg)
Finishing
“Evolution of Orthologues”
Selection pressures in orthologues and paralogs
“Gene Duplications”
Reproduction, immunity or chemosensation
“Synonymous substitution rates”
Mutation and selection vary by chromosome size
“Gene birth in the human lineage”
Ongoing duplications underlie polymorphism
![Page 4: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/4.jpg)
How it started
We are “consumers” of orthology / paralogy
Started off using Ensembl predictions
Ensembl 1:1 covered 50% of predicted mouse genes.
Ewan’s manual survey said 80%
![Page 5: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/5.jpg)
Paralogues evolve fast (and are fun!)
1) General observations for all mammalian genomes
![Page 6: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/6.jpg)
[Figure: lineage-specific dN/dS (y-axis, 0.00 to 0.14) by species (x-axis), grouped into Drosophila, Nematodes and Amniotes]
2) Observations for whole clades of species
![Page 7: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/7.jpg)
3) Inparalogues define lineage specific biology
Marsupial / Monodelphis biology revealed by lineage specific genes
• Chemosensation (OR, V1R and V2R)
• Reproduction (vomeronasal receptors, lipocalins, β-microseminoprotein (12:1))
• Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor chains and carcinoembryonic antigen-related cell adhesion molecules)
• Detoxification (pancreatic RNAses, hypoxanthine phosphoribosyltransferase homologues; nitrogen poor diets)
• KRAB Zn fingers
![Page 8: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/8.jpg)
4) Interesting stories in the aggregate
![Page 9: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/9.jpg)
5) Treasure trove in the details
Ongoing mouse inparalogue analysis: lots and lots of reproductive genes
clade: #2 (ortholog_id = 17117 in panda)
159 mus genes
47 genes new to assembly 36
10 genes completely new to assembly 36
Interpro matches for this clade:
!!! Expansion mainly on chr5 and 14, although single (pseudogene?) versions on chr13 and chr16.
!!! Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)
gene identifier  order  chrm  exons  stop  length
MUS_GENE_21705  6639  5  spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 ; ENSMUSP00000086007  4  182
MUS_GENE_22420  6643  5  predicted gene, EG623898 ; ENSMUSP00000099126  2  72  <
MUS_GENE_19599  6646  5  spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5 ; NCBIMUSP_83776567  4  157  <
MUS_GENE_23688  6651  5  predicted gene, EG623898 ; ENSMUSP00000094421  2  72
MUS_GENE_19774  6657  5  spermatogenesis associated glutamate (E)-rich protein 3 ;
![Page 10: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/10.jpg)
Secretoglobin Protein Family members: Androgen-binding proteins. Emes et al. (2004) Genome Res. 14(8):1516-29
6) Candidates for evolutionary and functional analyses
![Page 11: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/11.jpg)
Available Genomes and Divergences
Hedges, SB. Nature Reviews Genetics 3, 838-849 (2002)
![Page 12: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/12.jpg)
How do we find function in the genome?
• Nothing in Biology Makes Sense Except in the Light of Evolution. Theodosius Dobzhansky (1900-1975).
![Page 13: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/13.jpg)
How to find the function in the genome?
Similar Sequences
Common Ancestry (homology)
Similar Structures / Folds
Similar Functions ?
(Genes / Genome regions)
![Page 14: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/14.jpg)
How much of the genome is functional? Compare with the mouse
[Figure: conservation score distributions for ARs and for the whole genome]
Ancestral Repetitive (AR) sequence is non-functional and has evenly distributed conservation scores (red) (symmetrical, bell shaped due to biological variation)
Whole Genome sequence contains some functional sequence under selection and thus has a small excess of conserved sequence under purifying selection (asymmetrical)
Functional sequence = Whole Genome - Ancestral Repetitive = 5%
N.B. This is an estimate that doesn’t take into account sequence
• Turning over rapidly (not shared by mouse/human)
• Under positive (diversifying) selection
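The subtraction on this slide can be sketched numerically. The following is a minimal illustration only, assuming both distributions are available as histograms of conservation scores over the same bins; the function name and the toy counts are hypothetical and are not the data behind the 5% figure.

```python
def excess_conserved_fraction(ar_counts, genome_counts):
    """Fraction of whole-genome windows in excess of the neutral (AR) expectation.

    Both arguments are histograms (counts per conservation-score bin) over
    identical bins. Each is normalised to fractions, and the bins where the
    genome exceeds the AR distribution are summed.
    """
    if len(ar_counts) != len(genome_counts):
        raise ValueError("histograms must share the same bins")
    ar_total = sum(ar_counts)
    genome_total = sum(genome_counts)
    excess = 0.0
    for ar, genome in zip(ar_counts, genome_counts):
        diff = genome / genome_total - ar / ar_total
        if diff > 0:
            excess += diff
    return excess


# Toy histograms (invented): the last two bins hold the conserved excess.
toy_ar = [10, 80, 10, 0, 0]
toy_genome = [95, 760, 95, 30, 20]
toy_fraction = excess_conserved_fraction(toy_ar, toy_genome)
```

With these toy counts the excess is 5% of windows. As the slide notes, an estimate of this kind misses sequence that is turning over rapidly or is under positive selection.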
![Page 15: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/15.jpg)
The human genome (euchromatic sequence)
Unknown (old repetitive junk?)
Protein coding: 1.2%
UTR: 0.3%
Repeats (transposable elements, …): ~45%
Conserved non-coding (3.5% ?)
Neutral
![Page 16: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/16.jpg)
Conserved non-coding material
• Transcription factor binding sites
• Enhancers, insulators and other non-transcribed regulatory elements
• Alternative splicing signals
• Transfer RNAs, ribosomal RNAs
• Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs): regulatory / gene silencing / RNA degradation
• MacroRNAs (e.g. Xist): enzymatic? / chromosome inactivation
![Page 17: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/17.jpg)
Functional parts of genes are highly conserved
![Page 18: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/18.jpg)
How many protein coding genes?
• Walter Gilbert [1980s] 100k
• Antequera & Bird [1993] 70-80k
• John Quackenbush et al. (TIGR) [2000] 120k
• Ewing & Green [2000] 30k
• Tetraodon analysis [2001] 35k
• Human Genome Project (public) [2001] ~31k
• Human Genome Project (Celera) [2001] 24-40k
• Mouse Genome Project (public) [2002] 25-30k
• Lee Rowen [2003] 25,947
• Human Genome Project (finishing) [2004] 20-25k
• Current predictions [2008] 19-20k
![Page 19: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/19.jpg)
Traditional Genome Orthology
Reciprocal BLAST best hits between longest transcript of each gene (+ synteny)
Assumes:
• Protein similarity is proportional to evolutionary distance (selection is invariant!)
• Pairwise relationships adequately represent the evolutionary tree
• No gene losses or missing predictions
• Alternative splicing can be ignored!
• No gene translocations after tandem duplication
![Page 20: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/20.jpg)
Orthology prediction methods
• Two genomes
  – Reciprocal best BLAST hit
• Multiple genomes
  – Clustering of
    • reciprocal best hits
    • protein similarities
[Figure labels: Query; Blast hits]
![Page 21: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/21.jpg)
Reciprocal Blast Best Hits
Advantages:
• Fast, well understood
• Works well for distant lineages
• Can correlate with protein structure (domains)
Disadvantages:
• Only provides 1:1 orthologues in the best case
• Can be difficult to reconcile with the species tree
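As a rough illustration of the reciprocal best hit idea, here is a minimal sketch. It assumes the BLAST results have already been reduced to a dictionary of (query, subject) → score for each search direction; the gene names and scores are invented.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: mouse gene M2 has a lineage-specific duplicate M2b.
human_vs_mouse = {("H1", "M1"): 900, ("H1", "M2a"): 100,
                  ("H2", "M2a"): 800, ("H2", "M2b"): 780}
mouse_vs_human = {("M1", "H1"): 900, ("M2a", "H2"): 800, ("M2b", "H2"): 780}
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

In this toy case the duplicate M2b is left without a partner, which is the 1:1 limitation listed under the disadvantages.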
![Page 22: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/22.jpg)
Genes on chromosome of species 1
Genes on chromosome of species 2
![Page 23: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/23.jpg)
?
Reciprocal Blast Best Hits
![Page 24: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/24.jpg)
?
Reciprocal Blast Best Hits
![Page 25: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/25.jpg)
How to add duplicated genes? Synteny
Ensembl compara in the past
• Local gene order tends to be conserved in mammalian lineages
• Look for inparalogs locally even if the protein distances don’t add up (sequence error, sampling error etc.)
![Page 26: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/26.jpg)
?
Blast Best Hits in Local Regions
![Page 27: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/27.jpg)
?
Blast Best Hits in Local Regions
![Page 28: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/28.jpg)
Problems with relying only on synteny
Local homologs are often not inparalogs:
•Local rearrangements
•Missing predictions
(neighbouring orphans)
•Need sanity checking
![Page 29: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/29.jpg)
Human and Mouse chromosomes:
• Extensive rearrangements only over larger regions
• Conservation of gene order in the short range
![Page 30: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/30.jpg)
Olfactory Orthology from compara
[Figure: orthology between genes on mouse chromosome 2 and rat chromosome 3; legend: one to one, one to many, many to many, many to one]
![Page 31: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/31.jpg)
Olfactory Orthology
[Figure: orthology between genes on mouse chromosome 2 and rat chromosome 3; legend: one to one, one to many, many to many, many to one]
![Page 32: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/32.jpg)
Inparanoid
• Remm,M., Storm,C.E. and Sonnhammer,E.L.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052.
• Avoids multiple alignments and phylogenetic methods for speed and to avoid errors
• Heuristics are implicitly phylogenetic
![Page 33: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/33.jpg)
How Inparanoid works
[Flowchart:]
Longest transcripts
Pairwise alignment scores (use cutoff)
Reciprocal best hits are orthologues
Add lineage-specific duplicates (inparalogs) with confidences
Resolve conflicts
Orthology
![Page 34: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/34.jpg)
Identify “main” orthologues
Reciprocal best hits are orthologues
Identify “inparalog” candidates
Add lineage-specific duplicates (inparalogs) with confidences
[Flowchart repeated: longest transcripts; pairwise alignment scores (use cutoff); reciprocal best hits are orthologues; add lineage-specific duplicates (inparalogs) with confidences; resolve conflicts; orthology]
![Page 35: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/35.jpg)
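Confidence values for inparalogs
1. Most confident inparalog is when the inparalog is sequence identical to main orthologue.
2. Maximum value = $\mathrm{score}_{\mathrm{identical}} - \mathrm{score}_{\mathrm{orthologs}}$
3. Confidence = $(\mathrm{score}_{\mathrm{inparalog}} - \mathrm{score}_{\mathrm{orthologs}}) / (\mathrm{score}_{\mathrm{identical}} - \mathrm{score}_{\mathrm{orthologs}})$
[Figure labels: A, B]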
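The confidence formula on this slide translates directly into a few lines of code. This is a sketch of the formula as stated here, not Inparanoid's own implementation; the function name and the example scores are hypothetical.

```python
def inparalog_confidence(score_inparalog, score_orthologs, score_identical):
    """Confidence of an inparalog, scaled between the two reference scores.

    score_orthologs: score between the two main orthologues (lower bound).
    score_identical: score of the main orthologue against an identical
        sequence (upper bound).
    score_inparalog: score between the main orthologue and the candidate.
    """
    span = score_identical - score_orthologs
    if span <= 0:
        raise ValueError("score_identical must exceed score_orthologs")
    return (score_inparalog - score_orthologs) / span


# Invented scores: a candidate halfway between the two bounds.
halfway = inparalog_confidence(800, 600, 1000)
```

A candidate identical to the main orthologue scores 1.0, and one no closer than the orthologue in the other species scores 0.0.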
![Page 36: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/36.jpg)
Resolving conflicts
[Flowchart repeated, with “Resolve conflicts” highlighted: longest transcripts; pairwise alignment scores (use cutoff); reciprocal best hits are orthologues; add inparalogs with confidences; resolve conflicts; orthology]
1. Merge if orthologs already clustered in same group
2. Merge if two equally good best hits
3. Delete weaker group
4. Merge significantly overlapping
5. Divide overlapping
![Page 37: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/37.jpg)
Why are there conflicts?
• Protein differences are a proxy for evolutionary time
• Protein similarity scores approximate protein differences (sequence, alignment, estimation errors)
• Pairwise scores can be used to (conceptually) recover phylogenetic (tree) data
![Page 38: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/38.jpg)
Alternatives: phylogenetic methods
• Inparanoid is great because it models phylogeny explicitly
• Why not use phylogenetic methods directly?
• Multiple estimators of protein distance
4 pairwise scores used out of 30
![Page 39: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/39.jpg)
Phylogenetic methods
• Iterative distance methods are very fast, suitable for whole genome analyses (variants on neighbor joining)
• Statistically consistent with evolutionary models (can have explicit error model with evolutionary distances, e.g. bionj)
• Inparanoid type consistency checking can be carried out after phylogeny is predicted
![Page 40: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/40.jpg)
Is protein similarity a good proxy for evolutionary distance?
Advantages
• Does not saturate over long evolutionary distances
• Easy to align / predict genes (unlike non-coding regions)
• Sometimes cDNA sequence is not available
Disadvantages
• Assumes constant evolutionary rate
• Assumes invariant selection
![Page 41: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/41.jpg)
Use Silent Mutations as a genetic clock
• Redundant genetic code, e.g. GCA, GCC, GCG, GCT → Alanine
• Third base of a codon “wobbles” without changing the translated amino acid
• dS approximates neutral mutation rate (without selection) in coding regions
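The redundancy of the code can be made concrete with the standard genetic code table. The sketch below only counts codons that differ synonymously or non-synonymously between two aligned, gap-free coding sequences; it is not a dS estimator (no correction for the number of synonymous sites or for multiple hits, which is what tools such as codeml or yn00 provide). The function name and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(cds_a, cds_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3:
        raise ValueError("sequences must be aligned, gap-free and in frame")
    synonymous = nonsynonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3], cds_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # silent change, e.g. GCA -> GCT (both Ala)
        else:
            nonsynonymous += 1   # amino acid changes, e.g. AAA -> AGA (Lys -> Arg)
    return synonymous, nonsynonymous


alanine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "A")
differences = count_codon_differences("GCAATGAAA", "GCTATGAGA")
```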
![Page 42: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/42.jpg)
dS as proxy for evolutionary distance
• Easier to align than Ancestral Repeats
• Not neutral sequence!!
• Genomic > 2x variation in dS
• Assumes most gene families are local due to tandem duplication and share dS
• Assume (partial) gene conversions are infrequent
![Page 43: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/43.jpg)
dS Caveats
• Saturates at long evolutionary distances (but less so than many think)
• Beware of GC / codon frequency biases (use ML rather than heuristic methods)
• Multiple alignment / tree rather than pairwise for best results
• Slow to estimate accurately
• Missing values (where dS saturates)
![Page 44: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/44.jpg)
codeml dS accuracy at 400 codons
![Page 45: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/45.jpg)
yn00 dS accuracy at 400 codons
![Page 46: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/46.jpg)
Use all transcripts
![Page 47: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/47.jpg)
PhyOP: transcript trees from dS
1. Whole genome alignment identifies homologues
2. codeml for dS calculation
3. Ignore large dS
4. Hierarchical cluster
5. Fitch Margoliash modified to handle missing values to give giant transcript tree
6. Heuristics based on lowest dS to select 1 “representative” transcript per gene
7. Map gene tree to species tree
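Steps 3 and 4 can be pictured with a small sketch. It assumes, purely for illustration, that clustering is single-linkage (transcripts joined whenever any pairwise dS falls below a cutoff) and that the cutoff is a free parameter; neither the linkage rule nor the value 2.0 below comes from the slides, and the transcript names are invented.

```python
def cluster_by_ds(ds_pairs, max_ds):
    """Group transcripts connected by pairwise dS values at or below max_ds.

    ds_pairs maps (transcript_a, transcript_b) to dS, or to None when the
    estimate is missing (for example because dS saturated).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), ds in ds_pairs.items():
        root_a, root_b = find(a), find(b)
        if ds is not None and ds <= max_ds:  # ignore large or missing dS
            parent[root_a] = root_b

    clusters = {}
    for transcript in list(parent):
        clusters.setdefault(find(transcript), set()).add(transcript)
    return sorted(sorted(members) for members in clusters.values())


toy_ds = {("h1", "d1"): 0.3, ("d1", "m1"): 0.5, ("h1", "h2"): 3.2,
          ("h2", "d2"): 0.4, ("m1", "d2"): None}
toy_clusters = cluster_by_ds(toy_ds, max_ds=2.0)
```

Each resulting cluster would then be passed to the tree-building step.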
![Page 48: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/48.jpg)
Fitch Margoliash
Minimize $\sum_{i<j} (d_{ij} - p_{ij})^2 / d_{ij}^2$
Where
• $d_{ij}$ is the pairwise distance estimate
• $p_{ij}$ is the distance between i and j on the tree
Assumes that the error is a fixed proportion of the total distance
(Fitch and Margoliash, 1967)
Easily adapted for missing values
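To show why missing values are easy to accommodate, here is a minimal sketch of the Fitch-Margoliash least-squares criterion, in which each squared deviation is weighted by the inverse square of the observed distance. It only evaluates the criterion for a given set of tree path lengths; it does not search for the tree. Names and numbers are hypothetical, and this is not the PhyOP code.

```python
def fitch_margoliash_error(observed, tree):
    """Weighted least-squares misfit between observed and tree distances.

    observed maps (i, j) to the pairwise estimate d_ij, or to None when the
    estimate is missing; tree maps (i, j) to the path length p_ij on the tree.
    Missing pairs simply contribute nothing to the sum.
    """
    total = 0.0
    for pair, d in observed.items():
        if d is None:
            continue  # missing value, e.g. saturated dS
        if d <= 0:
            raise ValueError("observed distances must be positive")
        p = tree[pair]
        total += (d - p) ** 2 / d ** 2
    return total


toy_observed = {("a", "b"): 0.2, ("a", "c"): 0.5, ("b", "c"): None}
toy_tree = {("a", "b"): 0.22, ("a", "c"): 0.45, ("b", "c"): 0.47}
toy_error = fitch_margoliash_error(toy_observed, toy_tree)
```

Both usable pairs here are off by 10% of the observed distance, so each contributes 0.01 to the total.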
![Page 49: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/49.jpg)
PhyOP pipeline Part 1
![Page 50: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/50.jpg)
3 ways in which transcript trees map to genes
• Simple clades: only 1 transcript per gene in orthologous relationship (most genes)
• Unambiguous clades: alternative transcripts are in the same orthologous relationships
![Page 51: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/51.jpg)
3 ways in which transcript trees map to genes
• Ambiguous clades: alternative transcripts are in inconsistent relationships (small proportion)
![Page 52: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/52.jpg)
Where are most transcripts?
Assumption:
Most transcripts are not in any sort of orthologous relationships: their conjugates have not been predicted.
Reality:
Most transcripts are in the same clade as their alternative transcripts: because of shared exons, they are most similar to their alternatively transcribed siblings.
![Page 53: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/53.jpg)
How to choose between alternative transcripts?
• Use conserved exon boundaries: excludes exogenous sequence
• Use distance to its ortholog (not tree distance because these will be equal): high dS means exogenous sequence and will be excluded
With multiple partially overlapping clades, this is more difficult
![Page 54: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/54.jpg)
PhyOP pipeline Part 2
![Page 55: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/55.jpg)
Example
Four alternative transcripts (1-4), 6 dog genes, 3 human genes
• Clade 1 transcripts: Doga1 Dogb1 Dogc1 Dogf1 Humanb1 Humanc1
• Clade 2 transcripts: Dogb2 Dogc2 Doge2 Dogf2 Humana2 Humanc2
• Clade 3 transcripts: Doga3 Dogb3 Dogd3 Doge3 Dogf3 Humana3 Humanb3 Humanc3
![Page 56: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/56.jpg)
“Anointing” transcripts to keep: Example
Circularity / boot-strapping problem: the transcript in the other species which is used for “anointing” might itself be discarded
• Doga1 is closer to Humanb1 than Doga3 to any human transcript: keep Doga1, discard Doga3
• Humanb3 is closer to Doge3 than Humanb1 to any dog transcript: keep Humanb3, discard Humanb1
• Oops. Now no Human transcript is close to Doga1.
• Clade 1 transcripts: Doga1 Dogb1 Dogc1 Dogf1 Humanb1 Humanc1
• Clade 2 transcripts: Dogb2 Dogc2 Doge2 Dogf2 Humana2 Humanc2
• Clade 3 transcripts: Doga3 Dogb3 Dogd3 Doge3 Dogf3 Humana3 Humanb3 Humanc3
![Page 57: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/57.jpg)
How to avoid circularity
• Previously: Use mean distance to all other transcripts in the other species. Close eyes. Hope problem goes away.
• Now:
1. Take all transcript pairs from all three clades starting with the closest dS
2. “Anoint” both transcripts from the pair and throw away all other transcripts
3. Ignore all pairs which involve discarded transcripts
4. Recurse
5. Complicated by trying to keep merged genes
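The greedy procedure in steps 1 to 4 can be sketched as a single pass over the pairs sorted by dS. The sketch assumes that “all other transcripts” means the other transcripts of the same two genes, and it ignores the merged-gene complication of step 5. The dS values are invented, and the gene of each transcript is taken, for this example only, to be its name minus the trailing transcript number.

```python
def anoint_transcripts(pairs, gene_of):
    """Pick one transcript per gene, working from the closest pair outwards.

    pairs: iterable of (dS, transcript_a, transcript_b) across species.
    gene_of: function mapping a transcript name to its gene.
    Returns a dict gene -> anointed transcript.
    """
    anointed = {}
    for ds, t_a, t_b in sorted(pairs):
        g_a, g_b = gene_of(t_a), gene_of(t_b)
        # A transcript counts as discarded once its gene has anointed a
        # different transcript; pairs involving it are ignored.
        if anointed.get(g_a, t_a) != t_a or anointed.get(g_b, t_b) != t_b:
            continue
        anointed[g_a] = t_a
        anointed[g_b] = t_b
    return anointed


toy_pairs = [(0.30, "Doga3", "Humana3"), (0.10, "Doga1", "Humanb1"),
             (0.20, "Doge3", "Humanb3"), (0.25, "Doge3", "Humana3")]
kept = anoint_transcripts(toy_pairs, gene_of=lambda name: name[:-1])
```

Because Doga1 and Humanb1 are anointed together, neither can later be orphaned by a decision made from the other species' side, which is the circularity the previous slide describes.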
![Page 58: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/58.jpg)
From transcripts to genes
![Page 59: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/59.jpg)
dS for orthologues
![Page 60: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/60.jpg)
![Page 61: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/61.jpg)
dS distributions can be an indication of orthologue quality
![Page 62: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/62.jpg)
Dog vs. Human Genomes
![Page 63: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/63.jpg)
Conservation of Gene Order in Mouse / Rat ORs
[Figure: scatter plot of rat OR gene order (y-axis, 0 to 1600) against mouse OR gene order (x-axis, 0 to 1200)]
![Page 64: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/64.jpg)
How to improve on using dS?
• dS better dates the history, but fails for distant homologs.
• dN works for distant homologs, but tends to be subject to selective pressures.
Can we combine them?
• Full codon evolutionary model would account for this automatically
• Use bootstrapping: if values -> random, no longer informative
![Page 65: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/65.jpg)
TreeBeST
Tree Building guided by Species Tree
http://treesoft.sourceforge.net/treebest.shtml
Heng Li
• Tree merge algorithm: merge several trees that are built from the same alignment with different models.
• Species-aware maximum likelihood: use species phylogeny to correct errors
![Page 66: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/66.jpg)
Maximize use of underlying data
5 tree types:
1. Synonymous distance NJ
2. Non-synonymous distance NJ
3. P distance NJ
4. WAG maximum likelihood
5. HKY maximum likelihood
Each predicted from same data
Use bootstrap values to identify optimal branches using context free grammar
![Page 67: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/67.jpg)
Context Free Grammar in TreeBeST
Given a set of binary rooted trees with the same leaf set V, reconstruct a binary rooted tree such that:
• each branch of the resultant tree comes from one of the given trees
• the resultant tree minimizes a certain objective function
• additivity
• topological independence
![Page 68: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/68.jpg)
Maximize use of underlying data
• Switch automatically between
  – codon: dN, dS;
  – nucleotide: HKY and
  – protein: P-distance
  depending on bootstrap
• Fix high probability errors by minimizing distance to species topology
![Page 69: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/69.jpg)
Slide from Heng Li
![Page 70: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/70.jpg)
Slide from Heng Li
Trees reconciled optimally
![Page 71: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/71.jpg)
Is TreeBeST more reliable?
Slide from Heng Li
![Page 72: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/72.jpg)
Caveats
• Bootstrapping may not be the most effective way to test the support for a particular tree given the underlying data
• The underlying data are not the state of the art, but codon + ML cannot be used because of speed
• Limited by multiple alignment
• Reconciliation with species tree can mask real gene losses/duplications
![Page 73: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/73.jpg)
Alternative transcripts reveal merged genes
• Ensembl includes merged genes
  – 435 dog
  – 346 human
![Page 74: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/74.jpg)
Finding merged genes
![Page 75: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/75.jpg)
What is the best way to deal with alternative transcripts?
• Create virtual transcript
Virtual translation
![Page 76: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/76.jpg)
What is the best way to deal with alternative transcripts?
If two transcripts do not overlap and have homology to each other, they may be tandemly duplicated gene models merged in error
Include both transcripts in pipeline
![Page 77: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/77.jpg)
How to run orthology pipeline for whole genomes
• Take all proteins and cDNA
• Make sure they correspond exactly: no stop codons, no genomic mismatches
• All vs all blastall
• Protein-guided alignments of cDNA
• Create virtual translation peptide
• Run tree prediction, e.g. TreeBeST
• Reconcile with species tree to derive orthology
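The second step of this pipeline, checking that each cDNA corresponds exactly to its protein, might look like the following sketch. It assumes upper-case, in-frame coding sequence with an optional terminal stop codon and uses the standard genetic code; the function names and the returned messages are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - len(cds) % 3, 3))


def check_cdna_matches_protein(cds, protein):
    """Return a list of problems; an empty list means the pair is consistent."""
    problems = []
    if len(cds) % 3:
        problems.append("length not a multiple of 3")
    translated = translate(cds)
    if translated.endswith("*"):
        translated = translated[:-1]  # a terminal stop codon is allowed
    if "*" in translated:
        problems.append("internal stop codon")
    if translated != protein:
        problems.append("translation differs from protein")
    return problems


ok = check_cdna_matches_protein("ATGGCTAAATAA", "MAK")
bad = check_cdna_matches_protein("ATGTAAGCT", "MA")
```

Pairs that fail the check would be fixed or dropped before the all-against-all search, so that the protein-guided cDNA alignments stay in frame.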
![Page 78: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/78.jpg)
Predicting orthology gets easier with more genes/species
1. Phylogenetic methods improve in power with more data
2. Heuristic / pairwise methods decrease in power / become more ambiguous with more data
![Page 79: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/79.jpg)
Why is orthology prediction so hard for mammals?
Because gene prediction is so hard
![Page 80: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/80.jpg)
The human genome (euchromatic sequence)
Unknown (old repetitive junk?)
Protein coding: 1.2%
UTR: 0.3%
Repeats (transposable elements, …): ~45%
Conserved non-coding (3.5% ?)
Neutral
![Page 81: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/81.jpg)
Signals in DNA are weak
• non-canonical splice sites
• promoters without TATA box
• introns/exons can have varying lengths
• ...
probabilistic models: Hidden Markov Models
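To make the Hidden Markov Model idea concrete, here is a toy two-state model decoded with the Viterbi algorithm. It is only a sketch: the two states (a GC-rich "coding-like" state C and an AT-rich "non-coding-like" state N) and every probability below are invented for illustration and are not taken from any real gene finder.

```python
import math

STATES = ("C", "N")  # C = coding-like (GC-rich), N = non-coding-like (AT-rich)
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of state labels."""
    if not seq:
        return ""
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAAGCGCGCGC"))  # prints NNNNNNCCCCCCCC
```

Real gene finders use many more states (exons in each frame, introns, splice sites, UTRs), but the decoding step follows the same pattern.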
![Page 82: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/82.jpg)
Accuracy of ab-initio gene prediction
• Nucleotide level: 90% sensitivity / 90% selectivity
• Exon level: 70% sensitivity / 50% selectivity
• Gene level: 40% sensitivity / 30% selectivity
• False positives: difficult to refute
• False negatives: will be missed
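For reference, the two measures can be computed as follows at the nucleotide level. This sketch assumes that "selectivity" here means the fraction of predicted coding bases that are truly coding, TP / (TP + FP), and that sensitivity is TP / (TP + FN). The function name and the example coordinates are made up.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Compare two collections of coding positions.

    Returns (sensitivity, selectivity), where
      sensitivity = TP / (TP + FN)  # fraction of real coding bases found
      selectivity = TP / (TP + FP)  # fraction of predicted bases that are real
    """
    true_coding, predicted_coding = set(true_coding), set(predicted_coding)
    tp = len(true_coding & predicted_coding)
    fn = len(true_coding - predicted_coding)
    fp = len(predicted_coding - true_coding)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    selectivity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, selectivity


annotated = range(100, 200)  # 100 annotated coding bases
predicted = range(110, 210)  # same length, shifted by 10 bases
print(nucleotide_accuracy(annotated, predicted))  # prints (0.9, 0.9)
```

The same counts at exon or gene level require an exact match of boundaries, which is one reason the numbers fall so sharply from nucleotide to gene level.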
![Page 83: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/83.jpg)
Limitations of ab-initio models
• Limited to training set
• Limited to model (strange genes)
• Problems with long genes
• Small exons are difficult to find
• Terminal exons are difficult to find
  – No splice signals, other signals variable
• e.g. Genscan
![Page 84: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/84.jpg)
Comparative/homology methods
• Add extra data to locate genes
• Compare genome to known sequences (same or different organism)
  – cDNAs
  – ESTs
  – Known protein sequences
  – e.g. Genewise
• Compare genome to other genome
  – e.g. TwinScan
![Page 85: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/85.jpg)
Using cDNAs/ESTs
• cDNAs: provide 3'UTR and 5'UTR; provide full gene structure
  – Expensive and thus rare
  – Contamination with genomic DNA
• ESTs: cheap and thus plentiful
  – Highly redundant
  – Of variable quality
  – Not complete
• Both: biased towards highly expressed genes
![Page 86: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/86.jpg)
Using cDNAs
5'UTR Exon Intron Exon Intron Exon 3'UTR
cDNA sequence
• Alignment between DNA sequences
  – Introns and reading frames
![Page 87: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/87.jpg)
Using known protein sequences
5'UTR Exon Intron Exon Intron Exon 3'UTR
Predicted protein sequence
Known protein sequence
Alignment between two protein sequences
Alignment of a “cDNA” to a genome
Implicit cDNA sequence
![Page 88: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/88.jpg)
Using another genome sequence
Genome 1: 5'UTR Exon Intron Exon Intron Exon 3'UTR
Genome 2: 5'UTR Exon Intron Exon Intron Exon 3'UTR
BLASTN results against Genome 2
• Add evidence to ab-initio model, e.g. TwinScan
• Align gene models between orthologous regions, e.g. DoubleScan
![Page 89: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/89.jpg)
Sweet spot for prediction by homology
[Figure, Guigo et al. (2000): sensitivity and specificity of ab-initio and homology-based prediction, plotted against similarity of known protein to target]
![Page 90: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/90.jpg)
Branches of gene trees scale symmetrically
Ideal world
Real world
[Figure: Median distance to root. x-axis: synonymous substitution rate / dS (0.0 to 2.0); y-axis: cumulative frequency (0.0 to 1.0); series: dana, dere, dmel, dsec, dsim, dyak]
• Variations in branch length
Rate analyses
![Page 91: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/91.jpg)
Sequence conservation between mouse and human genes
Mouse genome paper, Nature 420, 520-562
What orthologous genes should look like
![Page 92: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/92.jpg)
What orthologous genes should look like
• Exons conserved between genomes
• UTRs partially conserved between genomes
CGSC (2004)
![Page 93: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/93.jpg)
Gene validations using orthology
• Most genes have orthologues
• Almost all genes have mammalian homologs
• Exaptation of non-coding sequence is rare, especially for constitutively expressed exons
• Conservation of exon-intron structure (number and phase of exons)
• Conservation of length
• Conservation of domains
• Conservation of synteny
![Page 94: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/94.jpg)
Look carefully at genes
• For example: small introns
Introns
Pseudogene?
![Page 95: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/95.jpg)
Conservation of splice sites:
• Insertions / losses of introns are rare
• Phase never changes
• Aligned positions should nearly always match, allowing for alignment errors
• Valid mismatches may represent insertions (outside of protein domains)
• Find retrogenes
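The phase check can be sketched in a few lines. The phase of an intron is the number of coding bases that precede it, modulo 3, so it can be derived from the coding lengths of the exons alone. The function names and the example exon lengths below are hypothetical, and the sketch assumes the exon lengths cover coding sequence only (no UTR).

```python
def intron_phases(cds_exon_lengths):
    """Phase of each intron = number of coding bases before it, modulo 3."""
    phases, coding_bases = [], 0
    for length in cds_exon_lengths[:-1]:  # there is no intron after the last exon
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


def same_intron_structure(exons_a, exons_b):
    """True if two gene models have the same number of exons and intron phases."""
    return len(exons_a) == len(exons_b) and intron_phases(exons_a) == intron_phases(
        exons_b
    )


print(intron_phases([100, 50, 33]))                           # prints [1, 0]
print(same_intron_structure([100, 50, 33], [103, 47, 30]))    # prints True
print(same_intron_structure([100, 50, 33], [101, 50, 33]))    # prints False
```

A single-exon model gives an empty phase list, which is what a candidate retrogene would look like when compared against a multi-exon orthologue.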
![Page 96: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/96.jpg)
Conservation of splice sites:
• Tandem duplication of non-coding may result in the appearance of splice site conservation
• Check if sequence similarity is absolute
• Check coding potential (tandem duplicates are often fast-evolving genes under positive selection)
![Page 97: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/97.jpg)
Retrogenes
• Loss of introns due to retrotransposition can be confirmed by loss of synteny (blastz)
• Not all retrogenes are non-functional
• Ancient ones are functional
• Recent retrogenes can be assumed to be dead
![Page 98: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/98.jpg)
Gene validations using orthology
![Page 99: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/99.jpg)
Make sure orthology properties look appropriate
Homo / Monodelphis 1:1 orthologues

dN/dS: 0.086
dS: 1.02
Amino acid sequence identity: 81.0%
Pairwise alignment coverage: 94.2%

| | Homo sapiens | Monodelphis domestica |
|---|---|---|
| Number of exons | 9 | 9 |
| Sequence length (codons) | 471 | 445 |
| Unspliced transcript length (bp) | 27,241 | 25,365 |
| G+C content at 4D sites | 56.9% | 48.7% |
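A check of this kind can be automated so that each predicted pair is flagged when its properties look unlike those of typical orthologues. The sketch below is illustrative only: the `OrthologuePair` class, the function name and all three thresholds are hypothetical and would need tuning against real data. The example is filled in with the summary values from the slide.

```python
from dataclasses import dataclass


@dataclass
class OrthologuePair:
    exons_a: int
    exons_b: int
    codons_a: int
    codons_b: int
    dn_ds: float
    alignment_coverage: float  # fraction between 0 and 1


def suspicious_properties(
    pair, min_length_ratio=0.8, min_coverage=0.8, max_dn_ds=1.0
):
    """Return the reasons (if any) a predicted pair deserves manual inspection.

    All thresholds are illustrative defaults, not published cut-offs.
    """
    reasons = []
    if pair.exons_a != pair.exons_b:
        reasons.append("exon count differs")
    shorter = min(pair.codons_a, pair.codons_b)
    longer = max(pair.codons_a, pair.codons_b)
    if shorter / longer < min_length_ratio:
        reasons.append("length differs")
    if pair.alignment_coverage < min_coverage:
        reasons.append("low alignment coverage")
    if pair.dn_ds >= max_dn_ds:
        reasons.append("dN/dS not below 1")
    return reasons


slide_values = OrthologuePair(9, 9, 471, 445, 0.086, 0.942)
print(suspicious_properties(slide_values))  # prints []
```

Returning a list of reasons instead of a single pass/fail flag makes it easier to see which property caused a gene model to be questioned.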
![Page 100: Orthology predictions for whole mammalian genomes](https://reader038.fdocuments.net/reader038/viewer/2022110213/568147e2550346895db517df/html5/thumbnails/100.jpg)
What can you do with orthologs?
Wait for part II