Comparative Genomics
![Page 1: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/1.jpg)
Comparative Genomics
Ross Hardison, Penn State University
Major collaborators:
Webb Miller, Francesca Chiaromonte, Laura Elnitski, David King, et al., PSU
James Taylor: Courant Institute, New York University
David Haussler, Jim Kent, Univ. California at Santa Cruz
Ivan Ovcharenko, Lawrence Livermore National Lab
PSU Nov. 28, 2006
![Page 2: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/2.jpg)
Major goals of comparative genomics
• Identify all DNA sequences in a genome that are functional
– Selection to preserve function
– Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of sequence
• Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research
![Page 3: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/3.jpg)
Three major classes of evolution
• Neutral evolution
– Acts on DNA with no function
– Genetic drift allows some random mutations to become fixed in a population
• Purifying (negative) selection
– Acts on DNA with a conserved function
– Signature: Rate of change is significantly slower than that of neutral DNA
– Sequences with a common function in the species examined are under purifying (negative) selection
• Darwinian (positive) selection
– Acts on DNA in which changes benefit an organism
– Signature: Rate of change is significantly faster than that of neutral DNA
![Page 4: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/4.jpg)
Ideal case for interpretation
[Figure: similarity plotted against position along chromosome, with segments labeled neutral DNA, negative (purifying) selection, and positive (adaptive) selection]
• Negative (purifying) selection: exonic segments coding for regions of a polypeptide with common function in two species.
• Positive (adaptive) selection: exonic segments coding for regions of a polypeptide in which change is beneficial to one of the two species.
![Page 5: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/5.jpg)
Taxonomic distribution of homologs of mouse proteins
Waterston et al.
![Page 6: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/6.jpg)
Conservation in different parts of genes
Waterston et al, Mouse Genome, Nature
Average percent identity (black) or percent aligned (blue) for 10,000 orthologous genes
![Page 7: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/7.jpg)
Levels of conservation (Human vs Mouse) in different types of proteins
Black: all orthologous proteins (Hum-mouse), 12,845 1:1 gene pairs
Red: proteins with recognized domains
Gray: proteins without recognized domains
Black: Nuclear proteins
Red: Cytoplasmic proteins
Gray: Extracellular proteins; positive, diversifying selection
KA = rate of nonsynonymous substitutions
KS = rate of synonymous substitutions
Waterston et al. Nature 2002
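KA and KS are usually compared as a ratio. A ratio well below 1 is commonly read as purifying selection, a ratio near 1 as neutral evolution, and a ratio above 1 as positive (diversifying) selection. The sketch below illustrates that reading; the function name, the `tolerance` band around 1, and the input rates are illustrative assumptions, not values from Waterston et al.

```python
def classify_ka_ks(ka, ks, tolerance=0.1):
    """Return (KA/KS ratio, label) for one gene pair.

    `tolerance` is a hypothetical band around 1.0 treated as "neutral";
    real analyses use statistical tests rather than a fixed cutoff.
    """
    if ks <= 0:
        raise ValueError("KS must be positive to form the ratio")
    ratio = ka / ks
    if ratio < 1.0 - tolerance:
        return ratio, "purifying"
    if ratio > 1.0 + tolerance:
        return ratio, "positive"
    return ratio, "neutral"


# Hypothetical rates for three gene pairs.
examples = {
    "gene_a": classify_ka_ks(0.05, 0.50),
    "gene_b": classify_ka_ks(0.50, 0.50),
    "gene_c": classify_ka_ks(0.60, 0.50),
}
```

A single ratio per gene averages over all codons, so a gene with a few adaptively evolving sites can still show a ratio below 1.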
![Page 8: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/8.jpg)
Rat-specific gene expansions
• Genes that have expanded in number in rats are enriched in
– Immune function/antigen recognition
  • immunoglobulins, T-cell receptor alpha
– Detoxification
  • cytochrome P450
– Reproduction
  • alpha2u-globulin
– Olfaction and odorant detection
  • Olfactory receptors
• Also are rapidly evolving
• Segmental duplications are enriched for the same genes
Rat Genome SPC 2004 Nature
![Page 9: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/9.jpg)
Adaptive remodeling of gene clusters
Figure 13 Adaptive remodeling of genomes and genes. a, Orthologous regions of rat, human and mouse genomes encoding pheromone-carrier proteins of the lipocalin family (a2u-globulins in rat and major urinary proteins in mouse) shown in brown. Zfp37-like zinc finger genes are shown in blue. Filled arrows represent likely genes, whereas striped arrows represent likely pseudogenes. Gene expansions are bracketed. Arrowhead orientation represents transcriptional direction. Flanking genes 1 and 2 are TSCOT and CTR1, respectively.
Rat Genome SPC 2004 Nature
![Page 10: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/10.jpg)
DCODE.org Comparative Genomics: Align your own sequences
blastZ multiZ and TBA
![Page 11: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/11.jpg)
zPicture interface for aligning sequences
![Page 12: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/12.jpg)
Automated extraction of sequence and annotation
![Page 13: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/13.jpg)
Pre-computed alignment of genomes
• blastZ for pairwise alignments
• multiZ for multiple alignment
– Human, chimp, mouse, rat, chicken, dog
– Also multiple fly, worm, yeast genomes
– Organize local alignments: chains and nets
• All against all comparisons
– High sensitivity and specificity
• Computer cluster at UC Santa Cruz
– 1024 cpus Pentium III
– Job takes about half a day
• Results available at
– UCSC Genome Browser http://genome.ucsc.edu
– Galaxy server: http://www.bx.psu.edu
Webb Miller
David Haussler
Jim Kent
Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research
![Page 14: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/14.jpg)
Genome-wide local alignment chains
Human: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.
blastZ: Each segment of human is given the opportunity to align with all mouse sequences.
Run blastZ in parallel for all human segments. Collect all local alignments above threshold.
Organize local alignments into a set of chains based on position in assembly and orientation.
[Figure labels: Human, Mouse, Level 1 chain, Level 2 chain, Net]
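The two bookkeeping steps on this slide, breaking the assembly into fixed-size segments and grouping local alignments into chains by position and orientation, can be sketched as follows. This is a deliberately naive illustration: the greedy chaining rule and the tuple layout are assumptions made for the example and are not the chaining algorithm used in the UCSC pipeline.

```python
def split_into_segments(assembly_length, segment_length):
    """Return half-open (start, end) intervals that tile the assembly."""
    return [
        (start, min(start + segment_length, assembly_length))
        for start in range(0, assembly_length, segment_length)
    ]


def greedy_chains(alignments):
    """Group local alignments into collinear chains (naive sketch).

    Each alignment is a hypothetical tuple:
    (h_start, h_end, m_chrom, m_start, m_end, strand),
    with mouse coordinates assumed to increase along the given strand.
    An alignment joins the first chain whose last member lies on the same
    mouse chromosome and strand and ends before it in both genomes.
    """
    chains = []
    for aln in sorted(alignments):
        h_start, _h_end, m_chrom, m_start, _m_end, strand = aln
        for chain in chains:
            last = chain[-1]
            if (last[2] == m_chrom and last[5] == strand
                    and last[1] <= h_start and last[4] <= m_start):
                chain.append(aln)
                break
        else:
            chains.append([aln])
    return chains


# 2.9 Gb in 10 Mb pieces gives 290 segments, in line with "about 300".
segments = split_into_segments(2_900_000_000, 10_000_000)

# Hypothetical local alignments: two collinear hits on chr1, one on chr2.
toy_alignments = [
    (0, 100, "chr1", 500, 600, "+"),
    (200, 300, "chr1", 700, 800, "+"),
    (150, 250, "chr2", 10, 110, "+"),
]
toy_chains = greedy_chains(toy_alignments)
```

Because each human segment is independent, the alignment jobs can run in parallel and the chaining step only needs the collected coordinates.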
![Page 15: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/15.jpg)
Comparative genomics to find functional sequences
[Figure: genome sizes in million base pairs (Mbp) for human, mouse, and rat; values shown on the slide: 2,900, 2,400, 2,500, 1,200]
Find common sequences: blastZ, multiZ
All mammals: 1000 Mbp
Also birds: 72 Mb
Identify functional sequences: ~ 145 Mbp
Papers in Nature from mouse and rat and chicken genome consortia, 2002, 2004
![Page 16: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/16.jpg)
Use measures of alignment quality to discriminate functional from nonfunctional DNA
• Compute a conservation score adjusted for the local neutral rate
• Score S for a 50 bp region R is the normalized fraction of aligned bases that are identical
– Subtract mean for aligned ancestral repeats in the surrounding region
– Divide by standard deviation
$$S = \frac{p - \mu}{\sqrt{\mu (1 - \mu) / n}}$$
p = fraction of aligned sites in R that are identical between human and mouse
μ = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region
n = number of aligned sites in R
Waterston et al., Nature
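The score described on this slide (subtract the local neutral mean, divide by the standard deviation) can be sketched numerically. The sketch assumes the standard deviation is the binomial one for a proportion, sqrt(μ(1 − μ)/n), where μ denotes the average identity in surrounding ancestral repeats; the window values are hypothetical.

```python
import math


def conservation_score(p, mu, n):
    """Normalized identity for one window.

    p  : fraction of aligned sites in the window that are identical
    mu : average identity in aligned ancestral repeats nearby
    n  : number of aligned sites in the window
    Assumes a binomial standard deviation, sqrt(mu * (1 - mu) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must be strictly between 0 and 1")
    return (p - mu) / math.sqrt(mu * (1.0 - mu) / n)


# Hypothetical window: 45 of 50 aligned sites identical, local neutral identity 0.70.
score = conservation_score(p=45 / 50, mu=0.70, n=50)

# A window that matches the local neutral identity scores zero.
neutral_like = conservation_score(p=0.70, mu=0.70, n=50)
```

Normalizing by the local neutral rate means the same percent identity scores higher in a fast-evolving region than in a slow-evolving one.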
![Page 17: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/17.jpg)
Decomposition of conservation score into neutral and likely-selected portions
[Figure: frequency distributions of S for neutral DNA (ARs), all DNA, and likely selected DNA (at least 5-6%)]
S is the conservation score adjusted for variation in the local substitution rate.
The frequency of the S score for all 50 bp windows in the human genome is shown.
From the distribution of S scores in ancestral repeats (mostly neutral DNA), can compute a probability that a given alignment could result from locally adjusted neutral rate.
Waterston et al., Nature
![Page 18: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/18.jpg)
DNA sequences of mammalian genomes
• Human: 2.9 billion bp, “finished”
– High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
– This is conserved, but not all is under selection.
• About 5-6% of the human genome has been under purifying selection since the rodent-primate divergence
• About 1.2% codes for protein
• The 4 to 5% of the human genome that is under selection but does not code for protein should have:
– Regulatory sequences
– Non-protein coding genes (UTRs and noncoding RNAs)
– Other important sequences
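The percentages on this slide can be checked with simple arithmetic. The sketch below uses only the figures quoted in the talk (2,900 Mbp genome, 5-6% under selection, 1.2% coding); the variable names are illustrative.

```python
genome_mbp = 2900.0                # human genome size quoted in the talk
selected_fraction = (0.05, 0.06)   # lower and upper estimates under purifying selection
coding_fraction = 0.012            # fraction that codes for protein

# Amount of sequence under selection, in Mbp (lower, upper).
selected_mbp = tuple(f * genome_mbp for f in selected_fraction)

# Selected but noncoding fraction (lower, upper).
noncoding_selected_fraction = tuple(f - coding_fraction for f in selected_fraction)
```

The lower bound works out to about 145 Mbp under selection, and the noncoding portion to roughly 3.8-4.8% of the genome, which the slide rounds to 4 to 5%.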
![Page 19: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/19.jpg)
Conservation score S in different types of regions
Red: Ancestral repeats (mostly neutral)
Blue: First class in label
Green: Second class in label
Waterston et al., Nature
![Page 20: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/20.jpg)
Leverage many species to improve accuracy and resolution of signals for constraint
ENCODE multi-species alignment group
Margulies et al., 2007
![Page 21: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/21.jpg)
Coverage of human by alignments with other vertebrates ranges from 1% to 91%
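[Figure: bar chart of the percent of human aligning with a second species (axis 0 to 100) for chimp, mouse, rat, dog, cow, opossum, platypus, chicken, frog, zebrafish, Tetraodon, and Fugu, annotated with divergence times in millions of years (5.4, 91, 92, 173, 220, 310, 360, 450) and a 5% marker]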
![Page 22: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/22.jpg)
Distinctive divergence rates for different types of functional DNA sequences
[Figure: values from 0 to 100 plotted against time of divergence from common ancestor to human (0 to 500 Myr ago); series: Genome, Coding exons, Ultraconserved (HM), and a logarithmic fit to Genome]
![Page 23: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/23.jpg)
Large divergence in cis-regulatory modules from opossum to platypus
[Figure: values from 0 to 100 plotted against time of divergence from common ancestor to human (0 to 500 Myr ago); series: Genome, Known regulatory regions, CpG islands, Functional promoters, Coding exons, Ultraconserved (HM)]
![Page 24: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/24.jpg)
cis-Regulatory modules conserved from human to fish
[Figure annotated with divergence times in millions of years: 91, 173, 310, 450]
• About 20% of CRMs
• Tend to regulate genes whose products control transcription and development
![Page 25: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/25.jpg)
cis-Regulatory modules conserved in eutherian mammals and marsupials
[Figure annotated with divergence times in millions of years: 91, 173, 310, 450]
• Human-marsupial alignments capture about 60% of CRMs
– Tend to occur close to genes involved in aminoglycan synthesis, organelle biosynthesis
• Human-mouse alignments capture about 87% of CRMs
– Tend to occur close to genes involved in apoptosis, steroid hormone receptors, etc.
• Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.
![Page 26: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/26.jpg)
Score multi-species alignments for features associated with function
• Multiple alignment scores
– Margulies et al. (2003) Genome Research 13: 2505-2518
– Binomial, parsimony
• PhastCons
– Siepel et al. (2005) Genome Research 15:1034-1050
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the most highly conserved sites
• GERP
– Cooper et al. (2005) Genome Research 15:901-913
– Genomic Evolutionary Rate Profiling
– Measures constraint as rejected substitutions = nucleotide substitution deficits
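The GERP bullet describes constraint as a deficit of substitutions relative to neutral expectation. A minimal sketch of that bookkeeping, assuming per-column expected and observed substitution counts have already been estimated elsewhere (the function names and numbers are hypothetical and not taken from Cooper et al.):

```python
def rejected_substitutions(expected, observed):
    """Per-column deficit: expected neutral substitutions minus observed ones."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    return [e - o for e, o in zip(expected, observed)]


def element_score(expected, observed):
    """Total deficit over a candidate element (sum of per-column deficits)."""
    return sum(rejected_substitutions(expected, observed))


# Hypothetical four-column element: neutral expectation of 2.0 substitutions per column.
expected_counts = [2.0, 2.0, 2.0, 2.0]
observed_counts = [0.5, 0.0, 1.0, 2.5]
per_column = rejected_substitutions(expected_counts, observed_counts)
total_deficit = element_score(expected_counts, observed_counts)
```

Positive values indicate fewer substitutions than expected (constraint); a negative value, as in the last column, indicates more substitutions than the neutral expectation.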
![Page 27: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/27.jpg)
phastCons: Likelihood of being constrained
Siepel et al. (2005) Genome Research 15:1034-1050
• Phylogenetic Hidden Markov Model
• Posterior probability that a site is among the most highly conserved sites
• Allows for variation in rates along lineages
c is “conserved” (constrained)
n is “nonconserved” (aligns but is not clearly subject to purifying selection)
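The posterior probability of the conserved state can be illustrated with a plain two-state hidden Markov model and the forward-backward algorithm. This is a simplified stand-in, not the phastCons implementation: it assumes the per-column likelihoods under the conserved (c) and nonconserved (n) states are supplied as inputs, and the transition and prior values are hypothetical.

```python
def posterior_conserved(lik_c, lik_n, stay_c=0.9, stay_n=0.9, prior_c=0.5):
    """Posterior P(state = c) for each alignment column.

    lik_c, lik_n : per-column likelihoods under the c and n states
                   (in a phylogenetic HMM these come from tree models).
    stay_c, stay_n, prior_c : hypothetical transition and start values.
    """
    emit = list(zip(lik_c, lik_n))
    if not emit:
        return []
    trans = [[stay_c, 1.0 - stay_c], [1.0 - stay_n, stay_n]]
    start = [prior_c, 1.0 - prior_c]

    # Forward pass, rescaled at each column to avoid underflow.
    fwd, prev = [], None
    for t, e in enumerate(emit):
        if t == 0:
            cur = [start[s] * e[s] for s in (0, 1)]
        else:
            cur = [e[s] * sum(prev[r] * trans[r][s] for r in (0, 1)) for s in (0, 1)]
        total = sum(cur)
        prev = [x / total for x in cur]
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in emit]
    for t in range(len(emit) - 2, -1, -1):
        nxt = [
            sum(trans[s][r] * emit[t + 1][r] * bwd[t + 1][r] for r in (0, 1))
            for s in (0, 1)
        ]
        total = sum(nxt)
        bwd[t] = [x / total for x in nxt]

    # Combine and normalize per column.
    post = []
    for f, b in zip(fwd, bwd):
        c, n = f[0] * b[0], f[1] * b[1]
        post.append(c / (c + n))
    return post


# Hypothetical likelihoods: five columns favoring c, then five favoring n.
toy_post = posterior_conserved([0.9] * 5 + [0.1] * 5, [0.1] * 5 + [0.9] * 5)
```

The transition probabilities control smoothing: values of `stay_c` and `stay_n` close to 1 favor long conserved and nonconserved stretches over column-by-column switching.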
![Page 28: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/28.jpg)
Larger genomes have more of the constrained DNA in noncoding regions
Siepel et al. 2005, Genome Research
![Page 29: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/29.jpg)
Some constrained introns are editing complementary regions: GRIA2
Siepel et al. 2005, Genome Research
![Page 30: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/30.jpg)
3’UTRs can be highly constrained over large distances
Siepel et al. 2005, Genome Research
3’ UTRs contain RNA processing signals, miRNA targets, other regions subject to constraints
![Page 31: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/31.jpg)
Ultraconserved elements = UCEs
• At least 200 bp with no interspecies differences
– Bejerano et al. (2004) Science 304:1321-1325
– 481 UCEs with no changes among human, mouse and rat
– Also conserved out to dog and chicken
– More highly conserved than vast majority of coding regions
• Most do not code for protein
– Only 111 out of 481 overlap with protein-coding exons
– Some are developmental enhancers.
– Nonexonic UCEs tend to cluster in introns or in vicinity of genes encoding transcription factors regulating development
– 88 are more than 100 kb away from an annotated gene; may be distal enhancers
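The defining criterion, at least 200 bp with no differences between species, amounts to finding long runs of identical, gap-free columns in a multiple alignment. A minimal sketch, assuming the aligned sequences are given as equal-length uppercase strings with "-" for gaps (the function name and toy data are illustrative, not the procedure of Bejerano et al.):

```python
def identical_runs(seqs, min_length=200):
    """Return half-open (start, end) column ranges that are identical,
    gap-free in every sequence, and at least `min_length` columns long."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")

    runs, run_start = [], None
    for i in range(length + 1):  # one extra step to close a run at the end
        same = (
            i < length
            and seqs[0][i] != "-"
            and all(s[i] == seqs[0][i] for s in seqs)
        )
        if same:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                runs.append((run_start, i))
            run_start = None
    return runs


# Toy three-species alignment: 300 columns with a single mismatch at column 250.
human = "ACGT" * 75
mouse = human[:250] + "T" + human[251:]   # human has G at column 250
rat = human
toy_runs = identical_runs([human, mouse, rat])
```

Only the first 250 columns qualify here; the 49 identical columns after the mismatch fall below the 200 bp threshold.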
![Page 32: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/32.jpg)
GO category analysis of UCE-associated genes
• Genes in which a coding exon overlaps a UCE
– 91 Type I genes
– RNA binding and modification
– Transcriptional regulation
• Genes in the vicinity of a UCE (no overlap of coding exons)
– 211 Type II genes
– Transcriptional regulation
– Developmental regulators
Bejerano et al. (2004) Science
![Page 33: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/33.jpg)
Intronic UCE in SOX6 enhances expression in melanocytes in transgenic mice
Pennacchio et al., http://enhancer.lbl.gov/
[Figure legend: UCEs, Tested UCEs]
![Page 34: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/34.jpg)
The most stringently conserved sequences in eukaryotes are mysteries
• Yeast MATa2 locus
– Most conserved region in 4 species of yeast
– 100% identity over 357 bp
– Role is not clear
• Vertebrate UCEs
– More constrained than exons in vertebrates
– Noncoding UCEs are not detectable outside chordates, whereas coding regions are
  • Were they fast-evolving prior to vertebrate/invertebrate divergence?
  • Are they chordate innovations? Where did they come from?
– Role of many is not clear; need for 100% identity over 200 bp is not obvious for any
  • What molecular process requires strict invariance for at least 200 nucleotides?
  • One possibility: Multiple, overlapping functions
![Page 35: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/35.jpg)
Use measures of alignment texture to discriminate functional classes of DNA
• Mouse Cons track (L-scores) are measures of alignment quality.
– Match > Mismatch > Gap
• Alternatively, can analyze the patterns within alignments (texture) to try to distinguish among functional classes
– Regulatory regions vs bulk DNA
– Patterns are short strings of matches, mismatches, gaps
– Find frequencies for each string using training sets
  • 93 known regulatory regions
  • 200 ancestral repeats (neutral)
• Regulatory potential genome-wide
– Elnitski et al. (2003) Genome Research 13: 64-72.
![Page 36: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/36.jpg)
Evaluate patterns in alignments to discriminate functional classes of DNA
1. Collapse the alignment to a small alphabet, e.g.
Match involving G or C = S
Match involving A or T = W
Transition = I
Transversion = V
Gap = G
Alignment
seq1 G T A C C T A C T A C G C A
seq2 G T G T C G - - A G C C C A
Collapsed alphabet S W I I S V G G V I S V S W
2. Is a pattern, e.g., SWIIS followed by V, found more frequently in alignments of known cis-regulatory modules (set of 93) or neutral DNA (200 ancestral repeats)?
3. The regulatory potential for any alignment is a log-likelihood estimate of the extent to which its patterns are more like those in regulatory regions than in neutral DNA.
Example frequency ratios shown on the slide: (5/10) / (1/6) = 3; (1/4) / (2/8) = 1; (1/4) / (3/6) = 0.5
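Step 1, collapsing a pairwise alignment to the five-letter alphabet (S, W, I, V, G), can be sketched directly from the definitions on this slide. The function names are illustrative; the example input is the seq1/seq2 alignment shown above.

```python
# Purine-purine (A/G) and pyrimidine-pyrimidine (C/T) changes are transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def collapse_column(x, y):
    """Map one alignment column to S, W, I, V, or G."""
    x, y = x.upper(), y.upper()
    if x == "-" or y == "-":
        return "G"                        # gap in either sequence
    if x == y:
        return "S" if x in "GC" else "W"  # match on G/C, else match on A/T
    if frozenset((x, y)) in TRANSITIONS:
        return "I"                        # transition
    return "V"                            # transversion


def collapse_alignment(seq1, seq2):
    """Collapse two aligned, equal-length sequences into the small alphabet."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(collapse_column(x, y) for x, y in zip(seq1, seq2))


collapsed = collapse_alignment("GTACCTACTACGCA", "GTGTCG--AGCCCA")
```

The result reproduces the collapsed string on the slide, so patterns such as SWIIS followed by V can then be counted as ordinary substrings.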
![Page 37: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/37.jpg)
Regulatory potential (RP) to distinguish functional classes
![Page 38: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/38.jpg)
Good performance of regulatory potential (RP) for finding cis-regulatory modules
Taylor et al. (2006) Genome Research, in press (October or November)
![Page 39: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/39.jpg)
Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can rescue by expressing an estrogen-responsive form of GATA-1
Rylski et al., Mol Cell Biol. 2003
![Page 40: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/40.jpg)
Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes
![Page 41: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/41.jpg)
Conservation of predicted binding sites for transcription factors
Binding site for GATA-1
See poster from Yuepin Zhou, Yong Cheng, Hao Wang et al.
![Page 42: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/42.jpg)
preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids
![Page 43: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/43.jpg)
preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome
![Page 44: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/44.jpg)
Examples of validated preCRMs
![Page 45: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/45.jpg)
Correlation of Enhancer Activity with RP Score
![Page 46: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/46.jpg)
Validation status for 99 tested fragments
![Page 47: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/47.jpg)
preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated
![Page 48: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/48.jpg)
Conclusions
• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).
• Patterns in alignments and conservation of some TFBSs can be used to predict some cis-regulatory elements.
• The predictions of cis-regulatory elements for erythroid genes are validated at a good rate.
• Databases and servers such as the UCSC Table Browser, Galaxy, and others provide access to these data.
– http://genome.ucsc.edu/
– http://www.bx.psu.edu/
![Page 49: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/49.jpg)
Many thanks …
Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
Alignments, chains, nets, browsers, ideas, … Webb Miller, Jim Kent, David Haussler
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
![Page 50: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/50.jpg)
Regulatory Potential (RP) features
Computation of 2-way RP score using 5-symbol, 5th order Markov model
Alignment
Hum G T A C C T A C T A C C C A
Mus G T G T C G - - A G C C C A
[Figure: each alignment column labeled with one of five symbols: MAT, MGC, V, T, GAP]
Positive training set: 93 known CRMs
                      MAT  MGC  V  T  GAP
MAT-MAT-MAT-MAT-MAT   *    *    *  *  *
MAT-MAT-MAT-MAT-MGC   *    *    *  *  *
..
MAT-T-T-MGC-V         *    *    *  *  0.001
..
Negative training set: 200 ancestral repeats
                      MAT  MGC  V  T  GAP
MAT-MAT-MAT-MAT-MAT   *    *    *  *  *
MAT-MAT-MAT-MAT-MGC   *    *    *  *  *
..
MAT-T-T-MGC-V         *    *    *  *  0.0001
..
A score matrix is formed by taking the log-odds ratio
                      MAT  MGC  V  T  GAP
MAT-MAT-MAT-MAT-MAT   *    *    *  *  *
MAT-MAT-MAT-MAT-MGC   *    *    *  *  *
..
MAT-T-T-MGC-V         *    *    *  *  ln(10)
..
To measure how much more likely an alignment is regulatory as compared with neutral, the log-odds ratios for each symbol over the entire length of the alignments are summed and normalized for the length of the alignments
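The RP computation described on this slide (train Markov models on a positive and a negative set, take log-odds, sum along the alignment, normalize by length) can be sketched as follows. This is an illustration under stated assumptions: the toy training sequences, the add-one pseudocount, the order of 2 used in the example (the slide uses order 5), and all function names are hypothetical and not the published RP implementation.

```python
import math
from collections import defaultdict

SYMBOLS = ("MAT", "MGC", "V", "T", "GAP")


def train(sequences, order, pseudocount=1.0):
    """Return a function giving P(symbol | previous `order` symbols).

    Probabilities are estimated from counts with a hypothetical
    pseudocount so that unseen contexts still get a nonzero value.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1.0

    def prob(context, symbol):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(SYMBOLS)
        return (row.get(symbol, 0.0) + pseudocount) / total

    return prob


def rp_score(seq, pos_prob, neg_prob, order):
    """Sum per-symbol log-odds and normalize by the number of scored symbols."""
    scored = len(seq) - order
    if scored <= 0:
        raise ValueError("sequence is shorter than the model order")
    total = 0.0
    for i in range(order, len(seq)):
        context = tuple(seq[i - order:i])
        total += math.log(pos_prob(context, seq[i]) / neg_prob(context, seq[i]))
    return total / scored


# Toy training data standing in for the 93 CRMs and 200 ancestral repeats.
positive = [["MGC", "MAT", "MGC", "MAT", "MGC", "MAT"]]
negative = [["T", "V", "T", "V", "T", "V"]]
ORDER = 2  # kept small for the toy data; the slide describes order 5
pos_model = train(positive, ORDER)
neg_model = train(negative, ORDER)

regulatory_like = rp_score(positive[0], pos_model, neg_model, ORDER)
neutral_like = rp_score(negative[0], pos_model, neg_model, ORDER)

# The single-entry example on the slide: 0.001 versus 0.0001 gives ln(10).
slide_entry = math.log(0.001 / 0.0001)
```

Positive scores mean the symbol patterns look more like the regulatory training set, negative scores more like the neutral set; dividing by length keeps long and short alignments comparable.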
![Page 51: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/51.jpg)
Finding and analyzing genome data
NCBI Entrez http://www.ncbi.nlm.nih.gov
Ensembl/BioMart http://www.ensembl.org
UCSC Table Browser http://genome.ucsc.edu
Galaxy http://www.bx.psu.edu
![Page 52: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/52.jpg)
Browsers vs Data Retrieval
• Browsers are designed to show selected information on one locus or region at a time.
– UCSC Genome Browser
– Ensembl
• Run on top of databases that record vast amounts of information.
• Sometimes need to retrieve one type of information for many genomic intervals or genome-wide.
• Access this by querying on the tables in the databases or “data marts”
– UCSC Table Browser
– EnsMart or BioMart
– Entrez at NCBI
![Page 53: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/53.jpg)
Retrieve all the protein-coding exons in humans
![Page 54: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/54.jpg)
Galaxy: Data retrieval and analysis
• Data can be retrieved from multiple external sources, or uploaded from user’s computer
• Hundreds of computational tools
– Data editing
– File conversion
– Operations: union, intersection, complement …
– Compute functions on data
– Statistics
– EMBOSS tools for sequence analysis
– PHYLIP tools for molecular evolutionary analysis
– PAML to compute substitutions per site
• Add your own tools
![Page 55: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/55.jpg)
Galaxy via Table Browser: coding exons
![Page 56: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/56.jpg)
Retrieve human mutations
![Page 57: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/57.jpg)
Find exons with human mutations: Intersection
![Page 58: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/58.jpg)
Compute length using “expression”
![Page 59: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/59.jpg)
Statistics on exon lengths
![Page 60: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/60.jpg)
Plot a histogram of exon lengths
![Page 61: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/61.jpg)
Distribution of (human mutation) exon lengths
![Page 62: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/62.jpg)
What is that really long exon? Sort by length
![Page 63: Comparative Genomics](https://reader036.fdocuments.net/reader036/viewer/2022082207/568167f8550346895ddd75d6/html5/thumbnails/63.jpg)
SACS has an 11kb exon
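The Galaxy walkthrough on the preceding slides (retrieve exons, intersect with mutations, compute lengths, summarize, sort) can be mirrored in a few lines of plain Python. This is only a sketch of the same operations: the interval tuples, names, and coordinates are made up for illustration, coordinates are assumed to be half-open as in BED files, and none of it is Galaxy's own code.

```python
import statistics


def intersect(exons, mutations):
    """Return the exons that overlap at least one mutation interval."""
    hits = []
    for chrom, start, end, name in exons:
        if any(
            m_chrom == chrom and m_start < end and start < m_end
            for m_chrom, m_start, m_end in mutations
        ):
            hits.append((chrom, start, end, name))
    return hits


def with_length(intervals):
    """Append a length column (end - start), like a computed expression."""
    return [(chrom, start, end, name, end - start)
            for chrom, start, end, name in intervals]


# Hypothetical coding exons and mutation positions.
exons = [
    ("chr13", 100, 250, "exonA"),
    ("chr13", 1000, 12000, "exonB"),
    ("chr2", 50, 90, "exonC"),
]
mutations = [("chr13", 120, 121), ("chr13", 5000, 5001), ("chr7", 10, 11)]

hit_lengths = with_length(intersect(exons, mutations))
mean_length = statistics.mean(row[4] for row in hit_lengths)
by_length = sorted(hit_lengths, key=lambda row: row[4], reverse=True)
longest = by_length[0]
```

Sorting by the computed length column is what surfaces an unusually long exon, the step that leads to the 11 kb exon in the talk.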