
Comparative Genomics I: Tools for comparative genomics

Ross Hardison, Penn State University
James Taylor, Courant Institute, New York University

Major collaborators: Webb Miller, Francesca Chiaromonte, Laura Elnitski, David King, et al., PSU

David Haussler, Jim Kent, Univ. California at Santa Cruz
Ivan Ovcharenko, Lawrence Livermore National Lab

CSH Nov. 11, 2006

Major goals of comparative genomics

• Identify all DNA sequences in a genome that are functional
  – Selection to preserve function
  – Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of sequence
• Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research


Three major classes of evolution

• Neutral evolution
  – Acts on DNA with no function
  – Genetic drift allows some random mutations to become fixed in a population
• Purifying (negative) selection
  – Acts on DNA with a conserved function
  – Signature: Rate of change is significantly slower than that of neutral DNA
  – Sequences with a common function in the species examined are under purifying (negative) selection
• Darwinian (positive) selection
  – Acts on DNA in which changes benefit an organism
  – Signature: Rate of change is significantly faster than that of neutral DNA
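The three signatures can be illustrated with a minimal sketch that compares a region's substitution rate with a neutral rate. The function name, the binomial z-score and the cutoff of 2.0 are illustrative assumptions, not a method taken from the slides; real analyses typically use likelihood-based tests.

```python
import math

def classify_selection(substitutions, sites, neutral_rate, z_cutoff=2.0):
    """Label a region by comparing its substitution rate with the neutral rate.

    z_cutoff is an arbitrary illustrative threshold (assumption).
    """
    if sites <= 0 or not 0 < neutral_rate < 1:
        raise ValueError("need sites > 0 and 0 < neutral_rate < 1")
    observed_rate = substitutions / sites
    # Standard deviation of the observed rate if the region were neutral
    # (binomial approximation).
    sd = math.sqrt(neutral_rate * (1 - neutral_rate) / sites)
    z = (observed_rate - neutral_rate) / sd
    if z <= -z_cutoff:
        return "purifying"   # significantly slower than neutral
    if z >= z_cutoff:
        return "positive"    # significantly faster than neutral
    return "neutral"

slow = classify_selection(5, 100, 0.3)
same = classify_selection(30, 100, 0.3)
fast = classify_selection(60, 100, 0.3)
```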

Ideal case for interpretation

[Figure: similarity (y-axis) plotted against position along chromosome (x-axis), with three kinds of region labeled.]

• Neutral DNA
• Negative (purifying) selection: exonic segments coding for regions of a polypeptide with common function in two species.
• Positive (adaptive) selection: exonic segments coding for regions of a polypeptide in which change is beneficial to one of the two species.

DCODE.org Comparative Genomics: Align your own sequences

blastZ multiZ and TBA

zPicture interface for aligning sequences


Automated extraction of sequence and annotation

Pre-computed alignment of genomes

• blastZ for pairwise alignments
• multiZ for multiple alignment
  – Human, chimp, mouse, rat, chicken, dog
  – Also multiple fly, worm, yeast genomes
  – Organize local alignments: chains and nets
• All against all comparisons
  – High sensitivity and specificity
• Computer cluster at UC Santa Cruz
  – 1024 Pentium III CPUs
  – Job takes about half a day
• Results available at
  – UCSC Genome Browser: http://genome.ucsc.edu
  – Galaxy server: http://www.bx.psu.edu

Webb Miller

David Haussler

Jim Kent

Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research

Genome-wide local alignment chains

H: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.

blastZ: Each segment of human is given the opportunity to align with all mouse sequences.

Run blastZ in parallel for all human segments. Collect all local alignments above threshold.

Organize local alignments into a set of chains based on position in assembly and orientation.

[Figure: human and mouse sequences with level 1 and level 2 chains, assembled into a net.]
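The segmentation and chaining steps can be sketched roughly as follows. This is a simplified illustration, not the UCSC pipeline: the `Block` type, the function names and the quadratic dynamic program without gap penalties are all assumptions, and minus-strand blocks are assumed to carry mouse coordinates already expressed on the reverse strand.

```python
from typing import NamedTuple

def split_into_segments(assembly_length, segment_length):
    """Return half-open (start, end) intervals that tile the assembly."""
    return [(start, min(start + segment_length, assembly_length))
            for start in range(0, assembly_length, segment_length)]

class Block(NamedTuple):
    """One gapless local alignment (hypothetical record layout)."""
    h_start: int
    h_end: int
    m_start: int
    m_end: int
    strand: str
    score: int

def best_chain(blocks, strand="+"):
    """Highest-scoring colinear chain among blocks on one strand."""
    cand = sorted((b for b in blocks if b.strand == strand),
                  key=lambda b: b.h_start)
    if not cand:
        return []
    best = [b.score for b in cand]   # best chain score ending at each block
    prev = [-1] * len(cand)          # back-pointers for traceback
    for i, b in enumerate(cand):
        for j in range(i):
            a = cand[j]
            # a may precede b only if it ends before b starts in both genomes
            colinear = a.h_end <= b.h_start and a.m_end <= b.m_start
            if colinear and best[j] + b.score > best[i]:
                best[i] = best[j] + b.score
                prev[i] = j
    i = max(range(len(cand)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(cand[i])
        i = prev[i]
    return chain[::-1]

# 2.9 Gb in 10 Mb pieces gives 290 segments (the slide rounds to 300).
segments = split_into_segments(2_900_000_000, 10_000_000)

blocks = [
    Block(0, 100, 0, 100, "+", 50),
    Block(120, 200, 20, 80, "+", 30),    # out of order in mouse
    Block(150, 250, 160, 260, "+", 40),
    Block(300, 400, 300, 400, "-", 90),  # other orientation
]
chain = best_chain(blocks, "+")
```

Blocks left out of the top-scoring chain (here the out-of-order block) are the kind of material that ends up in lower-level chains.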

Comparative genomics to find functional sequences

[Figure: genome sizes in million base pairs (Mbp) for human, mouse and rat; the values shown are 2,900, 2,400, 2,500 and 1,200.]

• Find common sequences with blastZ and multiZ
  – All mammals: 1,000 Mbp
  – Also birds: 72 Mb
• Identify functional sequences: ~145 Mbp

Papers in Nature from the mouse, rat and chicken genome consortia, 2002, 2004


Use measures of alignment quality to discriminate functional from nonfunctional DNA

• Compute a conservation score adjusted for the local neutral rate
• Score S for a 50 bp region R is the normalized fraction of aligned bases that are identical
  – Subtract mean for aligned ancestral repeats in the surrounding region
  – Divide by standard deviation

p = fraction of aligned sites in R that are identical between human and mouse
µ = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region
n = number of aligned sites in R

Waterston et al., Nature
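The slide's equation did not survive in this transcript. A minimal sketch consistent with the definitions above (subtract the ancestral-repeat mean µ, divide by a standard deviation) is shown here, assuming the standard deviation is the binomial one, sqrt(µ(1 − µ)/n). That choice and the function name are assumptions.

```python
import math

def conservation_score(p, mu, n):
    """Normalized identity score S for a window.

    p  : fraction of aligned sites in the window that are identical
    mu : mean identity in aligned ancestral repeats nearby (local neutral level)
    n  : number of aligned sites in the window
    Assumes a binomial standard deviation, sqrt(mu * (1 - mu) / n).
    """
    if n <= 0 or not 0 < mu < 1:
        raise ValueError("need n > 0 and 0 < mu < 1")
    return (p - mu) / math.sqrt(mu * (1 - mu) / n)

# A 50 bp window at 90% identity where nearby ancestral repeats average 65%.
s_conserved = conservation_score(0.90, 0.65, 50)
# A window that matches the local neutral level scores zero.
s_neutral = conservation_score(0.65, 0.65, 50)
```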

Decomposition of conservation score into neutral and likely-selected portions

[Figure: frequency distributions of S for neutral DNA (ancestral repeats, ARs), all DNA, and likely selected DNA (at least 5-6%).]

S is the conservation score adjusted for variation in the local substitution rate. The frequency of the S score for all 50 bp windows in the human genome is shown. From the distribution of S scores in ancestral repeats (mostly neutral DNA), one can compute a probability that a given alignment could result from the locally adjusted neutral rate.

Waterston et al., Nature


DNA sequences of mammalian genomes

• Human: 2.9 billion bp, “finished”
  – High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
  – This is conserved, but not all is under selection.
• About 5-6% of the human genome has been under purifying selection since the rodent-primate divergence
• About 1.2% codes for protein
• The 4 to 5% of the human genome that is under selection but does not code for protein should have:
  – Regulatory sequences
  – Non-protein coding genes (UTRs and noncoding RNAs)
  – Other important sequences
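As a quick arithmetic check, the percentages above can be converted to megabases, assuming the 2,900 Mbp human genome size quoted earlier in the slides. The variable and function names are illustrative.

```python
GENOME_MBP = 2900  # human genome size in Mbp, as quoted in the slides

def mbp(percent):
    """Convert a percentage of the human genome into Mbp."""
    return GENOME_MBP * percent / 100

aligned_with_mouse = mbp(40)          # about 1,160 Mbp
selected = (mbp(5), mbp(6))           # about 145 to 174 Mbp
coding = mbp(1.2)                     # about 35 Mbp
# Selected but noncoding: roughly 110 to 139 Mbp (3.8 to 4.8% of the genome).
noncoding_selected = (selected[0] - coding, selected[1] - coding)
```

The lower bound of 145 Mbp matches the "~145 Mbp" of functional sequence quoted earlier.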

Leverage many species to improve accuracy and resolution of signals for constraint

ENCODE multi-species alignment group, Margulies et al., 2007


Score multi-species alignments for features associated with function

• Multiple alignment scores
  – Margulies et al. (2003) Genome Research 13: 2505-2518
  – Binomial, parsimony
• PhastCons
  – Siepel et al. (2005) Genome Research 15:1034-1050
  – Phylogenetic Hidden Markov Model
  – Posterior probability that a site is among the most highly conserved sites
• GERP
  – Cooper et al. (2005) Genome Research 15:901-913
  – Genomic Evolutionary Rate Profiling
  – Measures constraint as rejected substitutions = nucleotide substitution deficits
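The "rejected substitutions" idea can be sketched as expected minus observed substitutions per alignment column. In this toy version both quantities are simply passed in as lists. How they are estimated from a phylogenetic tree is outside the sketch, and the function names are assumptions, not the GERP software's interface.

```python
def rejected_substitutions(expected, observed):
    """Per-column substitution deficit: expected (neutral) minus observed."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    return [e - o for e, o in zip(expected, observed)]

def region_deficit(expected, observed):
    """Total deficit over a run of columns; larger means more constraint."""
    return sum(rejected_substitutions(expected, observed))

# Three columns, each with 2.0 substitutions expected under neutrality.
per_column = rejected_substitutions([2.0, 2.0, 2.0], [0.5, 2.0, 2.5])
total = region_deficit([2.0, 2.0, 2.0], [0.5, 2.0, 2.5])
```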

phastCons: Likelihood of being constrained

Siepel et al. (2005) Genome Research 15:1034-1050

• Phylogenetic Hidden Markov Model
• Posterior probability that a site is among the most highly conserved sites
• Allows for variation in rates along lineages

c is “conserved” (constrained)
n is “nonconserved” (aligns but is not clearly subject to purifying selection)
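To make the two-state idea concrete, here is a toy hidden Markov model with states c and n that returns the posterior probability of c at each alignment column via the forward-backward algorithm. It is not phastCons: the emissions are reduced to "column identical or not" instead of phylogenetic likelihoods, and every parameter value and name is an illustrative assumption.

```python
def posterior_conserved(columns, p_same_c=0.9, p_same_n=0.5,
                        stay_c=0.9, stay_n=0.95):
    """Posterior P(state = c) per column for a toy two-state HMM.

    columns: sequence of bools, True if the column is identical in all species.
    State 0 is c (conserved), state 1 is n (nonconserved).
    """
    emit = [(p_same_c if x else 1 - p_same_c,
             p_same_n if x else 1 - p_same_n) for x in columns]
    trans = ((stay_c, 1 - stay_c), (1 - stay_n, stay_n))
    # Start from the stationary distribution of the two-state chain.
    pi_c = (1 - stay_n) / ((1 - stay_c) + (1 - stay_n))
    start = (pi_c, 1 - pi_c)
    size = len(emit)
    if size == 0:
        return []
    # Forward pass, rescaled at each step to avoid underflow.
    fwd = []
    for t in range(size):
        if t == 0:
            f = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            f = [emit[t][s] * sum(fwd[t - 1][r] * trans[r][s] for r in (0, 1))
                 for s in (0, 1)]
        total = sum(f)
        fwd.append([v / total for v in f])
    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(size)]
    for t in range(size - 2, -1, -1):
        b = [sum(trans[s][r] * emit[t + 1][r] * bwd[t + 1][r] for r in (0, 1))
             for s in (0, 1)]
        total = sum(b)
        bwd[t] = [v / total for v in b]
    # Combine and normalize per column.
    post = []
    for t in range(size):
        c = fwd[t][0] * bwd[t][0]
        n = fwd[t][1] * bwd[t][1]
        post.append(c / (c + n))
    return post

# 30 identical columns flanked by 10 variable columns on each side.
cols = [False] * 10 + [True] * 30 + [False] * 10
post = posterior_conserved(cols)
```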


Larger genomes have more of the constrained DNA in noncoding regions

Siepel et al. 2005, Genome Research

Some constrained introns are editing complementary regions: GRIA2

3’ UTRs can be highly constrained over large distances


3’ UTRs contain RNA processing signals, miRNA targets, other regions subject to constraints

Ultraconserved elements = UCEs

• At least 200 bp with no interspecies differences
  – Bejerano et al. (2004) Science 304:1321-1325
  – 481 UCEs with no changes among human, mouse and rat
  – Also conserved out to dog and chicken
  – More highly conserved than vast majority of coding regions
• Most do not code for protein
  – Only 111 out of 481 overlap with protein-coding exons
  – Some are developmental enhancers.
  – Nonexonic UCEs tend to cluster in introns or in vicinity of genes encoding transcription factors regulating development
  – 88 are more than 100 kb away from an annotated gene; may be distal enhancers
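The definition above (at least 200 bp with no interspecies differences) can be sketched as a scan over a gapped multiple alignment. The function name and the input layout (equal-length, uppercase aligned strings with "-" for gaps) are assumptions for illustration.

```python
def ultraconserved_runs(aligned, min_len=200):
    """Return half-open (start, end) column ranges where all sequences agree.

    aligned: list of equal-length aligned sequences (uppercase, "-" for gaps).
    Columns containing a gap never count as identical.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    run_start = None
    for i in range(length + 1):  # one extra step closes a run at the end
        same = (i < length
                and aligned[0][i] != "-"
                and all(seq[i] == aligned[0][i] for seq in aligned))
        if same and run_start is None:
            run_start = i
        elif not same and run_start is not None:
            if i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    return runs

# Toy alignment: 250 identical columns, one mismatch, then 100 identical columns.
human = "A" * 250 + "C" + "G" * 100
mouse = "A" * 250 + "T" + "G" * 100
rat = "A" * 250 + "C" + "G" * 100
runs = ultraconserved_runs([human, mouse, rat])
```

Only the 250-column run qualifies; the 100-column run is below the 200 bp minimum.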


GO category analysis of UCE-associated genes

• Genes in which a coding exon overlaps a UCE
  – 91 Type I genes
  – RNA binding and modification
  – Transcriptional regulation
• Genes in the vicinity of a UCE (no overlap of coding exons)
  – 211 Type II genes
  – Transcriptional regulation
  – Developmental regulators

Bejerano et al. (2004) Science

Intronic UCE in SOX6 enhances expression in melanocytes in transgenic mice

Pennacchio et al., http://enhancer.lbl.gov/

[Figure legend: UCEs; tested UCEs]

The most stringently conserved sequences in eukaryotes are mysteries

• Yeast MATa2 locus
  – Most conserved region in 4 species of yeast
  – 100% identity over 357 bp
  – Role is not clear
• Vertebrate UCEs
  – More constrained than exons in vertebrates
  – Noncoding UCEs are not detectable outside chordates, whereas coding regions are
    • Were they fast-evolving prior to vertebrate/invertebrate divergence?
    • Are they chordate innovations? Where did they come from?
  – Role of many is not clear; need for 100% identity over 200 bp is not obvious for any
    • What molecular process requires strict invariance for at least 200 nucleotides?
    • One possibility: Multiple, overlapping functions

Finding and analyzing genome data

NCBI Entrez: http://www.ncbi.nlm.nih.gov
Ensembl/BioMart: http://www.ensembl.org
UCSC Table Browser: http://genome.ucsc.edu
Galaxy: http://www.bx.psu.edu

Browsers vs Data Retrieval

• Browsers are designed to show selected information on one locus or region at a time.
  – UCSC Genome Browser
  – Ensembl
• Run on top of databases that record vast amounts of information.
• Sometimes need to retrieve one type of information for many genomic intervals or genome-wide.
• Access this by querying on the tables in the databases or “data marts”
  – UCSC Table Browser
  – EnsMart or BioMart
  – Entrez at NCBI

Retrieve all the protein-coding exons in humans


Galaxy: Data retrieval and analysis

• Data can be retrieved from multiple external sources, or uploaded from user’s computer
• Hundreds of computational tools
  – Data editing
  – File conversion
  – Operations: union, intersection, complement …
  – Compute functions on data
  – Statistics
  – EMBOSS tools for sequence analysis
  – PHYLIP tools for molecular evolutionary analysis
  – PAML to compute substitutions per site
• Add your own tools

Galaxy via Table Browser: coding exons


Retrieve human mutations

Find exons with human mutations: Intersection
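Outside Galaxy, the intersection step can be sketched in a few lines. This assumes BED-style tuples (chrom, start, end, name) with 0-based, half-open coordinates. The function name and the toy records are hypothetical, not real exon or mutation data.

```python
from collections import defaultdict

def exons_with_mutations(exons, mutations):
    """Return the exons that overlap at least one mutation."""
    by_chrom = defaultdict(list)
    for chrom, start, end, _name in mutations:
        by_chrom[chrom].append((start, end))
    hits = []
    for exon in exons:
        chrom, start, end, _name = exon
        # Two half-open intervals overlap if each starts before the other ends.
        if any(m_start < end and start < m_end
               for m_start, m_end in by_chrom[chrom]):
            hits.append(exon)
    return hits

# Toy records for illustration only.
exons = [
    ("chr1", 100, 200, "exonA"),
    ("chr1", 300, 400, "exonB"),
    ("chr2", 100, 200, "exonC"),
]
mutations = [("chr1", 150, 151, "mut1"), ("chr2", 500, 501, "mut2")]
mutated_exons = exons_with_mutations(exons, mutations)
```

The pairwise scan is fine for small sets; genome-wide data would normally use sorted intervals or an interval tree.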


Compute length using “expression”

Statistics on exon lengths


Plot a histogram of exon lengths

Distribution of (human mutation) exon lengths


What is that really long exon? Sort by length

SACS has an 11kb exon
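The same sequence of steps (compute length with an expression, summarize, bin into a histogram, sort by length) can be sketched in plain Python. The exon records and the 100 bp bin size are made up for illustration and do not reproduce the Galaxy results on the slides.

```python
import statistics
from collections import Counter

# Toy BED-style exons: (chrom, start, end, name), 0-based half-open.
exons = [
    ("chr1", 1000, 1150, "exon1"),
    ("chr1", 2000, 2090, "exon2"),
    ("chr2", 500, 11500, "exon3"),
    ("chr2", 20000, 20200, "exon4"),
]

# "Compute an expression": length = end - start, appended as a new column.
with_length = [(c, s, e, name, e - s) for c, s, e, name in exons]
lengths = [row[4] for row in with_length]

# Summary statistics on the length column.
summary = {
    "count": len(lengths),
    "min": min(lengths),
    "max": max(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
}

# Histogram: count exons per 100 bp bin, keyed by the bin's lower edge.
bin_size = 100
histogram = Counter((length // bin_size) * bin_size for length in lengths)

# Sort by length, longest first, to see which exon is the outlier.
by_length = sorted(with_length, key=lambda row: row[4], reverse=True)
longest_name = by_length[0][3]
```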
