Evolution and the Santa Cruz Genome Browser

Post on 06-Jan-2016


Jim Kent and the Genome Bioinformatics Group

University of California Santa Cruz

Pennsylvania State University

Typical Gene Level View:

Sialic Acid Binding/Ig-like Lectin 7

Known Gene Details Page

PDB Ribbon Diagram

4 clicks away by the wonder of the world wide web

Hox A Cluster, Many Tracks

Track Controls are Now Grouped

Packed mode saves space, makes labels easier to find.

Squished mode is ideal for ESTs and mouse/human homology

ESTs hint at a smaller version of exon 2

Publication Quality Output

Comparative Genomics

Chaining Alignments

• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.

• Local alignments tend to break at transposon insertions, inversions, duplications, etc.

• Global alignments tend to force non-homologous bases to align.

• Chaining is a rigorous way of joining together local alignments into larger structures.

Chains join together related local alignments

Protease Regulatory Subunit 3

Affine penalties are too harsh for long gaps

Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
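One way to read the "straight lines" remark, written out as a short derivation. The symbols $o$ (gap open), $e$ (gap extend), and $r$ are illustrative and do not appear on the slides. An affine score charges a gap of length $g$

$$c(g) = o + e\,g .$$

If gap lengths were geometrically distributed, $P(g) \propto r^{g}$ with $0 < r < 1$, then

$$\log P(g) = \text{const} + g \log r ,$$

which is a straight line in $g$, matching a penalty that grows linearly with gap length. The observed log counts fall off more slowly than a straight line for long gaps (with bumps at transposon sizes), so a linear penalty overcharges long gaps.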

Gaps are needed in Both Sequences in the General Case of Pair-Wise Alignment

otherwise non-homologous bases can be forced to pair

2-D histogram of observed gaps.

The horizontal axis is gaps in human, the vertical axis is gaps in mouse. The logarithm of the gap counts in bins of 10 (left) and bins of 500 (right) is plotted as levels of gray, with black representing the highest counts. Note the concentration of gaps along the axes, particularly for shorter gaps.

Before and After Chaining

Chaining Algorithm

• Input: blocks of gapless alignments from blastz.

• Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$

• Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
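The recurrence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the browser's implementation: it uses a plain O(N²) double loop instead of the KD-tree search, the `Block` type and the linear `gap_cost` are hypothetical, and it adds the base case that a block may start a chain on its own.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """A gapless alignment block (hypothetical layout, half-open coordinates)."""
    t_start: int   # start in the target genome (e.g. human)
    t_end: int
    q_start: int   # start in the query genome (e.g. mouse)
    q_end: int
    match: float   # score of the gapless block itself


def gap_cost(prev: Block, cur: Block) -> float:
    """Hypothetical gap cost: a constant plus a term linear in total gap size."""
    dt = cur.t_start - prev.t_end
    dq = cur.q_start - prev.q_end
    return 1.0 + 0.05 * (dt + dq)


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return the best chain score and its blocks using the O(N^2) recurrence."""
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    n = len(blocks)
    if n == 0:
        return 0.0, []
    score = [b.match for b in blocks]          # base case: block starts a chain
    back: List[Optional[int]] = [None] * n
    for i in range(n):
        for j in range(i):
            pj, bi = blocks[j], blocks[i]
            # A predecessor must end before the current block in both genomes.
            if pj.t_end <= bi.t_start and pj.q_end <= bi.q_start:
                cand = score[j] + bi.match - gap_cost(pj, bi)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    end = max(range(n), key=score.__getitem__)
    chain: List[Block] = []
    k: Optional[int] = end
    while k is not None:                       # follow back-pointers
        chain.append(blocks[k])
        k = back[k]
    chain.reverse()
    return score[end], chain
```

The double loop makes the dependency on all earlier blocks explicit; the KD-tree mentioned above serves to avoid visiting most of those predecessors.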

Netting Alignments

• Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.

• Net finds the best mouse match for each human region.

• Highest scoring chains are used first.

• Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
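The greedy idea in the bullets above can be sketched as follows. This is a simplified illustration, not the actual netting program: chains are reduced to a name, a score, and their aligned intervals on the human axis, and the `level` rule (one deeper than the deepest already-placed chain whose span contains the new pieces) is an assumption made for the sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]                     # half-open [start, end)
Chain = Tuple[str, float, List[Interval]]      # (name, score, aligned blocks)


def subtract(piece: Interval, claimed: List[Interval]) -> List[Interval]:
    """Parts of `piece` not covered by `claimed` (sorted, disjoint intervals)."""
    start, end = piece
    out: List[Interval] = []
    for c_start, c_end in claimed:
        if c_end <= start or c_start >= end:
            continue                           # no overlap with this interval
        if c_start > start:
            out.append((start, c_start))       # uncovered part before overlap
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        out.append((start, end))
    return out


def build_net(chains: List[Chain]) -> List[Tuple[str, int, List[Interval]]]:
    """Place chains from highest to lowest score; lower chains fill the gaps."""
    claimed: List[Interval] = []
    placed: List[Tuple[Interval, int]] = []    # (span, level) of placed chains
    net = []
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        kept = [p for b in sorted(blocks) for p in subtract(b, claimed)]
        if not kept:
            continue                           # fully covered by better chains
        lo, hi = kept[0][0], kept[-1][1]
        level = 1 + max(
            (lvl for (s, e), lvl in placed if s <= lo and hi <= e), default=0
        )
        net.append((name, level, kept))
        placed.append(((lo, hi), level))
        claimed = sorted(claimed + kept)       # kept pieces are disjoint
    return net
```

With a long chain that has one large gap, a shorter chain lying in that gap comes out at level 2, and a chain lying in the shorter chain's own gap at level 3, which is the hierarchy the slide describes.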

Net Focuses on Ortholog

Net highlights rearrangements

A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudogenes.

Useful in finding pseudogenes

Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics

Number of rearrangements of given type per megabase.

A Rearrangement Hot Spot

Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

year of the rat - 2008

Rat Genome

Rat/Mouse/Human Genome-Wide Multiz Alignments Available

Eye lens protein gamma crystallin A. Upstream region (on right) is highly conserved but not a CpG island. Alignments are interrupted by numerous recent transposon insertions.

Details page offers quick access to browsers on corresponding regions of other genomes. It also highlights exons in base-by-base alignments.

Zoom to Base Level

Detail near translation start of tubulin 8

Zoom to Base Level

Intron consensus sequence visible.

Zoom to Base Level

Possible alt-splice not consensus and not conserved.

Tiling the genome in Microarrays

New genes on 21 and 22?

Cross-hybridization at Work

Zoomed in on right side:

>hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgcctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcggagcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagctgcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgctggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcggctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgccctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggggtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggcagaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcctgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggctaaaagctacaagcgcctcagatataaaagactcctggacggattttcatccagcacagagcagctgaatccatatttggcagctagtggatgggataagaggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccgcggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggccggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtctgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc

200 Bases Upstream of Known Genes 5’ Extended by RNA/EST clusters

Acknowledgements

Institutions

NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide.

Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Oklahoma U and the international sequencing centers.

UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.

Individuals

Webb Miller, Chuck Sugnet, Robert Baertsch, Scott Schwartz, Fan Hsu, Terry Furey, Ross Hardison, David Haussler, Richard Gibbs, Bob Waterston, Eric Lander, Francis Collins, LaDeana Hillier, Roderic Guigo, Michael Brent, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, James Gilbert, Greg Schuler, Deanna Church, the Gene Cats.

Everyone else!

THE END

A Cautionary Note

• Infant digestive systems are very permeable and take up antibodies

• ~10% of infants are allergic to cow’s milk based formula

• These infants get soy/corn based formula

• As we engineer plants, let’s be careful what we put in infant formula

New Algorithms and Data

• ‘Chaining’ and ‘netting’ of mouse/human alignments precisely define orthology and quantify rearrangements.

• Rat genome is browsable and used in rat/mouse/human multiple alignments.

• Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.

Ideal Gap Penalties

• Would allow gaps in both sequences at once.

• Would penalize long gaps less than affine gap scores.

• Still would be quick to compute.

• We use a piecewise linear function of the sum of gap sizes plus a substantial penalty for gaps that are in both sequences at once.
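A gap cost of the kind described in the last bullet might look like the sketch below. The breakpoints in `BREAKS`, the `BOTH_PENALTY` value, and the affine parameters used for comparison are made-up illustrative numbers, not the values used for the mouse/human alignments.

```python
import bisect

# Hypothetical (gap size, cost) breakpoints; cost grows ever more slowly.
BREAKS = [(1, 350.0), (10, 600.0), (100, 1100.0), (1000, 2000.0), (10000, 3000.0)]
BOTH_PENALTY = 500.0   # hypothetical surcharge for a gap in both sequences


def piecewise_linear(size: int) -> float:
    """Interpolate the cost of a total gap size between the breakpoints."""
    if size <= 0:
        return 0.0
    xs = [x for x, _ in BREAKS]
    ys = [y for _, y in BREAKS]
    if size <= xs[0]:
        return ys[0]
    if size >= xs[-1]:
        # Beyond the last breakpoint, continue with the final slope.
        slope = (ys[-1] - ys[-2]) / (xs[-1] - xs[-2])
        return ys[-1] + slope * (size - xs[-1])
    k = bisect.bisect_right(xs, size)          # xs[k-1] <= size < xs[k]
    x0, x1, y0, y1 = xs[k - 1], xs[k], ys[k - 1], ys[k]
    return y0 + (size - x0) * (y1 - y0) / (x1 - x0)


def gap_cost(dt: int, dq: int) -> float:
    """Cost of a gap of dt bases in one sequence and dq bases in the other."""
    cost = piecewise_linear(dt + dq)           # function of the SUM of gap sizes
    if dt > 0 and dq > 0:
        cost += BOTH_PENALTY                   # gap in both sequences at once
    return cost


def affine_cost(size: int, gap_open: float = 400.0, extend: float = 30.0) -> float:
    """Classic affine cost, for comparison only (hypothetical parameters)."""
    return gap_open + extend * size if size > 0 else 0.0
```

With these numbers a 10,000-base gap costs 3000 under the piecewise function, while the affine cost is 300,400. This shows how the piecewise function keeps long gaps affordable while staying a cheap table lookup.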