GenBank Huge amounts of data, easily accessible

[Figure: cumulative number by year, 1975 to 2005]

Rate of growth of phylogenetic knowledge

Number of papers with “molecular” and “phylogeny” in Web of Science

Number of studies in TreeBASE

Why have a phylogeny database?

• Archive data and trees (repeat old analyses with new tools)

• Synthesize new data sets and trees (supermatrices and supertrees)

• Big-scale questions (tree shape, bias in tree building methods, stability of trees over time)

• Hypothesis testing: find all phylogenies for taxa with members in Gondwana and ask whether they show similar area cladograms, amounts of sequence divergence, etc.

• Who knows… (we won’t know until we try)

Obstacles in the way

• Ontologies (consistent names for organisms, genes, and other kinds of data)

• How to store and query trees (what kind of queries do we want?)

• Summarising information in trees (supertrees) and matrices (supermatrices)

• Visualising very big trees
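
One way to make the "store and query" obstacle concrete is the sketch below. It assumes a tree is stored as child-to-parent rows (as it might be in a relational table) and shows two queries a phylogeny database would plausibly need: the most recent common ancestor of a set of taxa, and the leaves below a node. The toy tree and the function names are illustrative assumptions, not TreeBASE's actual schema.

```python
# Hypothetical toy tree stored as child -> parent rows (None marks the root).
PARENT = {
    "Aves": "Archosauria",
    "Crocodylia": "Archosauria",
    "Archosauria": "Amniota",
    "Mammalia": "Amniota",
    "Amniota": None,
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def mrca(taxa):
    """Most recent common ancestor of the given taxa."""
    paths = [path_to_root(t) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first taxon, the first shared node is the MRCA.
    for node in paths[0]:
        if node in common:
            return node
    return None


def leaves_below(node):
    """All leaf names in the subtree rooted at `node`, sorted."""
    children = [child for child, parent in PARENT.items() if parent == node]
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves_below(child))
    return sorted(found)
```

Parent pointers are simple to store but make subtree queries slow on large trees; other encodings (for example nested sets) trade insert cost for faster clade queries.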

Peruvian Diving-petrel (or, what’s in a name?)

• ITIS Pelecanoides garnotii

• NCBI Pelecanoides garnoti

• TreeBASE Pelecanoides garnoti AF076073

TreeBASE Names Project

http://darwin.zoology.gla.ac.uk/~rpage/TreeBASE/

• Aim is to map every name in TreeBASE onto a valid taxonomic name (i.e., a name in a database, or in the literature)

• Use exact-, substring-, and approximate string matching (+ BLAST)

• So far 26819 out of 35084 names mapped
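
The matching cascade listed above could look roughly like the following sketch. It uses Python's standard `difflib` for the approximate step; the `cutoff` value, the function name `match_name`, and the small `VALID` list are assumptions made for illustration, and the BLAST step is not reproduced here.

```python
import difflib

# Illustrative list of "valid" names; a real run would load these from a
# taxonomic database.
VALID = ["Pelecanoides garnoti", "Physeter catodon", "Hemideina maori"]


def match_name(query, valid_names, cutoff=0.85):
    """Return (matched_name, method), or (None, 'unmatched')."""
    lookup = {name.lower(): name for name in valid_names}
    q = " ".join(query.lower().split())  # normalise case and whitespace

    # 1. Exact match.
    if q in lookup:
        return lookup[q], "exact"

    # 2. Substring match: a valid name embedded in a longer label, such as
    #    a name followed by an accession number. Prefer the longest hit.
    hits = [key for key in lookup if key in q]
    if hits:
        return lookup[max(hits, key=len)], "substring"

    # 3. Approximate match: catches misspellings such as catadon/catodon.
    close = difflib.get_close_matches(q, list(lookup), n=1, cutoff=cutoff)
    if close:
        return lookup[close[0]], "approximate"

    return None, "unmatched"
```

Running the cheap, strict tests first keeps the fuzzy step for the leftovers, where a false match is most likely and a manual check is most useful.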

Hemideina maori (weta): 18 TreeBASE names = 1 real name

Physeter catodon (Sperm Whale): 3 TreeBASE names (catodon, catadon, macrocephalus) = 1 real name

The case of the Harp seal

TreeBASE and GenBank have harp seals under two different names; only ITIS knows that they are the same thing

• There are known knowns, things we know that we know

• There are known unknowns, things we now know we don’t know

• But there are also unknown unknowns, things we do not know we don't know

Why taxonomy matters

Searching on “Aves” in TreeBASE finds 4 studies with birds…

• Study #1: Gauthier, J., A.G. Kluge, and T. Rowe. 1988. Amniote phylogeny and the importance of fossils.

• Study #2: Harshman, J., C. J. Huddleston, J. P. Bollback, T. J. Parsons, and M. J. Braun. 2003, in press. True and False Gavials: A Nuclear Gene Phylogeny of Crocodylia.

• Hedges, S. B., K. D. Moberg, and L. R. Maxson. 1990. Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and a review of the evidence for amniote relationships.

• van Dijk, M. A. M., E. Paradis, F. Catzeflis, and W. de Jong. 1999. The virtues of gaps: Xenarthran (Edentate) monophyly supported by a unique deletion in alphaA-crystallin.

…but there are other birds in TreeBASE!

Tree space in TreeBASE (overlap = 1)

There are 24 bird studies in TreeBASE, but “tree surfing” won’t find them
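
A minimal sketch of why "tree surfing" can miss studies: if two studies are linked whenever they share at least `overlap` taxon names, then surfing from one study only ever reaches its own connected component. The study IDs and taxon sets below are hypothetical, not taken from TreeBASE.

```python
from collections import deque

# Hypothetical studies, each reduced to its set of taxon names.
STUDIES = {
    "S1": {"Gallus gallus", "Alligator mississippiensis"},
    "S2": {"Alligator mississippiensis", "Gavialis gangeticus"},
    "S3": {"Pelecanoides garnotii", "Pelecanoides urinatrix"},
}


def components(studies, overlap=1):
    """Group studies into connected components of the overlap graph.

    Two studies are linked when they share at least `overlap` taxon names.
    """
    names = sorted(studies)
    seen, groups = set(), []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:  # breadth-first walk from `start`
            current = queue.popleft()
            group.append(current)
            for other in names:
                shared = len(studies[current] & studies[other])
                if other not in seen and shared >= overlap:
                    seen.add(other)
                    queue.append(other)
        groups.append(sorted(group))
    return groups
```

In this toy data S3 is a bird study that shares no names with S1 or S2, so it sits in its own component and cannot be reached by following shared taxa from the other two.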

Fig. 1. The “data availability matrix” for green plant protein sequences from GenBank (release 132). A set of 130304 sequences for 14667 species was clustered into 61117 groups of homologous proteins by a combination of BLAST and single-linkage clustering (using the program Blastclust from the NCBI BLAST toolkit: http://www.ncbi.nlm.nih.gov/BLAST/). A column represents a protein or protein family; a row represents one of the species in the dataset; and a dot indicates the existence of a sequence for that species and protein. Species are sorted vertically by their number of sequences; the most-represented species (Arabidopsis thaliana) is at the top. Proteins are sorted horizontally by the number of taxa for which they have been sequenced; the most heavily sequenced gene (rbcL) is on the right. This figure shows the most heavily sampled corner of the data availability matrix; the remainder of the matrix is even more sparse.
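
The layout described in the caption can be sketched in a few lines. This assumes the clustering has already been done and each record is reduced to a (species, protein family) pair; the BLAST and single-linkage step is not reproduced. The records are made up for illustration, and rows are ordered by number of distinct protein families, a simplification of the caption's "number of sequences".

```python
from collections import defaultdict

# Hypothetical (species, protein family) records.
RECORDS = [
    ("Arabidopsis thaliana", "rbcL"),
    ("Arabidopsis thaliana", "matK"),
    ("Arabidopsis thaliana", "atpB"),
    ("Oryza sativa", "rbcL"),
    ("Oryza sativa", "matK"),
    ("Zea mays", "rbcL"),
]


def availability_matrix(records):
    """Return (rows, cols, grid) for a dot-plot of data availability."""
    by_species = defaultdict(set)
    by_protein = defaultdict(set)
    for species, protein in records:
        by_species[species].add(protein)
        by_protein[protein].add(species)

    # Best-sampled species at the top; best-sampled protein on the right.
    rows = sorted(by_species, key=lambda s: (-len(by_species[s]), s))
    cols = sorted(by_protein, key=lambda p: (len(by_protein[p]), p))

    # One text line per species: "." where a sequence exists, " " otherwise.
    grid = [
        "".join("." if p in by_species[s] else " " for p in cols)
        for s in rows
    ]
    return rows, cols, grid
```

With this ordering the filled cells gather in the top-right corner, which is the "most heavily sampled corner" the caption refers to.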

Arabidopsis

Seeing the tree (best seen when printed on 1.5 m wide paper…)

http://darwin.zoology.gla.ac.uk/~rpage/MyToL/www

Demo 1

Demo 2

Comparing classifications for Psocoptera

NCBI (GenBank): 9 species

Lienhard & Smithers (2002) [courtesy of Kevin Johnson]: 4363 species

www.biomoby.org

www.gmod.org

Bioinformatics envy - GenBank should NOT be our role model

From journal to database…

[Figure: cumulative number by year, 1975 to 2005]

Problem: not enough data and trees in journals make it into databases

Elsevier’s journal Molecular Phylogenetics and Evolution is a criminal waste of our efforts

Text, data, trees locked up in paper and PDF

… the database is the journal

1. Data + trees go into database

2. Text (annotation) added

3. Automatically generate a report summarising the results

4. The report is the publication (can have a DOI)

5. Open Access data and text
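
Steps 1 to 3 above can be illustrated with a small sketch: a study record holds the data and trees, annotation text is attached, and a plain-text report is generated from the record. The `Study` fields and the report layout are assumptions made for illustration; assigning a DOI (step 4) is outside the sketch, so that field is left empty.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """A hypothetical database record for one phylogenetic study."""
    title: str
    taxa: list
    trees: list           # trees as Newick strings
    annotation: str = ""  # step 2: free text added after deposition
    doi: str = ""         # step 4: not assigned in this sketch


def make_report(study):
    """Step 3: build a plain-text summary directly from the record."""
    lines = [
        study.title,
        f"Taxa ({len(study.taxa)}): " + ", ".join(sorted(study.taxa)),
        f"Trees ({len(study.trees)}):",
    ]
    lines += [f"  {tree}" for tree in study.trees]
    if study.annotation:
        lines.append("Notes: " + study.annotation)
    return "\n".join(lines)
```

Because the report is derived from the record, correcting the data or trees and regenerating the report keeps the text and the database in step.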

“Oh, the vision thing” George Bush (snr), 1987
