Sharing re-usable phylogenetic data: we're not there yet

Ross Mounce @rmounce http://orcid.org/0000-0002-3520-2046

My talk given at TDWG (Florence, Italy), 9am 31st October 2013

Page 1: Sharing re-usable phylogenetic data: we're not there yet

Sharing reusable phylogenetic data: we're not there yet

Ross Mounce

@rmounce http://orcid.org/0000-0002-3520-2046

Page 2

A talk of two halves

1.) Outlining the extent of the problem

(lack of) sharing, standards, care (?)

2.) What I'm trying to do about it:

Digging data out of PDFs

Re-releasing as

Page 3

Just ~4% of published phylogenetic studies in 2010 publicly archived their supporting phylo data in

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, & Vos R. 2012 Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis

BMC Research Notes 10.1186/1756-0500-5-574

Where's the data?

Check our data yourself on Dryad here: 10.5061/dryad.h6pf365t

Page 4

Scientists cannot be relied upon to share published data upon request

This has been known for a while now, e.g. (in psychology) Wicherts et al. 2006

But has been confirmed to be true for phylogenetics too:

Drew et al. 2013 'Lost Branches on the Tree of Life' report that just ~16% of researchers contacted supplied the requested ('published') phylo data.

My own experience tallies with this – I soon stopped bothering to try and ask people via email for a copy of their published data. It's a waste of time.

Page 5

The (Single) Supplementary Data File was a Y2K solution – a dump

Research Data

Many legacy journal supplementary data systems bury data and leave it there to decompose

Often not in a re-usable form, e.g. a lazy PDF

Sometimes 'typeset', corrupting the data

A jumble of words & data where the bit you want is on page 92 (no programmatic access)

BURIED and really not very discoverable

Do reviewers even look at it? I think not tbh

Page 6

I wasted too much of my PhD trying to get usable data to re-analyze

This is what I felt like... So I tried to do something about it...

www.supportpalaeodatarchiving.co.uk

An open letter in support of palaeontology data archiving

Which was picked up by Nature News

Which, in turn, got me in touch with:

Page 7

Part 2

Since few will help you to re-use their data

You've got to dig it out and

make it re-usable yourself

AND re-release it openly

so no-one else wastes their time doing this

Page 8

It's not just phylogenetics.

I learned from the Open Knowledge Conference (Berlin 2011) that a lot of different academic fields also seem to struggle to make re-usable published data available.

If it's a common, shared problem... why not seek a shared, cross-disciplinary solution?

Page 9

AMI (Amanuensis)

Building upon tools first developed in computational chemistry by the Murray-Rust lab

e.g.

ChemicalTagger → PhyloTagger (Entity tagging)

(Chem)PubCrawler → (Phylo)PubCrawler

(to get 10,000+ PDFs to work on)

https://bitbucket.org/nickday/pub-crawler

http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger

Open Source

Page 10

BBSRC grant approved

“PLUTo: Phyloinformatic Literature Unlocking Tools”

Software for making published phyloinformatic data discoverable, open, and reusable

...I just need to get my PhD viva done & rubber-stamped

Instructions for getting the current working setup are here (multiple repositories, dependencies & requirements!):

http://rossmounce.co.uk/2013/10/06/setting-up-ami2-on-windows/

Page 11

Evolution of ultraviolet vision in the largest avian radiation - the passerines Anders Ödeen 1* , Olle Håstad 2,3 and Per Alström 4

PDF

HTML

Styles, superscripts and diåcritics preserved!

AMI

Page 12

PDF

Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus

Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2
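To give a rough feel for what tagging taxon names in text involves, here is a minimal Python sketch. It is not how AMI or PhyloTagger works; it is a deliberately naive regular expression that assumes a binomial looks like a capitalised word followed by a lower-case word. It will happily produce false positives (e.g. any sentence-initial word followed by a lower-case word), so a real tagger needs far more than this. The function names and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Naive assumption: a binomial is a capitalised genus followed by a
# lower-case epithet of at least three letters, e.g. "Turdus iliacus".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def tag_binomials(text):
    """Return (genus, epithet) pairs for every binomial-looking match."""
    return BINOMIAL.findall(text)


def genus_counts(pairs):
    """Count how often each genus occurs, like the 'x 2' tallies on the slide."""
    return Counter(genus for genus, _ in pairs)


# Invented sample sentence, using species names that appear on the slide.
sample = ("Turdus iliacus and Sturnus vulgaris were sampled; "
          "Turdus iliacus was sequenced twice.")

pairs = tag_binomials(sample)
counts = genus_counts(pairs)
print(pairs)
print(counts)
```

The per-genus tally is what lets you summarise a paper as "Turdus x 2" rather than listing every mention.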

Page 13

Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL

Page 14

Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae

Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma

Posterior probability: 0.84, 0.91, 0.93, 0.95

Branch lengths: 23.12, 34.54, 37.21, 38.55

AMI

NexML

HTML
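Once names, support values and branch lengths are out of the figure, they need to go into a machine-readable tree format. The slide targets NexML; the sketch below swaps in the much simpler Newick format, purely to show the idea of re-assembling extracted pieces into a tree string. The `Node` class, the `to_newick` helper and, above all, the tree topology are assumptions for illustration: the pairing of genera with numbers here is arbitrary and does not come from any real tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One tree node; all fields are optional so tips and clades share a type."""
    name: str = ""
    branch_length: Optional[float] = None
    support: Optional[float] = None  # e.g. a posterior probability
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Serialise a node and its descendants as a Newick fragment (no ';')."""
    if node.children:
        inner = ",".join(to_newick(child) for child in node.children)
        # Internal nodes carry the support value as their label, if present.
        label = str(node.support) if node.support is not None else node.name
        text = f"({inner}){label}"
    else:
        text = node.name
    if node.branch_length is not None:
        text += f":{node.branch_length}"
    return text


# Made-up topology: the numbers are borrowed from the slide only as examples.
tree = Node(children=[
    Node(name="Acanthisitta", branch_length=23.12),
    Node(support=0.84, branch_length=34.54, children=[
        Node(name="Amytornis", branch_length=37.21),
        Node(name="Ailuroedus", branch_length=38.55),
    ]),
])

newick = to_newick(tree) + ";"
print(newick)
```

A NexML writer would walk the same structure but emit XML elements for nodes and edges, with the support values attached as annotations rather than squeezed into node labels.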

Page 15

Acknowledgements & Thanks

For travel & accommodation support, without which I couldn't possibly attend TDWG

For the Panton Fellowship, inspiration and support

To the organisers of both the session (Nico, Hilmar, Rutger) and the conference as a whole!

My main collaborators on PLUTo: Matthew Wills and Peter Murray-Rust