We live in an age of bioinformatics data glut . . .
David Shotton
Image BioInformatics Research Group
Department of Zoology
University of Oxford, UK
http://ibrg.zoo.ox.ac.uk
Doing more with less: data sharing and integration in an age of data glut and economic contraction
© David Shotton, 2010 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
e-mail: [email protected]
Dryad-UK Discussion Meeting
HEFCE Offices, Centre Point, London
27-28 April 2010
We live in an age of bioinformatics data glut . . .
Attwood TK et al. (2009) Calling International Rescue: knowledge lost in literature and data landslide! Biochemical Journal 424:317–333.
[Figure: Nucleic Acids Research Database Collection. Number of bioinformatics databases (y-axis, 0 to 1400) by year (x-axis, 2003 to 2010).]
Cochrane GR, Galperin MY (2010) Nucleic Acids Research 38:D1-D4
There are now over 1200 bioinformatics databases, between which data integration is difficult
For many researchers, data integration amounts to nothing more sophisticated than cutting and pasting into a Word document!
Research data – Universals and Particulars
Gene sequences and protein structures represent ‘universal truths’
The data need only be discovered once
The data are intrinsically simple and form bounded data sets
Data are cheap per bit, and re-acquisition is becoming cheaper
Public databases exist for these data (GenBank, PDB, etc.)
The whole of bioinformatics is built on their free availability
Life science research data can also be ‘particulars’, for example individual assay results, disease reports, observations, electron micrographs, videos
These data are heterogeneous and form unbounded data sets – typical of ‘long tail’ science rather than ‘big science’
Data collection is costly in human resources, and re-acquisition may be impossible, e.g. for observational data
Datasets thus often have a high intrinsic value per bit
The majority of such research datasets are never published, rotting on the abandoned hard drives of departed postdocs
In this open access age, that is little short of scandalous!
The problems of obtaining infectious disease data
Quote from Professor Angela McLean, after taking three months to amass appropriate disease incidence data for her 2007 J. Virology paper on HIV escape mutations:
“When I was a graduate student, I spent long hours in libraries copying numbers from dusty journals.
“Things have not improved much since!”
I have a particular concern about infectious disease data, since I believe that timely availability of reliable data in this domain may have an important impact on global health.
The benefits and risks of published data
The benefits of open data publication:
review and validation by others,
re-use in other contexts, and
integration with other data to create a new greater whole
Governments, funding agencies, publishers and researchers agree that the results of publicly funded research should be made publicly available
The problems of sharing data (From RIN – BL Report, November 2009 Patterns of Information Use and Exchange: Case Studies in the Life Sciences)
Ethical constraints and IPR issues
Concerns about misuse and data ownership
“As researchers, we see data as a critical part of our ‘intellectual capital’, generated by investment of time, effort and skill.”
Lack of personal attribution and credit for data publication
Difficulties in creating appropriate metadata
Appropriate repositories to archive and publish research datasets
Semantic publishing of structured research datasets
Semantic publishing is the use of simple Semantic Web technologies:
to enhance the meaning of on-line published research articles
to provide access to the articles’ published data in actionable form
to facilitate the integration of semantically related data
so that data, information and knowledge can more easily be found, extracted, combined and reused
For research datasets to be maximally useful, they have to be:
saved in machine-processable form, in conformity with appropriate Web standards (e.g. XML, RDF, OWL)
published and made freely accessible on the Web
referenced by globally unique and resolvable identifiers (e.g. DOIs)
accompanied by useful metadata based upon minimal information standards and ontologies, including provenance information
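The requirements listed above can be illustrated with a minimal sketch that writes a small metadata record for one dataset as RDF in Turtle syntax, using Dublin Core terms. This is an illustration only, not the method used by any particular repository: the function names, both DOIs, the title, the creator and the date are hypothetical placeholders.

```python
def turtle_literal(text):
    """Escape a Python string as a Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def dataset_metadata_turtle(doi, title, creator, issued, article_doi):
    """Return a Turtle description of one dataset, identified by its DOI URI."""
    subject = f"<https://doi.org/{doi}>"
    lines = [
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        subject,
        f"    dcterms:title {turtle_literal(title)} ;",
        f"    dcterms:creator {turtle_literal(creator)} ;",
        f"    dcterms:issued {turtle_literal(issued)}^^xsd:date ;",
        # Provenance: link the dataset to the article that reports it.
        f"    dcterms:isReferencedBy <https://doi.org/{article_doi}> .",
    ]
    return "\n".join(lines)


# All values below are hypothetical placeholders, not real identifiers.
record = dataset_metadata_turtle(
    doi="10.9999/example.dataset.1",
    title="Example disease incidence counts",
    creator="A. N. Author",
    issued="2010-04-27",
    article_doi="10.9999/example.article.1",
)
print(record)
```

The dataset's DOI serves as the subject URI, so the same identifier that permits citation also anchors the machine-readable metadata.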
Features of the original PLoS NTD article, relating to data
Good
The article contained a rich variety of data types (geospatial, disease incidence, serological assay, and questionnaire) presented in formats amenable to semantic enrichment (maps, bar charts, tables and graphs)
Poor
While figures and tables can be downloaded, they can be downloaded only as images!
The numerical data are not directly available in actionable form
http://dx.doi.org/10.1371/journal.pntd.0000228.x001
http://dx.doi.org/10.1371/journal.pcbi.1000361
Drosophila gene expression data exist in many databases
FlyAtlas
Data from four sources combined in an OpenFlyData window
Query for schuy over cached RDF data from FlyTED, BDGP, FlyAtlas and FlyBase
http://openflydata.org/
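The idea of querying one gene across cached data from several sources can be sketched as follows. This is a simplified stand-in, not the OpenFlyData implementation: it uses plain Python tuples rather than an RDF store, and every predicate name and value in `SOURCES` is an invented placeholder rather than real database content.

```python
from collections import defaultdict

# Each source contributes (subject, predicate, object) triples.
# All predicates and values are invented placeholders.
SOURCES = {
    "FlyBase": [("gene:schuy", "hasSymbol", "schuy")],
    "FlyAtlas": [("gene:schuy", "hasExpressionRecord", "flyatlas:placeholder-1")],
    "BDGP": [("gene:schuy", "hasInSituImage", "bdgp:placeholder-1")],
    "FlyTED": [("gene:schuy", "hasInSituImage", "flyted:placeholder-1")],
}


def build_cache(sources):
    """Merge all sources into one index: subject -> [(predicate, object, source)]."""
    cache = defaultdict(list)
    for source_name, triples in sources.items():
        for subject, predicate, obj in triples:
            # Keep the source name so provenance survives the merge.
            cache[subject].append((predicate, obj, source_name))
    return cache


def query(cache, subject, predicate=None):
    """Return cached statements about a subject, optionally filtered by predicate."""
    return [
        (p, o, src)
        for p, o, src in cache.get(subject, [])
        if predicate is None or p == predicate
    ]


cache = build_cache(SOURCES)
for predicate, obj, source in query(cache, "gene:schuy"):
    print(f"{source}: {predicate} -> {obj}")
```

Because every source describes the gene with the same subject identifier, one lookup returns statements from all four sources together, each still labelled with where it came from.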
In conclusion:
data publishing and global warming
Waiting for some international committee in Copenhagen to create the perfect solution to the data publication problem is not the way forward
Just as we can each act locally to reduce our carbon footprint,
so we can each do something personally to increase our data footprint
Each of us, whether researcher, publisher or government agency, can take responsibility for the open publication of our own research data
The important thing is to make a start!
end
Advantages of repository over supplementary data files
Dryad Suppl
Searchable: published metadata allows Google search for data files
Confirmable: author can confirm descriptive metadata terms used
Citable: unique identifiers (DOIs) permit citation of data files ?
Increased exposure of source journal articles through data citation ?
Permanent: data files securely archived in perpetuity ? ?
Linked: datasets linked to article based on them
Metadata will be available as RDF: part of the “web of linked data” ?
Curated: quality verified, stable formats used, content virus-checked ?
Ease of deposit: authors can upload multiple or zipped files ?
Updatable: new versions of data files can be added, with provenance
Embargo: can delay release of data up to one year after publication
Open access: no restrictions for users, no subscription required ?
Scalable: many journals and societies can leverage economies of scale
Convergence between journals and databases
PLoS Comp. Biol. 2005 1(3) e34
In this paper, Philip Bourne, Editor-in-Chief of PLoS Computational Biology and Co-Director of the Protein Data Bank, contends that the distinction between an on-line journal and an on-line database is diminishing
He calls for “seamless integration” between papers reporting results and the data used to compute those results
My critique of Philip Bourne’s ideas
We need to maintain a clear distinction between journal publications:
peer reviewed
immutable dated ‘versions of record’ – part of the history of science –
that provide the citable authorities for research datasets
and research databases:
that should present users with access to complete, impartial, up-to-date datasets, both for further exploration and automated data mining
with curators responsible for correction of errors after submission
Thus “seamless integration” is not desirable
Articles are rhetorical
Datasets are analytical
Researchers require the “seams” to be kept clearly visible, so they know which presuppositional spectacles to wear when reading
Nevertheless, both frictionless interoperability and reciprocal citation between papers and datasets are highly desirable