2013 02 data portal science group update -v smith

data.nhm.ac.uk NHM data portal update Part of the informatics initiative (2013-15) Vince Smith & Ben Scott


Page 1: 2013 02 data portal science group update -v smith

data.nhm.ac.uk

NHM data portal update

Part of the informatics initiative (2013-15)

Vince Smith & Ben Scott

Page 2: 2013 02 data portal science group update -v smith

The problem – research data

Hard to find, access, cite and integrate

• 45 available online (4 print only or behind paywalls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
  o >1000 sequences
  o 826 figures
  o 76 tables
  o 1 genome

• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No mechanism to integrate or version
• No way to repurpose data (retyping?)

49 NHM science group papers in last 4 weeks
Data via Carolyn Lowry e-mail, 13th Feb. 2013

Page 4: 2013 02 data portal science group update -v smith

The problem – collections data

Initial problems
• Don’t know / can’t find the website

Botany: http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
Entomology: http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
Library: http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
Mineralogy: http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
Palaeontology: http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
Zoology: http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38

Hard to find, access, cite and integrate

Page 7: 2013 02 data portal science group update -v smith

The problem – collections data

Initial problems
• Don’t know / can’t find the website
• 6 different data collections
• 23 interfaces & datasets of varying importance
• No priority to collection datasets

119 specimens
Up to 28,000,000 specimens

Hard to find, access, cite and integrate

Page 11: 2013 02 data portal science group update -v smith

The problem – collections data

Initial problems
• Don’t know / can’t find the website
• 6 different collections
• 23 interfaces & datasets of varying importance
• No priority to collection datasets
• Entomology collections don’t exist (404)
• Library doesn’t have any online collections!

Bigger issues
• Idiosyncratic browse or search
• No maps, few images & very slow
• No summary or statistics
• No download, export or custom views
• No integration with other data
• No author info or update info
• No means of specimen citation
• No exports to GBIF or associated projects

Hard to find, access, cite and integrate

The data portal must correct these issues

Page 12: 2013 02 data portal science group update -v smith

The solution – data.nhm.ac.uk portal

High level issues

Functional requirements
• A central access point for NHM research & collections data
• The capacity to store/link and describe datasets
• Integrated search & browse of datasets
• The ability to cite datasets and specimen records in datasets
• The ability to integrate collections data
• Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
• The capacity to download, export & analyse data

Principles
• Open-by-default: Capacity for embargoed and private data
• Sustainable: Self-populated by NHM staff (except collections data)

Exclusions
• Not a replacement for DAMS or KeEMu (a Web interface for these systems)
• Publications out of scope (focused on datasets)
• All annotations on data link back to the source (e.g. KeEMu)

Page 13: 2013 02 data portal science group update -v smith

The solution – data.nhm.ac.uk portal

System Overview

[Diagram: system overview]

Scope (source data)
• KeEMu (NHM)
• HerbCat (Kew)
• Other datasets: Species dictionary, initiatives, Scratchpads etc.
• User contributed datasets

File types (formats)
• DwC-A
• PhyloXML
• neXML
• Nexus
• Excel, CSV etc.

Registry (discovery & download)
• NHM specimens
• Kew specimens
• Other
• Private

Explorer
• Map view
• Table view
• Statistics view
• Analytic view

Subportals (branded slices of data)
• Subportal 1, e.g. Disease initiative
• Subportal 2, e.g. Kew / NHM Virtual Herbarium

Page 14: 2013 02 data portal science group update -v smith

Portal overview – adding data sets

Quick & easy, semi-automated workflow

1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
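The eight steps above can be pictured as filling in a single dataset record. The following is a minimal sketch only: the class name, field names and the choice of which fields are mandatory are all assumptions made for illustration, not the portal's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetDraft:
    """Hypothetical record filled in by the eight-step workflow."""
    name: str = ""                      # step 1: name the dataset
    data_file: str = ""                 # step 2: uploaded path or linked URL
    description: str = ""               # step 3: describe the data file
    themes: List[str] = field(default_factory=list)   # step 4: theme
    tags: List[str] = field(default_factory=list)     # step 4: tag
    additional_resources: List[str] = field(default_factory=list)  # step 5
    temporal_coverage: Optional[Tuple[int, int]] = None  # step 6: (start year, end year)
    geographic_coverage: str = ""       # step 7


# Assumption: only the first three steps are treated as mandatory here.
REQUIRED_FIELDS = ("name", "data_file", "description")


def missing_fields(draft: DatasetDraft) -> List[str]:
    """Return the required fields still empty before step 8 (save & finish)."""
    return [f for f in REQUIRED_FIELDS if not getattr(draft, f)]
```

A draft with only a name would report `data_file` and `description` as still missing, which is the kind of check a semi-automated workflow could run before saving.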

Page 15: 2013 02 data portal science group update -v smith

Portal overview – search interface

Discovering research data sets

[Screenshot labels: Search; Browse & search criteria; Results; Datasets matching criteria; Individual dataset; Advanced display options]

Page 16: 2013 02 data portal science group update -v smith

Portal overview – data set display

Exploring research data sets

[Screenshot labels]

Metadata about the dataset
• Name
• Geographic scope
• Tags
• “Social”
• Authors
• License
• Download
• Developer tools
• Technical info. (extracted from data file)

Page 17: 2013 02 data portal science group update -v smith

Portal overview – collections data

Main interface

[Screenshot labels]
• Zoomable map
• Applied filters
• Toggle map, table & stats views
• Search, download & display options
• No. records
• No. georef. records

Page 18: 2013 02 data portal science group update -v smith

Portal overview – collections data

Additional interfaces

[Screenshot labels]
• Collections views
• Statistical summary
• Specimen record views
• Data field mappings
• Summary preview
• Full record
• Tables
• Download

Page 19: 2013 02 data portal science group update -v smith

Portal overview

Some example data portals & software

Data.gov.uk & CKAN
• UK government data portal
• Uses CKAN, an open-source data portal platform
• Used by national & regional governments
• Links into Drupal, DataCite & NHM systems
• http://data.gov.uk & http://ckan.org/

Canadensys & CartoDB
• Canadian network of biodiversity collections
• Almost 1 million specimens, 18 datasets
• Uses CartoDB mapping solution
• Create dynamic maps, analyze and build location-aware and geospatial applications
• Widely used, cloud data storage, PostGIS
• http://data.canadensys.net & http://cartodb.com/
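As an illustration of what building on CKAN offers, CKAN sites expose an Action API whose `package_search` action searches datasets by keyword. The sketch below only builds the request URL and makes no network call. It assumes the NHM portal would keep the standard API path under its base URL, which the slides do not confirm; the helper name is made up for this example.

```python
from urllib.parse import urlencode


def ckan_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API dataset-search URL (no request is sent)."""
    # 'q' is the search string and 'rows' caps the number of results returned.
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Assumed base URL: the planned portal address from the slides.
print(ckan_search_url("http://data.nhm.ac.uk/", "herbarium", rows=5))
```

The returned URL could be passed to any HTTP client; CKAN answers with JSON describing the matching datasets.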

Page 20: 2013 02 data portal science group update -v smith

Portal development

Timeline & resources

Year 1 – Dataset discovery
• Technical & functional specification (Vizz. subcontract)
• Data workflows (KeEMu & research datasets)
• Functional alpha prototype (CKAN)

Year 2 – Visualisation
• Mapping & statistical functionality (CartoDB)
• Social and annotation functions
• Stable beta release at http://data.nhm.ac.uk

Year 3 – Citation & analysis
• DataCite DOIs on datasets & specimens
• Initial Web analytical functions (R)
• Initiative sub-portals including Virt. Herbarium

Resources
• 1x Developer (Ben Scott) for 3 years
• Vizzuality subcontract (circa £xxk - TBC)
• ICT capital, travel & software (circa £25k)

Page 21: 2013 02 data portal science group update -v smith

Portal consultation

Feedback & next steps

Initial stakeholder meetings (Feb. – May)
• ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
• Darrell Siebert and the KE EMu user group
• NHM Collections Committee & Initiative leaders
• Kew Gardens & Virtual Herbarium Reps.
• GBIF, NBN, UK DataCite team at BL, NERC
• Digital Facility Team
• Vizzuality

Wider consultation
• Example data types / sets
• Specialist search options & vocabularies
• Specialist Earth Science needs

Documentation
• Overview specification - http://goo.gl/qjioh
• Project Initiation Document - http://goo.gl/oRr2j

FEEDBACK & LINKS

Slides:
Feedback: [email protected]
Specification: http://goo.gl/qjioh
PID: http://goo.gl/oRr2j

Page 22: 2013 02 data portal science group update -v smith

Two more things

Wikipedian in Residence
• Four month post with Science Museum
• Starting March / April
• Work with NHM staff to improve Wikipedia
• Run events with NHM staff & volunteers
• Work with the GLAM group at Imperial College
• Focus on NHM science themes & specimens
• Not about promotion of “The NHM”

Biodiversity Informatics Workshop – May 2013
• One full day - date TBC
• Outputs from ViBRANT & e-Monocot
• Includes Scratchpads & the Biodiversity Data Journal
• What we do, how it’s used and where we are going
• Includes links to NHM informatics & digitisation initiatives

Page 23: 2013 02 data portal science group update -v smith

Portal overview – data citation

Unique identifiers for datasets & specimen records

Why cite data
• URLs are not persistent
  o e.g. Wren JD. URL decay in MEDLINE: a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5 (circa 40% decay)
• Measure our digital footprint
• Puts research data on par with articles
• Facilitates data mining

How to cite data
• Digital Object Identifiers (DOIs)
• Widely used & understood on articles
• Operates in collaboration with DataCite
• Part of an international consortium
• Mixes NHM data with other domains

What gets an identifier
• NHM specimen records (suffix of NHM IDs)
• NHM research datasets (files)
• Insert into publications

http://dx.doi.org/BMNH_PBI_00388325
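
A DOI name has the form prefix/suffix, where the prefix begins with "10." and is assigned to the registrant. The sketch below shows how a specimen ID could serve as the suffix, as the slide proposes. The prefix used here is a placeholder and the function name is hypothetical; neither is an NHM or DataCite value.

```python
DOI_RESOLVER = "http://dx.doi.org/"   # resolver shown on the slide
PLACEHOLDER_PREFIX = "10.5555"        # placeholder only, not an NHM prefix


def specimen_doi_url(specimen_id: str, prefix: str = PLACEHOLDER_PREFIX) -> str:
    """Combine a DOI prefix with an NHM specimen ID used as the suffix."""
    suffix = specimen_id.strip()
    if not suffix:
        raise ValueError("specimen_id must not be empty")
    if not prefix.startswith("10."):
        raise ValueError("a DOI prefix starts with '10.'")
    return f"{DOI_RESOLVER}{prefix}/{suffix}"


# Specimen ID taken from the slide's example link.
print(specimen_doi_url("BMNH_PBI_00388325"))
```

Keeping the specimen ID unchanged as the suffix means a reader can recover the catalogue identifier directly from the citation.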