BRIF workshop Toulouse 2012 Digital IDs subgroup


Transcript of BRIF workshop Toulouse 2012 Digital IDs subgroup

Page 1: BRIF workshop Toulouse 2012 Digital IDs subgroup

BRIF Digital identifiers subgroup

-- Overview --

‣Brief backgrounder on identification & digital identifiers

‣Use cases for bio-resource identification in BRIF

  ‣Digital resources: datasets, databases (Mummi)

  ‣Non-digital resources: projects, studies, cohorts [...] (Pierre)

‣Conclusions and next steps

This work is published under the Creative Commons Attribution license (CC BY: http://creativecommons.org/licenses/by/3.0/) which means that it can be freely copied, redistributed and adapted, as long as proper attribution is given.

Gudmundur A. Thorisson <[email protected]> GEN2PHEN / University of Leicester
Pierre-Antoine Gourraud <[email protected]> UCSF

Monday, 22 October 12

Page 2: BRIF workshop Toulouse 2012 Digital IDs subgroup

BRIF workshop, Toulouse Oct 22 2012

BRIF and bio-resource identification

• The identification requirement: need to identify resources in order to
  – track use/reuse and impact
  – credit those who contribute to them
• Biobanking projects have relied on:
  – Project/study/cohort names
    • Example: the GAZEL study in France (>20 years) http://www.gazel.inserm.fr
    • Challenges:
      - ad hoc agreements with research groups who reuse samples or data
      - painstaking manual searching through literature for mentions of ‘GAZEL’
      - project names are often ambiguous in global context

Page 3: BRIF workshop Toulouse 2012 Digital IDs subgroup

Page 4: BRIF workshop Toulouse 2012 Digital IDs subgroup

BRIF and bio-resource identification

• The identification requirement: need to identify resources in order to
  – track use/reuse and impact
  – credit those who contribute to them
• Example: biobanking projects frequently rely on...
  – Project/study/cohort names
    • Example: the GAZEL study in France (>20 years) http://www.gazel.inserm.fr
    • Challenges:
      - ad hoc agreements with research groups who reuse samples or data
      - painstaking manual searching through literature for mentions of ‘GAZEL’
      - project names are often ambiguous in global context
  – Citations to journal publications
    • Which paper to cite? Tricky to keep track of which citations are relevant to impact
    • Also troublesome if there is no paper to cite (e.g. for a new study)

Page 5: BRIF workshop Toulouse 2012 Digital IDs subgroup

Digital identifiers - some background

• Definition: a digital identifier is a character string used to uniquely identify i) a digital object in a computer system, or ii) a record in a computer system which describes a non-digital object

• Persistence - once assigned, identifier MUST NOT change
• Uniqueness - global scope vs local scope
  – Most ID schemes require tacit knowledge of the type of identifier to interpret
    • Example: EC grant identifiers in acknowledgement statements

Page 6: BRIF workshop Toulouse 2012 Digital IDs subgroup

This work has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.

Page 7: BRIF workshop Toulouse 2012 Digital IDs subgroup

This work has received funding under grant agreement number 200754
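
The point of the stripped-down statement above can be sketched in code: a bare local identifier only becomes unambiguous once it is qualified with the scheme that issued it. This is a minimal illustration, not an existing standard; the `QualifiedId` class and the scheme labels `ec-fp7-grant` and `some-other-registry` are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedId:
    """A local identifier paired with the scheme that gives it meaning."""
    scheme: str    # hypothetical namespace label, e.g. "ec-fp7-grant"
    local_id: str  # the identifier exactly as it appears in the text

    def __str__(self) -> str:
        return f"{self.scheme}:{self.local_id}"


# The same bare string issued under two different (hypothetical) schemes.
grant = QualifiedId("ec-fp7-grant", "200754")
other = QualifiedId("some-other-registry", "200754")

# Locally the two are indistinguishable; once qualified they are distinct.
bare_ids = {grant.local_id, other.local_id}
qualified_ids = {grant, other}

print(len(bare_ids), len(qualified_ids))
print(grant)
```

Without the scheme, a reader needs tacit knowledge (here, that the number is an EC grant agreement number) to interpret the string at all.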

Page 8: BRIF workshop Toulouse 2012 Digital IDs subgroup

Digital identifiers - some background

• Definition: a digital identifier is a character string used to uniquely identify i) a digital object in a computer system, or ii) a record in a computer system which describes a non-digital object

• Persistence - once assigned, identifier MUST NOT change
• Uniqueness - global scope vs local scope
  – Most ID schemes require tacit knowledge of the type of identifier to interpret
    • Example: EC grant identifiers
• Some problem domains require globally unique IDs
  – Example: ISBN numbers to identify books, e.g. for copyright purposes
• Some problem domains require resolvable IDs
  – Resolve = retrieve information about the thing being identified, including where to access it (for a digital object, its location on the Internet)
  – Digital Object Identifiers (DOIs) best known, but several other systems exist
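
As a rough illustration of what "resolvable" means for a DOI, the sketch below turns a DOI name into the URL a resolver would be asked for. It assumes the public `https://doi.org/` resolver and performs no network access; the helper names and the DOI `10.1234/example-dataset` are made up for the example.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # assumed public resolver endpoint

# Common ways a DOI is written in running text or reference lists.
_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' or resolver URL and return the bare DOI name."""
    doi = raw.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # A DOI name has the shape '10.<registrant>/<suffix>'.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {raw!r}")
    return doi


def resolver_url(raw: str) -> str:
    """Build the URL that a client would request to resolve the DOI."""
    return DOI_RESOLVER + quote(normalize_doi(raw), safe="/")


print(resolver_url("doi:10.1234/example-dataset"))
```

Requesting that URL is what redirects a reader to the current location of the object, which is why the identifier can stay fixed while the location changes.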

Page 9: BRIF workshop Toulouse 2012 Digital IDs subgroup

Page 10: BRIF workshop Toulouse 2012 Digital IDs subgroup

Identifier use cases in BRIF

• 3x broad categories of “stuff” to identify

i) Digital resources
Resources that actually “live” in computers (born-digital or digitized content): datasets and databases

ii) Physical resources
Resources corresponding to actual physical things: samples, groups of samples, experimental instruments, etc.

iii) Project-level and other “meta” resources
Higher-level aggregates of things: projects, organizations, consortia etc.

NB in many cases identifiers already exist for these things, but they are not exposed to the outside world in a usable form (i.e. made resolvable, citable, globally-unique).

Page 11: BRIF workshop Toulouse 2012 Digital IDs subgroup

Datasets

• Definition: a data set (or dataset) is a collection of data, often presented in tabular form but in the bio-sciences also frequently in a multitude of domain-specific formats, such as FASTA for biological sequences

• Data publication and data citation are hot topics - lots of research and infrastructure-building activity in recent years
• Emerging best practices for data citation & attribution
• Identifiers for datasets - persistent data DOIs issued via DataCite
• Little new for BRIF to add here, except to issue recommendations
  – KEY POINT: infrastructure for data preservation and access is a prerequisite for any sort of persistent bio-dataset identification scheme. Many projects don’t have this!

Page 12: BRIF workshop Toulouse 2012 Digital IDs subgroup

Data DOI scenario (simplified)

1. Research group registers a dataset and metadata in a suitable domain repository (or their own repository)

2. Repository archives dataset and assigns a DOI name to it

3. Unique DOI name is used by article authors (and others) to indicate resource reuse (ideally via formal data citation)

4. Journal article reference listings & full-text and other sources are mined to identify references to dataset and/or downloads

5. Dataset-level metrics calculated from collected data, e.g.
   - total no. citations in scholarly articles
   - no. secondary citations (citations to papers which cited the original dataset)
   - no. downloads in the last 2 years
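
Step 5 of the scenario above can be sketched as a small calculation over mined citation records and download dates. This is an illustration under stated assumptions, not a description of any existing metrics service: the `Citation` record, the `dataset_metrics` function, the sample identifiers, and the choice to approximate "2 years" as 730 days are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Citation:
    citing: str  # identifier of the citing paper
    cited: str   # identifier (e.g. DOI) of the cited dataset or paper


def dataset_metrics(dataset_doi, citations, download_dates, today):
    """Compute the three example metrics for one dataset."""
    # Papers that cite the dataset directly.
    direct = {c.citing for c in citations if c.cited == dataset_doi}
    # Citation records that point at any of those papers.
    secondary = [c for c in citations if c.cited in direct]
    # "Last 2 years" is approximated here as the last 730 days.
    cutoff = today - timedelta(days=730)
    recent = [d for d in download_dates if cutoff <= d <= today]
    return {
        "citations": len(direct),
        "secondary_citations": len(secondary),
        "downloads_last_2y": len(recent),
    }


# Made-up sample data for illustration only.
DATASET = "10.1234/example-dataset"
sample_citations = [
    Citation("paperA", DATASET),
    Citation("paperB", DATASET),
    Citation("paperC", "paperA"),
    Citation("paperD", "paperA"),
    Citation("paperD", "paperX"),
]
sample_downloads = [date(2012, 1, 5), date(2011, 6, 1), date(2009, 3, 3)]

print(dataset_metrics(DATASET, sample_citations, sample_downloads, date(2012, 10, 22)))
```

The calculation only works because every record refers to the dataset by one unique, persistent identifier; with free-text project names the first filter would already be unreliable.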

Page 13: BRIF workshop Toulouse 2012 Digital IDs subgroup

ORCID and DataCite Interoperability Network

• Persistent identifiers for connecting people and datasets
• 2y EC-funded project, 7 partners in Europe + USA
• Two main proof-of-concept pilots
  – Social Science data - use and citation of British Birth Cohort Studies
    • historical data, decades old, steadily being curated by lots of different people
    • high rate of reuse, often cited in papers
  – High-energy physics - attribution challenges
    • dealing with large no. authors on HEP papers - ‘dilution’ of the term authorship
    • Linking HEP papers to supporting datasets

http://odin-project.eu/

Page 14: BRIF workshop Toulouse 2012 Digital IDs subgroup

Databases

• Definition: an online database can be regarded as a collection of data, but made accessible in such a way that facilitates using the data to answer scientific questions, via structured querying and/or free-text searching of the data over the Internet
• Broad range, from large-scale DNA and protein sequence repositories to small locus-specific databases
  – E.g. GenBank, UniProt, GWAS Central, Ehlers-Danlos Syndrome Variant Database
• Challenges in assessing impact & attributing curators
  – Reliance on citations to database paper, if there is one (sometimes many)
    • Analyzing website traffic is another indicator - highly-accessed database =~ important
  – Database URLs sometimes change
  – Database name + URL often mentioned only in materials & methods, no citation
  – Credit via authorship impossible if there is no database journal paper

Page 15: BRIF workshop Toulouse 2012 Digital IDs subgroup

BioDBCore - global catalogue of bio-db’s

• BioDBCore aims
  – annotation - organize the bio-database ‘resourceome’
  – discovery - e.g. which protein sequence databases are available?
• Who’s behind it?
  – International Society for Biocuration
  – Resource catalogues: Bioinformatics Links, BioSiteMaps, NAR db-issue etc
  – Working group includes reps from NAR and DATABASE journals, MIBBI, Model organism db’s, others
• Catalogue will have persistent identifiers for each db entry

http://www.biosharing.org/biodbcore

Page 16: BRIF workshop Toulouse 2012 Digital IDs subgroup

Page 17: BRIF workshop Toulouse 2012 Digital IDs subgroup

•[slot in Pierre]


Page 18: BRIF workshop Toulouse 2012 Digital IDs subgroup

From Patients to BioBanks and back…

• Persistent IDs for datasets & other digital resources
  – Absolute need
• From BioresourceResearchIF to BioresourceXIF
  – More than an IP address?
• Increased need for identification of the source of information in general
  – Not only research purpose…
  – “Big data”
  – Quantified self.
• Blurring the border between: Research, data (Non-CLIA), Clinically approved, consumer centered data


Page 19: BRIF workshop Toulouse 2012 Digital IDs subgroup

[Figure: architecture diagram. Labels: Database Gateway & Computations; Reference groups of patients; Individual data; User data; Imaging; Front-end tablet Application]

Copyright © 2012 The Regents of University California, USA - All rights reserved.

Page 20: BRIF workshop Toulouse 2012 Digital IDs subgroup


Conclusions / next steps

• Complex landscape, lots of problems to tackle
• Key challenge will be to get authors to use the right identifiers
  – education, awareness, best practices, journal guidelines etc.
  – build support into tools that researchers use
• Potential outputs from BRIF subgroup, by end of GEN2PHEN
  – Continue work on whitepaper on identifiers (partially drafted earlier in the year)
  – Compile recommendations for authors & biobankers, for use cases where workable solutions exist or are emerging (data DOIs, BioDBCore)
• Need some biobanker-expert help in ID subgroup!
  – Esp. to look in-depth into study catalogues with established identifier schemes
    • International Clinical Trials Registry Platform
    • ClinicalTrials.gov
    • P3G study catalogue

Page 21: BRIF workshop Toulouse 2012 Digital IDs subgroup

Acknowledgements GEN2PHEN Consortium

http://www.gen2phen.org/about-gen2phen/partners

Prof Anthony J. Brookes Bioinformatics Group, Leicester

This work has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.

Contact me!

<[email protected]> | <[email protected]>
http://www.linkedin.com/in/mummi
http://www.twitter.com/gthorisson
http://www.gthorisson.name

Published under the CC BY license (http://creativecommons.org/licenses/by/3.0/)
