Metadata Analyser: measuring metadata quality

Transcript of Metadata Analyser: measuring metadata quality

Page 1: Metadata Analyser: measuring metadata quality

Metadata Analyser: measuring metadata quality

Bruno Inácio, João D. Ferreira, and Francisco M. Couto

LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal

PACBB, June 21-23, 2017

Porto, Portugal

Page 2: Metadata Analyser: measuring metadata quality

Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The Starry Messenger” or “The Herald of the Stars”), Venice, 1610.

Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542

Galileo integrated

• the direct results of his observations of Jupiter

• with careful and clear descriptions of how they were performed

From “Big” Data to Knowledge

Page 3: Metadata Analyser: measuring metadata quality

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
    <dc:description>
      Gold collar. It was made from three circular sectioned and tapering
      gold bars that are fused at the ends forming a penannular neck-ring.
    </dc:description>
    <dc:date>1250BC-800BC (circa)</dc:date>
    <dc:location>
      Sintra, Portugal
      http://yboss.yahooapis.com/geo/placefinder?woeid=748874
    </dc:location>
    <dc:type>
      Gold
      http://purl.obolibrary.org/obo/CHEBI_30050
    </dc:type>
  </rdf:Description>
</rdf:RDF>
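A minimal sketch in Python (not the authors' code) of how a metadata file like the one above can be inspected: it parses the RDF/XML with the rdflib library and reports, for each annotation, any URI embedded in its value. The file name is hypothetical.

import re
from rdflib import Graph

# Parse the RDF/XML metadata file (assumed saved locally as "sintra_collar.rdf").
graph = Graph()
graph.parse("sintra_collar.rdf", format="xml")

URI = re.compile(r"https?://\S+")

# For each annotation (triple), list any URIs embedded in its value;
# annotations that carry a URI are candidate links to ontology concepts.
for subject, predicate, value in graph:
    uris = URI.findall(str(value))
    print(predicate, "->", uris if uris else "plain-text annotation")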

Page 4: Metadata Analyser: measuring metadata quality

[Figure: a toy ontology of metals (Metal, Precious, Coinage, Gold, Silver, Palladium, Platinum, Copper) organised by is-a relations, with mappings linking its concepts to other ontologies.]

Page 5: Metadata Analyser: measuring metadata quality

Conventional Solution: proper data-sharing rules

• So let’s create some Data-sharing Policies and some Compliance and Enforcement activities

Page 6: Metadata Analyser: measuring metadata quality

Esperanto

• Created in 1887 as an easy-to-learn and politically neutral language

• But English provides a greater incentive

– Website languages, March 2014

Page 7: Metadata Analyser: measuring metadata quality

Data-sharing policies

“Adherence to data-sharing policies is as inconsistent as the policies themselves”

“351 papers covered by some data-sharing policy, only 143 fully adhered to that policy” (~40%)

“is time-consuming to do properly, the reward systems aren't there and neither is the stick”

“Of all the data that are made available, what fraction is actually used by someone else?”

Steven Wiley in Nature, 2011

http://www.nature.com/news/2011/110914/full/news.2011.536.html

Page 8: Metadata Analyser: measuring metadata quality

Human Factor

• “More often than scientists would like to admit, they cannot even recover the data associated with their own published works”

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542

Page 9: Metadata Analyser: measuring metadata quality

Goals

1. To propose two measures of metadata quality

2. To implement a tool that evaluates these measures on a public repository

3. To show that these measures are valid and significant in a real-world scientific repository

Page 10: Metadata Analyser: measuring metadata quality

Measures of metadata quality

1. Term coverage

the proportion of annotations in the metadata file that link to an ontology concept

2. Semantic specificity

the average specificity of those ontology concepts

Page 11: Metadata Analyser: measuring metadata quality

Term Coverage

• It is the ratio between

– the number of annotations that refer to ontology concepts

– and the total number of annotations in the metadata file (see the sketch below)
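To make the definition concrete, here is a minimal Python sketch (not the tool's actual code); the annotation list and the link-detection test are assumptions for illustration.

def term_coverage(annotations, is_linked):
    # Ratio of annotations that refer to ontology concepts
    # over the total number of annotations in the metadata file.
    if not annotations:
        return 0.0
    return sum(1 for a in annotations if is_linked(a)) / len(annotations)

# Toy usage: here an annotation counts as linked when its value embeds
# an OBO URI; the real detection logic may differ.
annotations = [
    "Gold http://purl.obolibrary.org/obo/CHEBI_30050",
    "1250BC-800BC (circa)",
]
print(term_coverage(annotations, lambda a: "http://purl.obolibrary.org" in a))  # 0.5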

Page 12: Metadata Analyser: measuring metadata quality

Semantic specificity

• A(t) is the number of ascendant concepts up from t

• and D(t) is the average distance between t and all its leaf descendants (a toy example follows)
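A toy Python sketch of A(t) and D(t) over a small is-a hierarchy, stored as a child-to-parents map; the hierarchy is illustrative, and the way the two values are combined into a single specificity score is defined in the paper and not reproduced here.

from statistics import mean

# Toy is-a hierarchy: Metal is the root; Gold and Silver are leaves under Precious.
parents = {
    "Precious": ["Metal"],
    "Gold": ["Precious"],
    "Silver": ["Precious"],
}
children = {}
for child, ps in parents.items():
    for p in ps:
        children.setdefault(p, []).append(child)

def ascendants(t):
    # A(t): every concept reachable upwards from t through is-a links.
    seen, stack = set(), list(parents.get(t, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

def leaf_distances(t, depth=0):
    # Distances from t to each of its leaf descendants; D(t) is their average.
    kids = children.get(t, [])
    if not kids:
        return [depth]
    return [d for k in kids for d in leaf_distances(k, depth + 1)]

print(len(ascendants("Precious")), mean(leaf_distances("Precious")))  # 1 1.0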

Page 13: Metadata Analyser: measuring metadata quality

Metadata Analyser Architecture

1. An interface layer that interacts with the user by requesting a metadata file, informing the user on the analysis progress, and outputting the result

2. An application layer that analyses the metadata file and evaluates the annotations found therein

3. A data layer that holds the ontologies in local databases

4. A web API layer that connects the interface layer to the application layer, coded in commonly used web technologies (sketched below)
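A minimal sketch of how these layers could fit together, assuming Python and Flask for the web API layer; the endpoint name, upload field, and the stand-in analysis function are illustrative, not the project's actual code.

from flask import Flask, request, jsonify

app = Flask(__name__)

def analyse_metadata(text):
    # Stand-in for the application layer: the real tool extracts annotations
    # and checks them against the ontologies held in the data layer.
    annotations = [line for line in text.splitlines() if line.strip()]
    linked = [a for a in annotations if "http://" in a or "https://" in a]
    coverage = len(linked) / len(annotations) if annotations else 0.0
    return {"term_coverage": coverage}

@app.route("/analyse", methods=["POST"])
def analyse():
    # The interface layer uploads a metadata file; the result is returned as JSON.
    text = request.files["metadata"].read().decode("utf-8")
    return jsonify(analyse_metadata(text))

if __name__ == "__main__":
    app.run()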

Page 14: Metadata Analyser: measuring metadata quality

Case Study: MetaboLights

• a database of metabolomics experiments

• developed by the EBI since 2012

• Evaluation

– computing the measures on all the resources

– manually validating a selection of resources

– comparing metadata quality before and after a curation step by experts

Pages 15–17: Metadata Analyser: measuring metadata quality

[Results figures]
Page 18: Metadata Analyser: measuring metadata quality

Manual Evaluation

Lower coverage was observed because not all ontologies used to annotate the resources were included in the local database

Page 19: Metadata Analyser: measuring metadata quality

Pre- and post-curation analysis

Page 20: Metadata Analyser: measuring metadata quality

Human Factor

Researchers:

1. may not know the ontologies that contain the concepts they need

2. do not know the structure of the ontologies well enough to annotate with appropriately specific terms

3. lack the skills to carry out the annotation process, given the technical difficulties associated with this task

4. do not consider data sharing to be relevant

5. consider that the cost of ensuring proper semantic integration outweighs the benefits

Page 21: Metadata Analyser: measuring metadata quality

Conclusions

• an apparent correlation between specificity and coverage

• weak term coverage (average of 0.25)

• the two proposed measures can effectively measure the effort put into the semantic annotation of digital resources

• Metadata Analyser offers a means to measure the quality of the metadata in a repository

– 10,000 times faster than the previous work

Page 22: Metadata Analyser: measuring metadata quality

Acknowledgments

• The EBI team in charge of the development and maintenance of MetaboLights, for their support in this study.

Software:

https://github.com/lasigeBioTM/MetadataAnalyser