Data Standards for Data Integration - NIH Common Fund

Stephan Schürer. SPARC Workshop, NIH, Feb 25-26 2015.

Transcript of Data Standards for Data Integration - NIH Common Fund

  • Data Standards for Data Integration

    Stephan Schürer, University of Miami

    SPARC Workshop, NIH, Feb 25-26 2015

    [Title slide word cloud of topic terms, including controlled vocabulary, taxonomies, thesauri, semantic web, RDF, OWL, XML, meta-data, annotation, biological assay, high-throughput screening (HTS), small molecule, PubChem, ChemBank, PDSP]

  • Types of data standards

    Reporting guideline (checklist): specifies what information needs to be captured about an experiment for a particular purpose

    (Controlled) vocabulary: terminological resource that provides the identification and definition of entities

    Data exchange format: a specification of how data are encoded to be computer-readable / -processable

    Data structure: the organization of data, data schema, entity relations
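As a rough illustration of how three of these standard types can interact, the sketch below checks a record against a checklist and a controlled vocabulary, then encodes it in an exchange format (JSON). The field names, term IDs, and the `DET:` prefix are hypothetical and not taken from any published standard.

```python
import json

# Hypothetical reporting checklist: fields every record must contain.
REQUIRED_FIELDS = {"assay_name", "detection_method", "target"}

# Hypothetical controlled vocabulary: term ID -> human-readable label.
DETECTION_VOCAB = {
    "DET:0001": "fluorescence",
    "DET:0002": "luminescence",
}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    missing = sorted(REQUIRED_FIELDS - record.keys())
    problems = [f"missing field: {name}" for name in missing]
    method = record.get("detection_method")
    if method is not None and method not in DETECTION_VOCAB:
        problems.append(f"unknown vocabulary term: {method}")
    return problems


def to_exchange_format(record):
    """Encode the record as JSON with sorted keys (the exchange format)."""
    return json.dumps(record, sort_keys=True)


record = {
    "assay_name": "caspase activity",
    "detection_method": "DET:0001",
    "target": "caspase",
}
incomplete = {"assay_name": "viability", "detection_method": "absorbance"}

print(validate(record))
print(validate(incomplete))
print(to_exchange_format(record))
```

The checklist says what must be present, the vocabulary says which values are allowed, and the exchange format says how the result is written down; the data structure is the shape of the record itself.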

  • BioSharing Standards (http://www.biosharing.org)

    [Screenshot of the BioSharing standards catalogue]

    BioSharing standards have been partly compiled by linking to BioPortal, MIBBI and the Equator Network.

    Or you can filter on MIBBI Foundry reporting guidelines or OBO Foundry terminology artifacts.

    Standard types: reporting guideline, exchange format, terminology artifact

    Filter shown: reporting guideline (showing records 1-50 of 69), for example:

    AMIS: Article Minimum Information Standard

    ARRIVE: Animals in Research: Reporting In Vivo ..

    BioDBCore: Core Attributes of Biological Databases

  • Checklists: Minimum Information Guidelines

    [Screenshot of the MIBBI portal: Bioscience reporting guidelines and tools]

    Minimum Information guidelines from diverse bioscience communities

    If you want to register your checklist to MIBBI, please contact the BioSharing team.

    Excel spreadsheet and XML document (schema) describing all registered projects

    Bioscience projects registered with MIBBI:

    CIMR: Core Information for Metabolomics Reporting

    GIATE: Guidelines for Information About Therapy Experiments

    MIABE: Minimal Information About a Bioactive Entity

    MIABiE: Minimum Information About a Biofilm Experiment

    MIACA: Minimal Information About a Cellular Assay

    MIAME: Minimum Information About a Microarray Experiment

    MIAPA: Minimum Information About a Phylogenetic Analysis

    MIAPAR: Minimum Information About a Protein Affinity Reagent

    MIAPE: Minimum Information About a Proteomics Experiment

    MIAPepAE: Minimum Information About a Peptide Array Experiment

    MIARE: Minimum Information About a RNAi Experiment

    MIASE: Minimum Information About a Simulation Experiment

    MIASPPE: Minimum Information About Sample Preparation for a Phosphoproteomics Experiment

    MIATA: Minimum Information About T Cell Assays

    MICEE: Minimum Information about a Cardiac Electrophysiology Experiment

    MIDE: Minimum Information required for a DMET Experiment

    MIFlowCyt: Minimum Information for a Flow Cytometry Experiment

  • Minimum Information Standard may not exist

    RegenBase: Integration of diverse data related to nerve regeneration in the context of spinal cord injury

    http://regenbase.org

  • Vocabulary vs. ontology: Controlled vocabularies / thesauri

    Describe what things mean (link terms to human description)

    Entities with identity criteria

    Share knowledge in a common language

    Natural language synonyms for search and text mining
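The role of synonyms in search can be sketched as a lookup from any known label to one preferred term. The vocabulary entries below are hypothetical examples chosen for illustration, not an excerpt from a real thesaurus.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms.
VOCABULARY = {
    "high-throughput screening": ["HTS", "high throughput screen"],
    "cyclic AMP": ["cAMP", "cyclic adenosine monophosphate"],
}


def build_index(vocabulary):
    """Map every label (preferred or synonym, lowercased) to its preferred term."""
    index = {}
    for preferred, synonyms in vocabulary.items():
        for label in [preferred, *synonyms]:
            index[label.lower()] = preferred
    return index


INDEX = build_index(VOCABULARY)


def normalize(term):
    """Return the preferred term for a label, or None if the label is unknown."""
    return INDEX.get(term.strip().lower())


print(normalize("HTS"))
print(normalize("camp"))
print(normalize("unlisted term"))
```

A search tool that normalizes both the query and the indexed text this way finds records regardless of which synonym the author used.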

  • Vocabulary vs. ontology: Ontologies

    Contain entities (classes) and their relationships (object properties)

    Capture / abstract knowledge using logical axioms

    Explicit specification (OWL-DL)

    Building formal (computable) models

    Computing with knowledge (reasoning engines)

    Foundation of Semantic Web information systems
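To give a feel for "computing with knowledge", the sketch below infers class membership by following is-a links transitively. It is a deliberately tiny stand-in: an OWL-DL reasoner handles far richer axioms, and the class names and single-parent hierarchy here are assumptions made for the example.

```python
# Hypothetical class hierarchy, child -> parent ("is-a").
# Simplification: each class has at most one parent.
SUBCLASS_OF = {
    "caspase activity assay": "enzyme activity assay",
    "enzyme activity assay": "biological assay",
    "viability assay": "biological assay",
}


def ancestors(cls):
    """Return all superclasses of cls, nearest first, by following is-a links."""
    result = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result


def is_a(cls, candidate):
    """True if cls equals candidate or candidate is an inferred superclass."""
    return cls == candidate or candidate in ancestors(cls)


print(ancestors("caspase activity assay"))
print(is_a("caspase activity assay", "biological assay"))
print(is_a("viability assay", "enzyme activity assay"))
```

Nothing states directly that a caspase activity assay is a biological assay; that fact is derived from two asserted links, which is the simplest form of the inference a reasoning engine performs.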

  • Ontology resources

    NCBO BioPortal

    EBI Ontology Lookup Service (OLS)

    OBO Foundry

  • Metadata specifications

    Metadata: Data not directly measured in an experiment (or obtained in a study)

    Why metadata:

    Facilitate data replicability, reproducibility, reuse

    Interpret results, perform data analysis, generate hypotheses

    Repurpose data for other projects

    Information systems (search, query, data integration and exchange)

    What metadata to capture in a standardized format with controlled vocabularies (and formal descriptions)?

  • A useful distinction of metadata

    Model metadata:

    Required to understand, interpret, and meaningfully integrate experimental results

    Typically queryable in software systems

    Important parameters to describe conclusions (data visualizations)

    Confounder metadata:

    Non-model metadata required to replicate and reproduce experimental results

    Needed for data forensics (e.g. batches of reagents, maintenance of experimental equipment, etc.)
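One way to make this distinction concrete in software is to keep the two kinds of metadata in separate structures and expose only model metadata to queries. The sketch below assumes hypothetical field names (`cell_line`, `compound`, `reagent_batch`, and so on); they are illustrative and not part of any cited standard.

```python
from dataclasses import dataclass


@dataclass
class ModelMetadata:
    """Queryable parameters needed to interpret and integrate results."""
    cell_line: str
    compound: str
    concentration_um: float


@dataclass
class ConfounderMetadata:
    """Details kept for replication and data forensics, not for routine queries."""
    reagent_batch: str
    instrument_last_serviced: str


@dataclass
class ExperimentRecord:
    model: ModelMetadata
    confounders: ConfounderMetadata


def find(records, **criteria):
    """Return records whose model metadata matches every given criterion."""
    return [
        r for r in records
        if all(getattr(r.model, key) == value for key, value in criteria.items())
    ]


records = [
    ExperimentRecord(
        ModelMetadata("cell-line-A", "compound-1", 10.0),
        ConfounderMetadata("batch-17", "2015-01-12"),
    ),
    ExperimentRecord(
        ModelMetadata("cell-line-B", "compound-1", 10.0),
        ConfounderMetadata("batch-18", "2015-01-12"),
    ),
]

matches = find(records, cell_line="cell-line-A", compound="compound-1")
print(len(matches), matches[0].confounders.reagent_batch)
```

Confounder metadata still travels with each record, so it is available when a discrepancy needs to be traced back to a reagent batch or an instrument.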

  • Standardized metadata

    Capture all (detailed descriptions, SOP)

    Make model metadata explicit (controlled vocabulary, standard format)

    But what is really model metadata? Data and informatics use cases:

    Types of queries and analyses

    Integration with other data sources

    Information systems / UI components

    Consider re-use of data for other projects

    LINCS Metadata Standards: Vempati et al., J Biomol Screen 2014

  • Data Coordination

    [Diagram: repositories connected to a central database / repository. Legible labels: Repository, Central Database / repository, Standards, Computing, Global index, Data, APIs]

  • Data set IDs and provenance

    Permanent ID via (authoritative) repository or data publication: DOI, PURL

    Capture data provenance:

    PROV-O: The PROV Ontology (W3C), http://www.w3.org/TR/prov-o/

    PAV (Provenance, Authoring and Versioning) Ontology, http://purl.org/pav/
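A minimal sketch of what a provenance statement might look like when written with PROV-O and PAV terms is shown below. It emits N-Triples-style lines using only the standard library; the `example.org` identifiers are hypothetical, and the string literal is not escaped, so this is an illustration rather than a complete serializer.

```python
PROV = "http://www.w3.org/ns/prov#"
PAV = "http://purl.org/pav/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject, predicate, obj, literal=False):
    """Format one statement; obj is a plain string literal or an IRI."""
    rendered = f'"{obj}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


def describe_derivation(derived, source, activity, version):
    """Describe a derived data set, its source, and the activity that produced it."""
    return [
        triple(derived, RDF_TYPE, PROV + "Entity"),
        triple(source, RDF_TYPE, PROV + "Entity"),
        triple(activity, RDF_TYPE, PROV + "Activity"),
        triple(activity, PROV + "used", source),
        triple(derived, PROV + "wasGeneratedBy", activity),
        triple(derived, PROV + "wasDerivedFrom", source),
        triple(derived, PAV + "version", version, literal=True),
    ]


# Hypothetical identifiers; in practice these would be DOIs or PURLs.
lines = describe_derivation(
    derived="https://example.org/data/normalized",
    source="https://example.org/data/raw",
    activity="https://example.org/run/1",
    version="1.0",
)
print("\n".join(lines))
```

Because the subjects are permanent identifiers, statements like these stay meaningful when the data set is moved or republished.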

  • Provenance

    The link between source data, computation / processing and derived data / results

    Static verifiable record

    Track changes

    Compare / discrepancies

    Repeat / reproduce

    Citation

    Version data release

    PDIFF, Woodman 2011
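Two of these ideas, a static verifiable record and comparing releases for discrepancies, can be sketched with a content hash and a set difference. This is a generic illustration and not the PDIFF method cited above; the row layout and compound IDs are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Return a SHA-256 digest over the rows, in order, as a verifiable record."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def discrepancies(old_rows, new_rows):
    """List rows that disappeared from, or appeared in, the newer release."""
    old, new = set(old_rows), set(new_rows)
    return {"removed": sorted(old - new), "added": sorted(new - old)}


# Hypothetical data releases: (compound ID, measured activity).
release_1 = [("cmpd-1", 0.42), ("cmpd-2", 0.91)]
release_2 = [("cmpd-1", 0.42), ("cmpd-2", 0.95)]

print(fingerprint(release_1) == fingerprint(release_2))
print(discrepancies(release_1, release_2))
```

Publishing the fingerprint alongside a versioned release lets anyone confirm they hold the same data that a citation refers to.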

  • RDBMS vs Semantic Web technologies

    RDBMS | Ontologies

    Closed world assumption | Open world assumption

    No reasoning support | Reasoning support

    Need to know schema for highly specialized queries | Provide restriction-free framework (formal semantics)

    Data sharing not easy, no semantics | Easy data/knowledge sharing

    Efficient RDBMS access | Triple store access

    Established technology | Relatively early stage

    Industry standard | Standards emerging
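The first row of this comparison can be made concrete with a small sketch: the same missing fact is answered "false" under the closed world assumption and "unknown" under the open world assumption. The fact store and names below are hypothetical placeholders.

```python
# Hypothetical store of asserted facts: (subject, relation, object).
FACTS = {
    ("compound_A", "inhibits", "enzyme_X"),
}


def ask_closed_world(fact):
    """Closed world: anything not asserted is treated as false."""
    return fact in FACTS


def ask_open_world(fact):
    """Open world: anything not asserted is unknown (None), not false."""
    return True if fact in FACTS else None


known = ("compound_A", "inhibits", "enzyme_X")
unstated = ("compound_A", "inhibits", "enzyme_Y")

print(ask_closed_world(known), ask_open_world(known))
print(ask_closed_world(unstated), ask_open_world(unstated))
```

The open world reading suits data integration, where the absence of a statement in one source usually means it was never tested or recorded rather than that it is false.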

  • Replicability vs Reproducibility

    [Diagram with four cases]

    Repeat: same experiment, same lab

    Replicate: same experiment, different lab

    Reproduce: same experiment, different set up

    Reuse: different experiment, some of same

    Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online

    Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
