Why Add Semantics to your Data?
Or: Eat your veggies; even a little bit is good for you
For TBPT F2F Nashville, TN
Sherri de Coronado
Dec 7, 2011
Eat Your Veggies
Semantics Topics
• Why add semantics to your data? Overview
• Continuum of adding semantics
• Use Cases
• Drivers for adding semantics
• ‘New’ techniques for knowledge discovery with and without prior annotation
• Automated annotation and enrichment analysis
• NLP
• Linked Data
• Discussion of TBPT needs
Why add semantics?
• Adding semantics is substantial effort; worth the effort?
• In your own closed environment, may not be worth the effort.
• Consensus and use of standards allow data to be collected systematically, correctly and re-usably.
• Unambiguous representation of meaning of data
• Human and computer readability
• Provenance and Data Governance!
• Difficult to add later
• Increases ability to report large volumes of data to appropriate recipients like IRB, sponsors, monitors
• Secondary use of data – Anticipate reuse
• Computer readable semantics improve ability to link to related information; lowers cost of secondary use, discovery, aggregation
Continuum of Adding Semantics
Post the DD, Schema, etc.
Standard Schema/ etc.
Standard terminology
Standard Terminology & Metadata
Examples Along the Continuum
• Publish the schema or Forms
• E.g. REDCap* forms – easy to use, easy to share, no semantics
• E.g. CAP Pathology Report forms (with schema/ values)
• Mapping or Conversion – Independent Data Models, mapping or equivalent required, can’t join tables
• E.g. C3D CDMS sharing data with BTRIS (Biomedical Translational Research Information System)
• caDSR Metadata/ data from C3D converted to BTRIS warehouse schema and terminology
• E.g. Trial registration from one DB sharing data with CTRP
• Standard Terminology
• E.g. CDISC subsets – Shared terms and meaning for reporting
• Standard Terminology and Metadata
• E.g. SDTM, ISA Tab based (MageTab, NanoTab)
• E.g. Clinical forms in caDSR with DE, DEC, properties – DE components + permissible values versioned and tied to terminology
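The last example (a data element built from components, with permissible values tied to terminology) can be sketched as a small data structure. This is only an illustration loosely following the ISO 11179 idea of a data element concept (object class + property) plus permissible values; the class names, IDs and concept codes below are hypothetical and are not caDSR's actual model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Concept:
    """A terminology concept; the codes used below are made-up placeholders."""
    code: str
    name: str


@dataclass(frozen=True)
class DataElementConcept:
    """DEC: what is being described (object class) and which aspect (property)."""
    object_class: Concept
    prop: Concept


@dataclass
class DataElement:
    """A versioned data element whose permissible values map to concepts."""
    public_id: str
    version: str
    dec: DataElementConcept
    permissible_values: Dict[str, Concept] = field(default_factory=dict)

    def is_valid(self, raw_value: str) -> bool:
        """True if the raw value is one of the permissible values."""
        return raw_value in self.permissible_values

    def meaning_of(self, raw_value: str) -> Optional[Concept]:
        """Return the concept behind a raw value, or None if not permitted."""
        return self.permissible_values.get(raw_value)


# Hypothetical example data element (IDs and codes are illustrative only).
gender_de = DataElement(
    public_id="DE-EXAMPLE-1",
    version="1.0",
    dec=DataElementConcept(Concept("EX:0100", "Patient"), Concept("EX:0200", "Gender")),
    permissible_values={
        "F": Concept("EX:0001", "Female"),
        "M": Concept("EX:0002", "Male"),
    },
)
```

Because each permissible value points at a concept rather than at a bare string, two forms that code the same answer differently can still be compared by meaning.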
What can you do with Semantics?
[Figure labels: caGrid, etc.; CancerData; LinkedCancerData Cloud]
Use Case: Secondary Use of Clinical Data - SHARP
Data Normalization through Structured Metadata and Terminology: eMERGE (project in SHARP area 4)
• EMR and Genomics – Goals: work towards standardizing phenotype information to help elucidate genotype/ phenotype relationships, leading the way toward data for large scale population studies.
• eMERGE – data normalization through structured metadata and terminology
• Multicenter study in multiple domains
• Mapped phenotype data dictionaries from 5 network sites (using caDSR, NCIt, SDTM, SNOMED)
• Built Elemap, early version of tool to help with mapping DE and value sets, a common need.
Second Approach - NLP
• cTAKES 1.2 -- Clinical Text Analysis and Knowledge Extraction System (cTAKES). October 20, 2011
• “The Mayo SHARP (SHARPn) Natural Language Processing (NLP) team is excited to announce an updated release of the Clinical Text Analysis and Knowledge Extraction System (cTAKES), cTAKESv1.2. cTAKES is a free and open source NLP system distributed by Mayo Clinic through Open Health Natural Language Processing (OHNLP) consortium which allows researchers to utilize clinical information stored in free text through NLP techniques.”
• Includes a new annotator (beta version), SideEffect, which extracts physician-asserted drug side effects from clinical notes.
• http://sourceforge.net/projects/ohnlp/files/icTAKES/
Use Case: Nanomaterials
• Interpretation of Multidisciplinary Data from Multiple Data Sources
• Nanomaterial “space” is too large to synthesize and characterize all possible materials
• Would like to be able to predict chemical, physical, biological properties of new materials
• Improved nanomaterial safety and efficacy
• Need:
• Annotation with controlled vocabulary
• Query across sources to find relevant data
• Search with controlled vocabulary terms
• Use semantic relationships to enrich data set retrieval
• Identify data sets that comply with a defined set of minimal characteristic measurements (Data QC)
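Two of the needs above, using semantic relationships to enrich retrieval and checking data sets against a minimal set of measurements, can be sketched in a few lines. Everything here is an assumption for illustration: the tiny is-a vocabulary, the data set records and the "required" measurement set are invented, and no real nanomaterial ontology or repository is being queried.

```python
# Hypothetical mini-vocabulary: child term -> parent term (is-a links).
IS_A = {
    "gold nanoparticle": "metal nanoparticle",
    "silver nanoparticle": "metal nanoparticle",
    "metal nanoparticle": "nanoparticle",
    "liposome": "nanoparticle",
}

# Hypothetical annotated data sets from different sources.
DATASETS = [
    {"id": "ds1", "material": "gold nanoparticle",
     "measurements": {"size distribution", "shape", "zeta potential"}},
    {"id": "ds2", "material": "liposome",
     "measurements": {"size distribution"}},
    {"id": "ds3", "material": "silver nanoparticle",
     "measurements": {"size distribution", "zeta potential"}},
]

# Assumed minimal characterization set, for the data QC check only.
REQUIRED = {"size distribution", "zeta potential"}


def expand(term):
    """Return the term plus all of its descendants in the is-a hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def search(term):
    """Find data sets annotated with the term or any narrower term."""
    wanted = expand(term)
    return [d["id"] for d in DATASETS if d["material"] in wanted]


def compliant(datasets, required=REQUIRED):
    """Return IDs of data sets that report every required measurement."""
    return [d["id"] for d in datasets if required <= d["measurements"]]
```

Searching for "metal nanoparticle" returns the gold and silver data sets even though neither is annotated with that exact term, which is the point of annotating with a controlled vocabulary that carries relationships.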
Data Sources and End Users
[Figure. Roles: Physicist, Clinician, Chemist, Biologist, Engineer, Regulator. Sectors: Government (EPA, NIOSH, NIST, NIH, FDA, DOE, USPTO); Industry; Academia (Stanford, Harvard, MIT/MGH, Emory/GT, OSU, etc.); SDO (ASTM, ISO, HL7, CDISC, IEEE)]
Types of Nanomaterial Data
[Figure. Roles: Physicist, Clinician, Chemist, Biologist, Engineer, Regulator]
• Identification and Composition: Naming, Composition, Surface Chemistry, Synthesis, Impurities, etc.
• Intrinsic Properties: Size Distribution, Shape, Surface Area, Porosity, Refractive Index, etc.
• Extrinsic Properties: Agglomeration, Aggregation, Stability/Degradation, Zeta Potential, Redox Potential, Catalytic Activity, etc.
• Toxicity: Cytotoxicity, Acute Toxicity, Chronic Toxicity, Genotoxicity, PK/ADME, Teratogenicity, etc.
• Environmental Fate: Transport Properties, Biotic Degradability, Abiotic Degradability, Bioaccumulation, Degradation Product Toxicity
Facilitating Nanomaterial-Based Drug Design
[Figure, adapted from ICR Nanotechnology WG. Labels: Animal Models; Cellular Models; Chemical Properties; Molecular Assays; Manufacturing Processes; Physical Properties; Nano-SAR*; Virtual Manufacturing; Virtual Nanomaterials; In Silico Studies; Manufacturing; Nanomaterials; Preclinical Studies; Clinical Studies]
* Nano-SAR: Quantitative Structure-Activity Relationships (SARs) to predict properties
Semantics: Possible Future Directions
• Automated annotations
• Lots of un-structured data out there. What can you do? (Examples)
• Nigam Shah (Stanford) – Making sense of unstructured data
• Linked Data
• Semantic Web approach
Making Sense of Un-Annotated Medical Data
(Nigam Shah – creator of NCBO Annotator)
http://www.bioontology.org/making-sense-of-unstructured-data-in-medicine
• Procedure:
• Process textual metadata to automatically tag EMR text with as many ontology terms as possible. (Very large #s of records)
• Assign Doc IDs to ontology terms
• Create enriched lexicon using fairly simple NLP/ counting/ semantic types
• Enrichment analysis a la GO, but using the automated annotations. Analyze tagged data for hypothesis generation
• Workflow was able to detect Vioxx/ MI issue with patient summary data
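The first two steps of the procedure (tag text with ontology terms, then assign document IDs to terms) can be sketched with plain dictionary matching. This is a toy stand-in, not the NCBO Annotator or its API: the lexicon, the term IDs ("EX:...") and the example notes are all invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: surface string -> ontology term ID (IDs are made up).
LEXICON = {
    "myocardial infarction": "EX:MI",
    "heart attack": "EX:MI",
    "rofecoxib": "EX:ROFECOXIB",
    "vioxx": "EX:ROFECOXIB",
}


def annotate(text):
    """Return the set of term IDs whose surface strings occur in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return {
        term_id
        for phrase, term_id in LEXICON.items()
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized)
    }


def build_index(docs):
    """Map each term ID to the set of document IDs it was found in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term_id in annotate(text):
            index[term_id].add(doc_id)
    return index


# Invented example notes.
DOCS = {
    "doc1": "Patient started on Vioxx for arthritis.",
    "doc2": "Admitted with heart attack; was taking rofecoxib.",
    "doc3": "No history of myocardial   infarction.",
}
INDEX = build_index(DOCS)
```

Note that doc3 is tagged with the MI term even though the note negates it; plain string matching does not handle negation or context, which is one reason the "fairly simple NLP" step matters in a real workflow.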
Workflow for Annotating Data
Used the method to show that the Vioxx-MI relationship could have been detected in patient data through enrichment analysis using ontologies
[Figure: patient counts]
• Vioxx Patients (1,560)
• MI Patients (1,827)
• Vioxx and MI Patients (339)
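An enrichment check of this kind asks whether the overlap between two tagged groups is larger than chance would predict. The sketch below uses a one-sided hypergeometric test, the usual choice for GO-style enrichment; that choice is an assumption here, not necessarily the method behind the slide. The slide does not give the total number of patients, so the example call uses small invented numbers rather than the counts above.

```python
from math import comb


def enrichment_ratio(n_exposed, n_outcome, n_both, n_total):
    """Observed overlap divided by the overlap expected under independence."""
    expected = n_exposed * n_outcome / n_total
    return n_both / expected


def hypergeom_p_value(n_exposed, n_outcome, n_both, n_total):
    """P(overlap >= n_both) if outcome were independent of exposure."""
    denom = comb(n_total, n_exposed)
    upper = min(n_exposed, n_outcome)
    tail = sum(
        comb(n_outcome, k) * comb(n_total - n_outcome, n_exposed - k)
        for k in range(n_both, upper + 1)
    )
    return tail / denom


# Toy numbers (invented): 100 patients, 10 on the drug, 20 with the
# outcome, 5 with both.
ratio = enrichment_ratio(10, 20, 5, 100)
p_value = hypergeom_p_value(10, 20, 5, 100)
```

To apply this to the counts on the slide, the total number of patient records searched would be needed as `n_total`.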
Linked Data
• “Linked Data refers to a set of best practices for publishing and connecting structured data on the Web”
“Linked Data - The Story So Far” 2009. Christian Bizer (Freie Universität Berlin, Germany), Tom Heath (Talis Information Ltd, UK) and Tim Berners-Lee (Massachusetts Institute of Technology, USA) DOI: 10.4018/jswis.2009081901, ISSN: 1552-6283, EISSN: 1552-6291
• Similar to web documents, but for publishing data on the web
• “Rules” for linked data (Berners-Lee 2006)
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up the names
3. When someone looks up a URI provide useful information using the standards (RDF)
4. Include links to other URIs so that they can discover more things
• Converting CDEs and terminology to an RDF triple store would make them queryable with SPARQL.
• Already substantial progress – e.g. Jim McCusker
These 4 “Gender” CDEs could be discovered to be aggregatable by leveraging their underlying concepts
[Figure: caDSR Metadata linked to EVS Metadata]
• Permissible values: “female”, “2”, “Female”, “Female”
• Value meaning concepts (http://11179.iso/valueMeaningConcept): C16576, C46110
• Data element IDs (http://11179.iso/dataElementID): 2179640, 2529081, 2200604, 2179641
• Object class concepts (http://11179.iso/objectClassConcept): Patient (C16960), Person (C25190), Participant (C29867)
• Concepts are related through http://evs.gov/rdfs:subClassOF
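The idea in the figure can be sketched with triples held in a plain Python list instead of a real RDF store. The two predicate names are taken from the figure labels; every subject and object identifier is a hypothetical placeholder, because the figure does not show which data element links to which concept. A real setup would load the triples into an RDF store and express the same lookup in SPARQL.

```python
# Predicate names follow the labels in the figure; all subject and object
# identifiers below are hypothetical placeholders, not real caDSR/EVS IDs.
VALUE_MEANING = "http://11179.iso/valueMeaningConcept"
SUBCLASS_OF = "http://evs.gov/rdfs:subClassOF"

TRIPLES = [
    ("de:A", VALUE_MEANING, "concept:Female"),
    ("de:B", VALUE_MEANING, "concept:Female"),
    ("de:C", VALUE_MEANING, "concept:FemaleGender"),
    ("de:D", VALUE_MEANING, "concept:Tumor"),
    ("concept:FemaleGender", SUBCLASS_OF, "concept:Female"),
]


def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def root_concept(concept):
    """Follow subClassOf links upward until no parent remains."""
    seen = set()
    while concept not in seen:
        seen.add(concept)
        parents = match(s=concept, p=SUBCLASS_OF)
        if not parents:
            break
        concept = parents[0][2]
    return concept


def aggregatable_groups():
    """Group data elements whose value meanings resolve to the same concept."""
    groups = {}
    for de, _, concept in match(p=VALUE_MEANING):
        groups.setdefault(root_concept(concept), []).append(de)
    return groups
```

Here de:C is grouped with de:A and de:B only because its concept is declared a subclass of theirs; without the terminology links the three would look unrelated.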
Discussion: Your current/ future semantic needs?
Questions/ Issues/ Requests for SI
• Re: Consensus and use of standards allow data to be collected systematically, correctly and re-usably.
• Issue: Lots of standards & if you want to interact with systems/ data using other standards, there’s still a lot of work
• Still better off with documented metadata/ meanings/ terminology/ provenance, more amenable to reuse
• Question – what tools / techniques are you using, what do you need? E.g. mappings, mapping tools (terminology or metadata), easier access to terminology in LexEVS e.g. by REST, etc?
• “Diagnoses” in Tissue Bank System vs CBM – translation? (Static mapping? Live mapping? Metadata and terminology transformation to a common model? Just terminology?)
Discussion: Issues/ Questions cont’d
• How to find out quickly within caTissue whether a DE already exists?
• Use caDSR API to search for DE by name using wildcard search
• Future? REST service or triple store with SPARQL access?
• Asking “how many males and females are on all trials” should return the answer even though male and female were represented differently across trials -- could do this with a service, or with linked data
• Semantics via e.g. ISA Tab metadata and terminology vs central curation of metadata (e.g. in 11179 repository)? (pros and cons)
• Needs re: NLP. In use, what kind of additional support for caBIG?
• Needs re: automated annotation?
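The "how many males and females are on all trials" question can be sketched as a count over values that have first been mapped to shared concepts. The trial names, raw codes and mappings below are invented for illustration; in practice the mappings would come from the metadata and terminology (a service or linked data), not from a hand-written table.

```python
from collections import Counter

# Hypothetical per-trial encodings mapped to shared concept labels.
VALUE_TO_CONCEPT = {
    "trial1": {"female": "Female", "male": "Male"},
    "trial2": {"2": "Female", "1": "Male"},
    "trial3": {"F": "Female", "M": "Male"},
}

# Invented records: (trial, raw value as stored by that trial).
RECORDS = [
    ("trial1", "female"),
    ("trial1", "male"),
    ("trial2", "2"),
    ("trial2", "2"),
    ("trial3", "M"),
    ("trial3", "U"),
]


def count_by_concept(records):
    """Count records per shared concept; unknown codes are kept visible."""
    counts = Counter()
    for trial, raw in records:
        concept = VALUE_TO_CONCEPT.get(trial, {}).get(raw, "Unmapped")
        counts[concept] += 1
    return counts


COUNTS = count_by_concept(RECORDS)
```

Keeping an explicit "Unmapped" bucket makes gaps in the mappings visible instead of silently dropping records.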
CBIIT Semantic Operations and Infrastructure: Contacts
• Contacts
• Sherri de Coronado, Acting Director. [email protected]
• Larry Wright, Associate Director, EVS.
• Margaret Haber, Associate Director, EVS. [email protected]
• Denise Warzel, Associate Director, Informatics Operations. [email protected]
• Dianne Reeves, Associate Director, Biomedical Data. [email protected]
• Gilberto Fragoso, Associate Director, EVS. [email protected]
• NCI/CBIIT Semantic Infrastructure/ Roadmap:
• Christo Andonyadis, Acting CTO. [email protected]
• Selected URLs
• NCIm Browser: http://ncim.nci.nih.gov
• NCI Term Browser: http://nciterms.nci.nih.gov
• NCIt Browser: http://ncit.nci.nih.gov
• CDE Browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
• VKC for LexEVS info: https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/LexBig_and_LexEVS