Semantic annotation of biomedical data
Clement Jonquet
[email protected]
INRIA - EXMO seminar - March 24th, 2010
Speech overview
- Introduction: semantic annotation, semantic web, biomedical context, the challenge
- Ontology-based annotation workflow: concept recognition, semantic expansion, why is it hard?
- Annotation services: the NCBO Annotator web service, the NCBO biomedical resources index
- Users & use cases
- Conclusion and future work
Annotation & semantic web
- Part of the vision for the semantic web
  - Web content must be semantically described using ontologies
  - Semantic annotations help to structure the web
- Annotation is not an easy task
  - Automatic vs. manual
  - Lack of annotation tools (convenient, simple to use, and easily integrated into automatic processes)
- Today's web content (& public data available through the web) is mainly composed of unstructured text
Annotation is not a common practice
- High number of ontologies
  - Getting access to all is hard: formats, locations, APIs
  - Lack of tools that easily access all ontologies (domain)
- Users do not always know the structure of an ontology's content or how to use it in order to do the annotations themselves
- Lack of tools to do the annotations automatically
- Boring additional task without immediate reward for the user
Biomedical context
- Explosion of publicly available biomedical data
  - Very diverse, growing very fast
- Most of the data are unstructured and rarely described with ontology concepts available in the domain
  - Hard for biomedical researchers to find the data they need
  - Data integration problem
  - Translational discoveries are prevented
- Good examples of the use of ontologies and terminologies for annotations
  - Gene Ontology annotations
  - PubMed (biomedical literature) indexed with MeSH headings
- Limitations
  - UMLS only, almost nothing for OBO & OWL ontologies
  - Manual approaches, curators (scalability?)
  - Automatic approaches (usability & accuracy?)
The challenge
- Automatically process a piece of raw text to annotate it with relevant ontologies
  - Large scale, to scale up for many resources and ontologies
  - Automatic, to keep precision and accuracy
  - Easy to use and to access, to prevent the biomedical community from getting lost
  - Customizable, to fit very specific needs
  - Smart, to leverage the knowledge contained in ontologies
Vocabulary
- Element = a collection of observations resulting from a biomedical experiment or study
  - a dataset, clinical-trial description, research article, imaging study
- Text metadata = the set of free text that describes or annotates an element
- Resource = a collection of elements
  - GEO, PubMed, ClinicalTrials.gov, Guideline.gov, ArrayExpress
- Concept = a unique entity (class) in a specific ontology (has a URI)
  - UMLS CUI or NCBO URI, e.g., C0025202, DOID:1909
- Term = a string that identifies a given concept (name, synonyms)
  - Melanoma, Melanomas, Malignant melanoma
- Annotation = meta-information on a piece of data: this data deals with this concept
  - PMID17984116 deals with C0025202
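The vocabulary above maps naturally onto a small data model. The Python sketch below is only an illustration: the class names, field names, and the "UMLS" label are assumptions, not part of the talk or of any NCBO API. The identifiers are the ones quoted on the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A unique entity (class) in a specific ontology."""
    ontology: str        # assumed label for the source vocabulary, e.g. "UMLS"
    concept_id: str      # e.g. a UMLS CUI such as "C0025202"
    terms: tuple = ()    # strings that identify the concept (name, synonyms)


@dataclass
class Element:
    """A collection of observations, described by free-text metadata."""
    element_id: str      # e.g. "PMID17984116"
    text_metadata: str = ""


@dataclass
class Resource:
    """A collection of elements, e.g. PubMed or GEO."""
    name: str
    elements: list = field(default_factory=list)


@dataclass(frozen=True)
class Annotation:
    """Meta-information stating that an element deals with a concept."""
    element_id: str
    concept: Concept


melanoma = Concept("UMLS", "C0025202", ("Melanoma", "Melanomas", "Malignant melanoma"))
article = Element("PMID17984116")
pubmed = Resource("PubMed", [article])
annotation = Annotation(article.element_id, melanoma)
```

Keeping `Concept` and `Annotation` immutable makes them usable as set members or dictionary keys, which is convenient when the same annotation is produced several times.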
Why use ontologies?
- They structure the knowledge from a domain
- They specify terms that can be used by natural language processing algorithms to process text
- They uniquely identify concepts (URI)
- They specify relations between concepts that can be used for computing concept similarity
- They define hierarchies allowing abstraction of type
- They play the role of common denominator for various data from a domain
Why is it a hard problem? (1/2)
- Identifying concepts from text is a hard task
  - May involve NLP, stemming, spell-checking, or recognition of morphological variants
  - Concept disambiguation
- Scalability issues
  - We want to deal with millions of concepts (~4M)
  - 200+ ontologies in several formats, spread out
  - Huge biomedical resources, e.g., PubMed with 17M citations
- What to do with annotations when the ontologies and the resources evolve over time?
  - e.g., elements in resources are added
  - e.g., concepts in ontologies are removed
Why is it a hard problem? (2/2)
How to leverage the knowledge contained in ontologies?
- Process the transitive closure for relations (not trivial for ontologies with 300k concepts)
- Execute semantic distance algorithms to determine similarity
- Compute mappings between ontologies to connect ontologies to one another
- Keep all of this up to date when ontologies evolve
  - e.g., a new GO version every day
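To make the transitive-closure point concrete, here is a minimal sketch that derives every ancestor of every concept from direct is_a links. The function name `is_a_closure` and the toy parent map are assumptions for illustration; the talk does not say how the closure is actually computed. For an ontology with hundreds of thousands of concepts, the result would more likely be precomputed and stored than rebuilt in memory like this.

```python
def is_a_closure(parents):
    """Return {concept: [(ancestor, level), ...]} from direct is_a links.

    `parents` maps each concept to the list of its direct parents.
    Level 1 is a direct parent, level 2 a grandparent, and so on.
    """
    closure = {}
    for concept in parents:
        seen = {}
        frontier = [(p, 1) for p in parents.get(concept, [])]
        while frontier:
            node, level = frontier.pop(0)  # breadth-first: nearest ancestors first
            if node in seen:
                continue  # already reached by a path that is no longer
            seen[node] = level
            frontier.extend((p, level + 1) for p in parents.get(node, []))
        closure[concept] = sorted(seen.items(), key=lambda kv: kv[1])
    return closure


# Toy hierarchy with hypothetical concept names.
parents = {
    "melanoma": ["melanocytic neoplasm"],
    "melanocytic neoplasm": ["cell proliferation disease"],
    "cell proliferation disease": [],
}
closure = is_a_closure(parents)
```

The cost grows with the number of concepts times the depth of the hierarchy, which is why the slide calls the computation non-trivial at 300k concepts.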
Ontology-based annotation workflow
- First, direct annotations are created by recognizing concepts in raw text.
- Second, annotations are semantically expanded using knowledge of the ontologies.
- Third, all annotations are scored according to the context in which they have been created.
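The third step can be pictured as a function from an annotation's context to a number. The sketch below is purely illustrative: the talk does not give the scoring rule, so the weights, the decay with hierarchy level, and the dictionary keys are all assumptions.

```python
from collections import defaultdict

# Hypothetical weights; the talk only says that scores depend on the context.
WEIGHTS = {"direct": 10, "mapping": 7}


def score(annotation):
    """Score one annotation from the context in which it was created."""
    kind = annotation["context"]
    if kind == "is_a":
        # Assumed rule: the further away the ancestor, the lower the score.
        return max(1, 6 - annotation["level"])
    return WEIGHTS[kind]


def rank(annotations):
    """Sum scores per concept and return concepts by decreasing total."""
    totals = defaultdict(int)
    for a in annotations:
        totals[a["concept"]] += score(a)
    return sorted(totals.items(), key=lambda kv: -kv[1])


annotations = [
    {"concept": "DOID:1909", "context": "direct"},
    {"concept": "DOID:191", "context": "is_a", "level": 1},
    {"concept": "DOID:0000818", "context": "is_a", "level": 2},
]
ranked = rank(annotations)
```

Under these assumed weights, a concept recognized directly in the text always outranks a concept reached only through expansion.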
Concept recognition (step 1)
- Uses a dictionary: a list of strings that identify ontology concepts
  - 220 ontologies, ~4.2M concepts & ~7.9M terms
- Uses NCIBI Mgrep, a syntactic concept recognizer
  - High degree of accuracy
  - Fast, scalable
  - Domain independent
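The idea of dictionary-based recognition can be sketched in a few lines. This is not Mgrep, whose internals are not described in the talk; it is a naive regular-expression matcher standing in for it, and the function names and the plural synonym "Melanocytes" are assumptions added for the example.

```python
import re


def build_dictionary(concepts):
    """Map each lowercased term to the concept it identifies."""
    return {term.lower(): cid for cid, terms in concepts.items() for term in terms}


def recognize(text, dictionary):
    """Return (concept_id, matched_text, start, end) for each term found."""
    # Longest terms first, so "malignant melanoma" wins over "melanoma".
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    return [
        (dictionary[m.group(1).lower()], m.group(1), m.start(), m.end())
        for m in pattern.finditer(text)
    ]


concepts = {
    "DOID:1909": ["Melanoma", "Melanomas", "Malignant melanoma"],
    "NCI/C0025201": ["Melanocyte", "Melanocytes"],  # plural form is assumed
}
text = "Melanoma is a malignant tumor of melanocytes."
hits = recognize(text, build_dictionary(concepts))
```

A single alternation pattern would not scale to millions of terms; a production recognizer would rely on a dedicated index or automaton instead.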
Semantic expansion (step 2)
- Uses is_a hierarchies defined by original ontologies
- Uses mappings in UMLS Metathesaurus and NCBO BioPortal
- Uses semantic-similarity algorithms based on the is_a graph (ongoing work)
- Components available as web services
An example
- Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye.
  - NCI/C0025201, Melanocyte in NCI Thesaurus
  - 39228/DOID:1909, Melanoma in Human Disease
- Is_a closure expansion
  - 39228/DOID:191, Melanocytic neoplasm, direct parent of Melanoma in Human Disease
  - 39228/DOID:0000818, cell proliferation disease, grandparent of Melanoma in Human Disease
- Mapping expansion
  - FMA/C0025201, Melanocyte in Foundational Model of Anatomy, concept mapped to NCI/C0025201 in UMLS
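The example can be replayed with two lookup tables. In this sketch the tables are hard-coded from the identifiers on the slide; the names `IS_A_CLOSURE`, `MAPPINGS`, and `expand`, and the tuple layout of the result, are assumptions made for illustration.

```python
# Toy knowledge copied from the slide; a real system would load it
# from the ontologies and the mapping sources.
IS_A_CLOSURE = {
    "39228/DOID:1909": [("39228/DOID:191", 1), ("39228/DOID:0000818", 2)],
}
MAPPINGS = {
    "NCI/C0025201": ["FMA/C0025201"],
}


def expand(direct):
    """Return expanded annotations as (concept, context, source) tuples."""
    expanded = []
    for concept in direct:
        # Is_a closure expansion: every ancestor of a direct concept.
        for ancestor, level in IS_A_CLOSURE.get(concept, []):
            expanded.append((ancestor, f"is_a level {level}", concept))
        # Mapping expansion: every concept mapped to a direct concept.
        for mapped in MAPPINGS.get(concept, []):
            expanded.append((mapped, "mapping", concept))
    return expanded


direct = ["NCI/C0025201", "39228/DOID:1909"]
expanded = expand(direct)
```

Recording the source concept alongside each expanded annotation keeps the provenance needed to score it later.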
- Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye.