ONTOLOGY ENRICHMENT FROM TEXT
Student Name: Nikhil Sachdeva
Roll Number: 2016061
BTP report submitted in partial fulfillment of the requirements for the Degree of B.Tech. in Computer Science & Engineering
on June 8, 2020
BTP Track: Research
BTP Advisor
Dr. Raghava Mutharaju
Indraprastha Institute of Information Technology, New Delhi
Student’s Declaration
I hereby declare that the work presented in the report entitled Ontology Learning from Text submitted by me for the partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science & Engineering at Indraprastha Institute of Information Technology, Delhi, is an authentic record of my work carried out under the guidance of Dr. Raghava Mutharaju. Due acknowledgements have been given in the report to all material used. This work has not been submitted anywhere else for the award of any other degree.
.............................. Place & Date: ..............................
Nikhil Sachdeva
Certificate
This is to certify that the above statement made by the candidate is correct to the best of my knowledge.
.............................. Place & Date: ..............................
Dr. Raghava Mutharaju
Abstract
Ontology Learning is the semi-automatic and automatic extraction of an ontology from natural language text, using text mining and information extraction. An ontology primarily represents some concepts and their corresponding relationships, both of which can be extracted from natural language text. Current state-of-the-art ontology learning systems are not able to extract union and intersection relationships between the concepts while extracting an ontology from the text. These relationships can prove to be critical in domains like the medical domain, where such expressive relationships can capture the missing knowledge of some concepts in the domain. By learning such relationships and then reasoning over the learned ontology, we might be able to answer additional queries regarding the properties of those medical concepts. In this project, I have proposed an architecture to extract these types of relations from text in the medical domain, using semantic and syntactic similarity and graph search. In the evaluation, I enrich a Disease Ontology with union and intersection axioms that are learned from the text. I have evaluated the architecture using metrics such as Precision, Recall, and F1-score, by comparing the original and the learned axioms.
Keywords: Ontology Learning, Concepts, Relations, Union, Intersection, Semantic, Syntactic,Graph, Medical Domain
Acknowledgments
I would like to thank my adviser Dr. Raghava Mutharaju for guiding me, providing me with insights and expertise, and giving me the opportunity to pursue this project. I would also like to thank Monika Jain (Ph.D. Scholar at IIIT-D) for her guidance throughout the project. Finally, I would like to thank my batchmate, Gyanesh Anand, who helped me in understanding the utility of certain tools in the medical domain.
Contents
1 Introduction 1
1.1 What is Ontology Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Ontology Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.3 Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 What is missing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Survey 4
2.1 Techniques in Ontology Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Statistic-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Linguistic-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Logic-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Axiom Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 BioWordVec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 UMLS Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 Metathesaurus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Metamap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Evaluating Existing Ontology Learning Systems . . . . . . . . . . . . . . . . . . . 8
3 Proposed Architecture 12
3.1 Generalized Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Entity Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Training of Custom Word2Vec model . . . . . . . . . . . . . . . . . . . . . 13
3.2.3 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.4 Ontology Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.5 Axiom Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Evaluations 19
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.1 Ontology Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 Corpus Extraction and Pre-processing . . . . . . . . . . . . . . . . . . . . 20
4.2 Evaluation of Entity Linking Module . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Evaluation of Complete Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Challenges 23
6 Future Scope 24
Chapter 1
Introduction
An Ontology is an explicit specification of a conceptualization, comprising a formal definition of
concepts, relations between concepts, and axioms on the relations in the domain of interest [14].
Ontologies represent the intensional aspect of a domain governing the way the corresponding
knowledge bases are populated. They have been proven useful for different applications, such
as heterogeneous data integration, information search and retrieval, question answering, and
semantic interoperability between information systems. Moreover, for advanced applications
like semantically enhanced information retrieval and question answering, there must be a clear
connection between the formal representation of the domain and the language used to express
domain meanings within the text. This task can be accomplished by producing a full-fledged
ontology for the domain of interest. This is where the challenge of ontology acquisition originates.
1.1 What is Ontology Learning?
Manually constructing ontologies is a resource-intensive and challenging task, requiring a large
amount of time and effort; the reason is the difficulty of capturing knowledge, also known as
the “knowledge acquisition bottleneck”. Much research has been devoted to this challenge, referred
to as Ontology Learning, which consists of automatically and semi-automatically creating
a domain ontology using unstructured textual data. The entire process of ontology construction
of a domain is depicted in Figure 1.1.
The process begins by extracting terms relevant to the domain, from the underlying text. Then,
these terms are combined with their synonyms, which can be captured from the text itself or
from available lexical databases like WordNet [25], to form concepts. For relations, taxonomic and
non-taxonomic relations between the concepts are extracted from the text. Additionally, based
on the concepts and relations derived, general axiom schemata can be instantiated to extract
axioms from the text.
Figure 1.1: Ontology Learning Layer Cake
1.2 Ontology Layer
1.2.1 Concept
Terms extracted from the text are combined to form concepts. These concepts are labeled and
profiled so that they correspond to classes in the ontology.
1.2.2 Relations
Taxonomic Relations
These relations represent hypernym-hyponym relations; that is, the “is a” relations between the
concepts. They are responsible for forming hierarchies in an ontology.
Non-taxonomic Relations
These relations are generated using interactions between the found concepts. A major emphasis
for these interactions is given to verbs. Relations like meronyms, that is, “part of”, roles,
attributes, possessions, etc. are covered in this category.
1.2.3 Axioms
Axioms are the rules for verifying the correctness of ontologies. They are identified based on
certain patterns and generalizations found among relations in the text.
1.3 What is missing?
Based on the previous section, it can be seen that existing learning systems extract only those
types of relations that either define the parent-child relationships between the concepts to
form hierarchies or connect the concepts based on their interactions. What has not been
considered are those types of relations that may change the class definition itself. Relations
like intersection and union between concepts differ from the relations mentioned earlier,
as they represent a unique relationship between a concept and a group of concepts. For
instance, a context mentions, “... where a Mother is a Female and a Parent, ...”. The
existing systems evaluated in Chapter 2 will identify three concepts from this context:
Mother, Female, Parent.
Moreover, they will also identify the two pairs of SubClass relations: (Female, Mother) and
(Parent, Mother). But neither of them will be able to extract the intersection relation between
the concept Mother and the group of concepts (Parent, Female). Take another example, “...
noise is defined as Sound, with non-Musical features ...”, where there is again an intersection
relation between the concept Noise and the group of concepts (Sound, not(Music)). So far,
all the systems I have reviewed have been able to extract only SubClass relations as far as
taxonomic relations between concepts are concerned. Hence, as part of my thesis, I create
algorithms to additionally extract intersection and union relations between the identified concepts.
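In description logic notation, the two examples above correspond to the following intersection axioms (⊑ for SubClassOf, ⊓ for intersection, ¬ for negation; under the stricter reading that the definitions are exact, ⊑ would become ≡):

```latex
% Intersection axioms for the two running examples
Mother \sqsubseteq Female \sqcap Parent
Noise  \sqsubseteq Sound  \sqcap \lnot Music
```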
1.4 Motivation
Specifically in the medical domain, extracting such relations between concepts can help to
identify missing properties, for example, in the case of drugs that may belong to the intersection
or union of other drugs. Another situation is where we are trying to identify a solution for a
disease D1 that is logically related as a union of some other diseases D2, D3, etc.; we can then
first apply, in the research of D1, the solutions that already exist for D2 and D3.
Chapter 2
Literature Survey
2.1 Techniques in Ontology Learning
So far, many techniques from different fields like Machine Learning, Natural Language Processing,
Data Mining, Information Retrieval, Logic, and Knowledge Representation have been
applied to the various layers of Ontology Learning. Below are some of the techniques that are
commonly used by many ontology learning systems to achieve greater levels of accuracy at each
layer of ontology learning.
2.1.1 Statistic-based Techniques
Statistic-based techniques do not take into account the underlying semantics and relations
between the components of the text. These techniques are mostly used for term, concept,
and taxonomic relation extraction.
• Term Extraction:
– C/NC Value [10]: It is used for multi-word terminology extraction; it receives
multi-word units as input and returns a score for each of them.
– Contrastive Analysis [10, 17, 28]: This technique filters out non-domain terms from
the results of term extraction, based on two measures: domain consensus and domain
relevance.
– Co-occurrence Analysis [9]: This technique scores different lexical units based on
their adjoined occurrence and gives an indirect association between various terms
and concepts.
– Clustering [9, 11]: It is an unsupervised learning technique in which similar objects
are grouped or clustered together, giving rise to many clusters.
– Term Subsumption: This technique measures how general a word is when occurring
in a corpus. It is calculated between words using their inter-conditional probability
of occurring in the given corpora.
• Relation Extraction:
– Formal Concept Analysis [10]: It uses attributes of the words themselves to
associate different terms together.
– Hierarchical Clustering [9]: This technique uses different similarity measures like Co-
sine Similarity, etc. to find taxonomic relations among the terms, then find similar
concepts based on clustering of those terms and finally, generate a hierarchy of con-
cepts.
– Association Rule Mining [9]: It is used to find associations between concepts at a
certain level of abstraction, clustering certain concepts under one category.
2.1.2 Linguistic-based Techniques
Linguistic-based techniques depend on the characteristics of a language and hence use Natural
Language Processing tools in almost all of their algorithms. They are mainly used for preprocessing
of the data as well as for term and relation extraction from the data.
• Preprocessing [9, 17]
– Parsing: It is a type of syntactic analysis that finds various dependencies between
words and represents them in the form of a parse tree.
– POS Tagging: This technique assigns Part-of-Speech tags to words at the
sentence level, to aid the analysis of other algorithms.
– Seed Words: They are domain-specific words that provide a base for other algorithms
to extract similar domain-specific terms and concepts.
• Relation Extraction [9, 11,17,26]
– Lexico-syntactic parsing: It is a rule-based approach that uses regular expressions to
extract taxonomic and non-taxonomic relations from text.
– Dependency Analysis: It uses dependency paths between the terms from the parse
trees to find potential terms and relations.
– Sub-categorization Frame: A word's sub-categorization frame is the number and
form of the words it selects when appearing in a sentence. This frame shapes
the kinds of classes with which the word usually connects.
– Semantic Lexicons: These are already existing knowledge resources in the domain of
the ontology. For instance, WordNet [25] provides a wide range of pre-defined concepts
and relations to bootstrap the process at certain levels of ontology learning.
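As a toy illustration of lexico-syntactic parsing, the well-known Hearst pattern “X such as Y” can be matched with a regular expression (a minimal sketch; the pattern and function names are my own, not from any cited system):

```python
import re

# Hearst-style lexico-syntactic pattern "X such as Y", which suggests
# that Y is a hyponym (SubClass) of the hypernym X.
PATTERN = re.compile(r"(\w[\w ]*?)\s*,?\s+such as\s+(\w[\w ]*)")

def hearst_pairs(sentence):
    """Return (hypernym, hyponym) pairs matched by the pattern."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        pairs.append((m.group(1).strip(), m.group(2).strip()))
    return pairs
```

Real systems use many such patterns ("X including Y", "Y and other X", etc.) over parsed rather than raw text; this sketch only shows the mechanism.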
2.1.3 Logic-based Techniques
These techniques are employed at the top two layers of the Ontology Learning layer cake,
that is, Axiom Schemata and General Axioms, which are then used to reason over the
ontology for correctness and recall.
• Inductive Logic Programming [15, 26]: In ILP, rules are derived from existing collections
of concepts and relations using semantic lexicons and syntactic patterns, to divide the
extracted concepts and relations into positive and negative examples.
• Logical Inference: Using transitivity and inheritance rules, implicit rules are derived from
the existing ones.
2.2 Related Work
As of now, no work has been done on extracting semantically rich relations like SubClass
union and SubClass intersection between the concepts of an existing ontology using its
corresponding text. Furthermore, I was not able to find any dataset where there exist both
an ontology and closely related text. The following are the two research projects that are
most closely related to our project.
2.2.1 Entity Linking
The authors of [18] proposed an unsupervised approach for Entity Linking in the life sciences
domain. Their method takes as input an ontology and the corresponding text and links the
entity mentions from the text with the concepts of the ontology. They have used BioNLP16 [8]
word embeddings for comparing the semantic similarity between the concept and the entity
mention and have also proposed an algorithm that re-ranks the candidate concepts generated
for an entity mention based on their syntactic structure. They find the most informative word
(the headword) of the entity mention and of the concept, and then again compare the semantic
similarity between the entity mention and the candidate concept based on those headwords.
Their approach only considers the headword of either of them, when it may be possible that
the complete noun phrase is required for a more accurate meaning. The dataset they have
used was provided under the BioNLP16 task, but in the ontology provided in that dataset,
there was not a single axiom based on the concepts.
2.2.2 Axiom Extraction
The authors of [29] proposed an approach to extract lexically expressive ontology axioms
from textual definitions using Formal Concept Analysis and Relational Exploration. By
parsing the dependency structure of the input text, they transform the dependency tree
into a set of OWL axioms using their own transformation rules. They have used these
heuristics to extract axioms from simple and straightforward definitions. Their approach
might not work for long and complicated sentences, where the context is very important.
2.3 BioWordVec
BioWordVec [32] is a set of biomedical word embeddings which are trained over 28 million
PubMed articles and 2 million MIMIC III Clinical Notes. It uses the contextual information
around a word from the unlabeled biomedical text along with the vocabulary, Medical Subject
Headings (MeSH), to generate real-valued word embeddings of 200 dimensions. Furthermore,
the model is also able to compute word embeddings for out-of-vocabulary terms. BioWordVec
has shown high evaluation scores when its pre-trained word embeddings were tested on
medical word pair similarity.
2.4 UMLS Resources
The Unified Medical Language System (UMLS) [6] is a licensed service provided by the National
Library of Medicine that provides tools and resources to integrate many health and biomedical
vocabularies and standards, in order to create more effective and interoperable biomedical
information systems.
2.4.1 Metathesaurus
The Metathesaurus [2] is the biggest component of UMLS; it integrates concepts, related
meanings, and similar concepts from 200 different vocabularies into a single structured network.
Some of the widely used vocabularies are SNOMED CT US, ICD-10, CDT, and MeSH. Within
each source, each concept is represented by a unique CUI, or Concept Unique Identifier, which
represents the intended meaning of the concept. Further, each unique string form of a concept
is assigned an SUI, or String Unique Identifier. Different string forms of the same concept are
assigned different SUIs within the same vocabulary source. Then comes the AUI, or Atom
Unique Identifier, which uniquely represents every occurrence of any string in each of the
vocabularies. AUIs are the basic building blocks of the Metathesaurus.
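The CUI/SUI/AUI layering can be illustrated with a toy table of atoms (all identifiers, source names, and helper functions below are made-up examples, not actual UMLS data):

```python
# Each atom records one occurrence of a string form of a concept in one
# source vocabulary: (AUI, SUI, CUI, source, string).
atoms = [
    ("A001", "S001", "C001", "MSH",      "Pneumonia"),
    ("A002", "S001", "C001", "SNOMEDCT", "Pneumonia"),
    ("A003", "S002", "C001", "SNOMEDCT", "Pneumonias"),
]

def strings_of_concept(cui):
    """All distinct string forms (SUIs) recorded for a concept (CUI)."""
    return sorted({sui for _, sui, c, _, _ in atoms if c == cui})

def occurrences_of_string(sui):
    """Every per-source occurrence (AUI) of a string form (SUI)."""
    return [aui for aui, s, _, _, _ in atoms if s == sui]
```

Note how one CUI fans out into several SUIs (string variants), and one SUI fans out into several AUIs (one per source vocabulary occurrence).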
2.4.2 Metamap
Metamap [4] is a tool that provides various algorithms for relating the entity mentions
of any biomedical text with the concepts of the UMLS Metathesaurus. Further, it also produces
a score for each connection, stating how good the connection between an entity and the found
Metathesaurus concept is. Other features of Metamap include word sense disambiguation (WSD),
polarity detection, and predicate resolution, among other technical algorithms.
2.5 Evaluating Existing Ontology Learning Systems
I have evaluated many ontology learning systems in constrained ways, depending upon the form
of their availability. Most of the systems were either made public for a limited period or never
available. But, there are research articles and evaluation papers done on those systems, which
can also be used to evaluate the systems.
Below is the list of ontology learning systems whose evaluation was done based on their proposed
research articles and evaluation done on them by other researchers and survey papers [5,13,16,
31].
ASIUM [11]
Ontology Layers: Terms, Concepts, Taxonomic Relations
Techniques: Statistic: Agglomerative Clustering. Linguistic: Sentence Parsing, Syntactic structure analysis, Sub-categorization frames
Relations: SubClass (prospective: Union, based on patterns)
Source Availability: Not publicly available; only the paper

TextStorm/Clouds [26]
Ontology Layers: Terms, Taxonomic Relations, Non-Taxonomic Relations, Axioms
Techniques: Linguistic: POS tagging using WordNet [25], Syntactic structure analysis, Anaphora resolution. Logic: Inductive Logic Programming
Relations: SubClass
Source Availability: Not publicly available; only the paper

SYNDIKATE [15]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Techniques: Linguistic: Syntactic structure analysis, Anaphora resolution, Semantic templates and domain knowledge. Logic: Inference engine
Relations: SubClass
Source Availability: Not publicly available; only the paper

OntoLearn [28]
Ontology Layers: Terms, Concepts, Taxonomic Relations
Techniques: Statistic: Relevance Analysis. Linguistic: POS tagging, Sentence Parsing, WordNet [25] concepts and hypernyms
Relations: SubClass
Source Availability: Source was made public but could never be found; only the paper is available

CRCTOL [17]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Techniques: Statistic: Relevance Analysis. Linguistic: POS tagging, Sentence Parsing, Domain lexicons, Word Sense Disambiguation, Lexico-syntactic patterns, Syntactic structure analysis
Relations: SubClass (prospective: Intersection and Union)
Source Availability: Not publicly available; only the paper

OntoGain [10]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Techniques: Statistic: Agglomerative Clustering, Formal Concept Analysis, Association Rule Mining. Linguistic: POS Tagging, Shallow Parsing, Relevance Analysis
Relations: SubClass
Source Availability: Not publicly available; only the paper

Text-to-Onto [23]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Techniques: Statistic: Co-occurrence Analysis, Agglomerative Clustering, Association Rule Mining. Linguistic: POS tagging, Sentence Parsing, Syntactic structure analysis, Domain lexicons for concepts, Hypernyms from WordNet [25], Lexico-syntactic patterns
Relations: SubClass
Source Availability: Publicly available, along with a paper

Text2Onto [9]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Techniques: Statistic: Co-occurrence Analysis, Agglomerative Clustering, Association Rule Mining. Linguistic: POS tagging, Sentence Parsing, Syntactic structure analysis, Domain lexicons for concepts, Hypernyms from WordNet [25], Lexico-syntactic patterns
Relations: SubClass
Source Availability: Publicly available, along with a paper

T2K [22]
Ontology Layers: Terms, Concepts, Taxonomic Relations
Techniques: Statistic: Agglomerative Clustering, Formal Concept Analysis, Association Rule Mining. Linguistic: POS Tagging, Sentence Parsing, Syntactic Structure Analysis, Relevance Analysis
Relations: SubClass
Source Availability: Not publicly available; only the paper
Below is the list of ontology learning systems whose source code is available, along with
user interfaces for analysing outputs on different input texts from different domains.
DL-Learner [7]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations, Axioms
Relations: SubClass, Intersection
Comments: Structured data is a necessary component for the extraction of the different layers

DOME [1]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Relations: SubClass

DOG4DAG [30]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Relations: SubClass
Comments: Domain independent

DOODLE-OWL [12]
Ontology Layers: Terms, Concepts, Taxonomic Relations, Non-Taxonomic Relations
Relations: SubClass
Comments: Requires domain knowledge such as input keywords
As can be seen in the above table, there is not a single system that can form SubClass union
and intersection axioms after extracting the ontology from the text. Hence, I propose my
own architecture in the next chapter, with which we can form such relations between the
existing concepts of the ontology.
Chapter 3
Proposed Architecture
Before moving on to the approach, let us consider what resources we have and what the end
result will be. Initially, we have an article, which conveys some knowledge and definitions,
and an ontology on which that article is based. Essentially, we know that whatever entities,
named entities, relations, and concepts are mentioned in the text will be directly related to
the concepts mentioned in the ontology.
3.1 Generalized Model
With the given text and ontology, I want to find enriching relations like union/intersection,
between the concepts of the ontology. For my thesis, I have proposed an unsupervised approach
to work in the medical domain, which may be extended to be used independently of the working
domain. The core basis of my approach is based on the semantic and syntactic similarity between
the concepts and the entity mentions in the text. As you can see in Figure 3.1, it is the most
generalized model for my approach. The only module that can be improved is the “Entity
Linking” Module in the given approach. The rest of the modules are pre-processing and factual
algorithms, that can remain the same but maybe changed when computational resources are
concerned.
Figure 3.1: Generalized Model
Figure 3.2: Proposed Model
3.2 Proposed Model
As for the Entity Linking module of the model (Figure 3.2) used in this report, I have used the
following approach, explained in detail in the subsequent sections: (i) comparing the real-valued
vector representations of the entity and the candidate concept using the pre-trained word
embedding model BioWordVec [32], (ii) comparing the real-valued vector representations of
the entity and the candidate concept using a custom word2vec model trained on the text in
which the entities are mentioned, and (iii) comparing the entity and the candidate concept
using a tree-search algorithm on the UMLS Metathesaurus [2].
3.2.1 Entity Extraction
First is the entity extraction module, where we extract all entities relevant to the medical
domain. The entity mentions that appear in the text can be nouns, noun phrases, verbs, or
verb phrases. We extract all such entity mentions using the Python library SciSpacy, which
is based on the widely used spaCy library. SciSpacy provides many pre-trained models for
extracting medically relevant entities, of which I have used “en_core_sci_lg”.
After the first module, we now have the list of entity mentions (E) from the text and the list
of concepts (C) from the ontology.
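A minimal sketch of this step follows (the helper name is my own); in the actual pipeline, `nlp` would be `spacy.load("en_core_sci_lg")`, which requires SciSpacy and that model to be installed, while the helper itself only depends on the standard spaCy `Doc` interface (`doc.ents`):

```python
# Extract the list E of unique entity mentions from a text, given any
# loaded spaCy/SciSpacy pipeline `nlp` (e.g., the "en_core_sci_lg" model).
def extract_entity_mentions(nlp, text):
    """Return the unique entity mentions found in `text`, in order of appearance."""
    doc = nlp(text)
    seen, mentions = set(), []
    for ent in doc.ents:
        key = ent.text.lower()
        if key not in seen:           # de-duplicate case-insensitively
            seen.add(key)
            mentions.append(ent.text)
    return mentions
```

With the model loaded, `extract_entity_mentions(nlp, corpus_text)` yields the list E; the concept list C comes directly from the ontology.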
3.2.2 Training of Custom Word2Vec model
Before moving on to the entity linking module, we need a custom word2vec model trained
on the text corpus given as input. The reason is that the word embeddings generated by this
model capture the context around the target word for which the embedding is calculated.
Using the Gensim Python library, I have trained a custom word2vec model on the input text
corpus that generates embeddings of 200 dimensions. Furthermore, while training the model,
I have considered a context window of 5 around the target word, for example:
W1 W2 W3 W4 W5 pneumonia W6 W7 W8 W9 W10
where the Wi are the contextual words around the target word Pneumonia.
3.2.3 Entity Linking
In this second module, we want to find those candidate concepts that can act as a parent
class of each of the entity mentions found in the text, i.e., such that the entity can be an
instance of the candidate concept. For every entity mention, we iterate over all the candidate
concepts and pass each concept-entity (c-e) pair through the three sub-modules of the Entity
Linking module sequentially.
BioWordVec based comparison
Using BioWordVec [32], we generate the real-valued word embeddings of both the candidate
concept and the entity mention, WE(c) and WE(e) respectively. After generating the
embeddings, we compare their cosine similarity, and if
cosine(WE(c), WE(e)) > 0.75 (3.1)
holds, the concept-entity pair is passed on to the second sub-module. The value 0.75 was
determined experimentally and can vary based on the modules used in combination with it.
Custom Word2Vec based comparison
Using the custom word2vec model, we generate the real-valued word embeddings of both the
candidate concept and the entity mention, WE(c) and WE(e) respectively. After generating
the embeddings, we compare their cosine similarity, and if
cosine(WE(c), WE(e)) > 0.55 (3.2)
holds, the concept-entity pair is passed on to the third sub-module. Again, the value 0.55
was determined experimentally and can vary based on the modules used in combination
with it.
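Both similarity gates can be sketched in plain Python (thresholds as in Equations 3.1 and 3.2; the function names are my own):

```python
import math

def cosine(u, v):
    """Cosine similarity of two real-valued embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def passes_gate(we_c, we_e, threshold):
    """True if the concept-entity pair clears the similarity threshold
    (0.75 for BioWordVec, 0.55 for the custom word2vec model)."""
    return cosine(we_c, we_e) > threshold
```

A pair is forwarded to the tree search only if `passes_gate(..., 0.75)` holds for the BioWordVec embeddings and then `passes_gate(..., 0.55)` holds for the custom ones.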
UMLS Metathesaurus Search
After a pair passes the first two sub-modules of the Entity Linking module, we finally check
whether the candidate concept is a suitable parent class of the entity mention, based on a
tree search on the UMLS Metathesaurus [2]. While searching the tree, the concept-entity
connection can be based on
Figure 3.3: Scenario 1: The entity mention is found directly under the sub-tree of the candidate concept.
Figure 3.4: Scenario 2: There exists a nearest common ancestor of both the concept and the entity mention.
two scenarios. If the pair is found to be a valid pair based on Scenario 1, it is directly
added as a concept-instance pair to the ontology. If it fails Scenario 1, it is checked against
Scenario 2. Finally, based on Scenario 2, a decision is made as to whether there is a quality
connection between the concept and the entity mention. I will now explain both scenarios.
Scenario 1: For the first scenario, we will extract all the descendants of the candidate
concept from UMLS Metathesaurus across all the relevant sources. All the descendants that are
returned will be represented by their unique CUI or SUI or AUI across the sources. We will
iterate over this list of descendants and check if the concerned entity is present directly under
the sub-tree of the candidate concept. That can be achieved by just comparing the target entity
with each of the listed descendants. Now, do note that we have to extract the descendants of
the concept from each of the source vocabularies.
In Figure 3.3, considering that “Pneumonia” is the candidate concept and “Congenital Bac-
terial Pneumonia” is the entity mention, we can see that “Congenital Bacterial Pneu-
monia” comes directly in the descendants of the concept “Pneumonia”. So, after this com-
parison, “Congenital Bacterial Pneumonia” will be directly added as an instance of the
concept “Pneumonia”.
Now, consider another example, where the candidate concept is “Infective Pneumonia” and
the entity mention is “Basal Pneumonia”. After searching through the descendants of “In-
fective Pneumonia”, we found that “Basal Pneumonia” does not match with any of the
listed descendants across all the sources. So, this pair will now be considered for Scenario-2.
Scenario 2: For the second scenario, we compare the ancestors of both the candidate
concept and the entity mention. In Figure 3.4, considering the latter example under Scenario
1, we compare the ancestors of “Infective Pneumonia” and “Basal Pneumonia” and locate
their Lowest Common Ancestor using a recursive algorithm. As can be seen in Figure 3.5,
after running the algorithm, we find that “Respiratory Finding” is the lowest common
ancestor of “Infective Pneumonia” and “Basal Pneumonia”.
Figure 3.5: Results of Ancestors of Infective Pneumonia and Basal Pneumonia
After we have located the Lowest Common Ancestor, “Respiratory Finding”, we compare
the scores of the pairs (“Respiratory Finding”, “Infective Pneumonia”) and (“Respiratory
Finding”, “Basal Pneumonia”), which are generated by appending each pair into an
individual sentence and passing those sentences to Metamap [4]. Metamap returns a score
for each pair, and if:
m1 = Metamap(“Infective Pneumonia, a Respiratory Finding”)
m2 = Metamap(“Basal Pneumonia, a Respiratory Finding”)
(m1 + m2)/2 > 0.85 (3.3)
then the entity mention “Basal Pneumonia” will be added as an instance of the candidate
concept “Infective Pneumonia”. Otherwise, the pair is ignored.
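The Scenario 2 decision can be sketched as follows. This is a hedged sketch: `score_fn` stands in for the call to MetaMap (whose actual invocation and scoring scale are not shown here), and the ancestor lists in the example are illustrative, not taken from the real source vocabularies.

```python
def lowest_common_ancestor(ancestors_a, ancestors_b):
    """Each argument lists a term's ancestors ordered from the nearest
    parent up to the root; return the first ancestor shared by both."""
    shared = set(ancestors_b)
    for ancestor in ancestors_a:
        if ancestor in shared:
            return ancestor
    return None

def link_by_lca(concept, mention, anc_concept, anc_mention,
                score_fn, threshold=0.85):
    """Scenario 2: accept the pair when the averaged scores of the two
    '<term>, a <LCA>' sentences exceed the threshold (Eq. 3.3)."""
    lca = lowest_common_ancestor(anc_concept, anc_mention)
    if lca is None:
        return False
    m1 = score_fn(f"{concept}, a {lca}")
    m2 = score_fn(f"{mention}, a {lca}")
    return (m1 + m2) / 2 > threshold

# Illustrative ancestor chains for the running example.
anc_infective = ["Pneumonia", "Lung Disease", "Respiratory Finding"]
anc_basal = ["Respiratory Finding", "Clinical Finding"]
```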
3.2.4 Ontology Modification
Whenever we find a connection between an entity mention and a candidate concept, the
entity mention is also added as an instance of all the subsequent parent classes of the
candidate concept. The reason is that those parent concepts are more generalized
representations of the same entity mention, so they too need to be connected with it.
After running the complete model for each of the entity mentions and candidate concepts, we
will have the following dictionary of concepts, D, where:
D = {C1, C2, C3, C4, .....}
where
D[C1] = [I1, I2, I3, I4, .....]
D[C2] = [I2, I5, I6, .....]
D[C3] = [I3, I4, I5, .....]
........................
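The propagation of an instance to all subsequent parent classes, and the resulting dictionary D, can be sketched as below. The hierarchy used here is an illustrative fragment, not the actual ontology.

```python
from collections import defaultdict

def propagate_instance(concept, instance, parent_of, D):
    """Add `instance` under `concept` and under every ancestor of
    `concept`, since each parent is a more generalized representation.
    `parent_of` maps a concept to its direct parent (None at the root)."""
    node = concept
    while node is not None:
        if instance not in D[node]:
            D[node].append(instance)
        node = parent_of.get(node)

# Illustrative hierarchy; after the call, D is the dictionary of concepts.
parent_of = {"Infective Pneumonia": "Pneumonia",
             "Pneumonia": "Respiratory Finding",
             "Respiratory Finding": None}
D = defaultdict(list)
propagate_instance("Infective Pneumonia", "Basal Pneumonia", parent_of, D)
```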
3.2.5 Axiom Extraction
To form the axioms from a list of concepts, we find the unions and intersections
between them, using a dynamic-programming paradigm.
In Phase 1, we find SubClass union and SubClass intersection relations between each pair
of concepts; if the corresponding set is non-empty and matches the set of a third concept,
we add an axiom of the corresponding type in the following way.
We consider two types of axioms, SubClass union and SubClass intersection, which can be
described as follows:
A ⊑ B ⊔ C (3.4)
where A, B, and C are different concepts and A is a SubClass of the union of B and C; and
W ⊑ X ⊓ Y (3.5)
where W, X, and Y are different concepts and W is a SubClass of the intersection of X and Y.
Consider the union axiom 3.4. The left-hand side of the axiom becomes the key, and the
right-hand side becomes the value: a list of the concepts that form the relation:
Union[“A”] = [“B”, “C”]
When another union axiom with “A” on the left-hand side is found, its concepts are simply
appended to the existing list of “A”:
A ⊑ E ⊔ D (3.6)
If axiom 3.6 is found after 3.4, the dictionary is updated as:
Union[“A”] = [“B”, “C”, “D”, “E”]
Now, consider the intersection axiom 3.5. Multiple intersection axioms can exist for a
single target concept, so each is stored as a separate hyphenated string in the following
way:
Intersection[“W”] = [“X-Y”]
When another intersection axiom with “W” on the left-hand side is found, it is added as a
separate entry to the existing list of “W”, just like before:
W ⊑ P ⊓ Z (3.7)
If axiom 3.7 is found after 3.5, the dictionary is updated as:
Intersection[“W”] = [“X-Y”, “P-Z”]
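The two dictionary-building rules can be sketched in a few lines of Python. This is a minimal illustration of the bookkeeping described above (flat merged list for unions, separate hyphenated strings for intersections), not the full Phase-1 algorithm.

```python
def add_union_axiom(union, lhs, disjuncts):
    """Record lhs ⊑ d1 ⊔ d2 ⊔ ...: all disjuncts for the same left-hand
    side are merged into one flat list (as with axioms 3.4 and 3.6)."""
    bucket = union.setdefault(lhs, [])
    for concept in disjuncts:
        if concept not in bucket:
            bucket.append(concept)

def add_intersection_axiom(intersection, lhs, conjuncts):
    """Record lhs ⊑ c1 ⊓ c2 ⊓ ...: each axiom stays a separate
    hyphenated string (as with axioms 3.5 and 3.7)."""
    intersection.setdefault(lhs, []).append("-".join(conjuncts))

union, intersection = {}, {}
add_union_axiom(union, "A", ["B", "C"])                # A ⊑ B ⊔ C
add_union_axiom(union, "A", ["E", "D"])                # A ⊑ E ⊔ D merges in
add_intersection_axiom(intersection, "W", ["X", "Y"])  # W ⊑ X ⊓ Y
add_intersection_axiom(intersection, "W", ["P", "Z"])  # W ⊑ P ⊓ Z stays separate
```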
In Phase 2, we check whether any axioms can be formed from a combination of 3 concepts.
This is done by comparing the original list of concepts with the dictionaries of axioms
formed in Phase 1.
In Phase 3, we check whether any axioms can be formed from a combination of 4 concepts.
This is done by comparing the original list of concepts with the dictionaries formed in
Phase 2, and by comparing each of the Phase-1 dictionaries with itself.
This continues until Phase 6, because most of the existing axioms carried a logical
meaning only up to Phase 6. The phase limit was also decided experimentally, by comparing
the F1-scores obtained when restricting the process to each phase.
Chapter 4
Evaluations
4.1 Dataset
While working in the medical domain, I was not able to find any corpus with a corresponding
ontology (axioms and instances included). Finding an ontology rich in axioms and
connected instances was the main issue. To tackle this problem, I found the
Disease Ontology [19], which contains many axioms between its concepts, but no
instances connected to any of the concepts. So, I first had to pre-process the
ontology and then extract articles related to each of the instances and
concepts.
4.1.1 Ontology Pre-Processing
As shown in Figure 4.1 of the Disease Ontology, there are no instances connected to
any concept; that is, no instance is declared as a type of any concept mentioned in
the ontology. However, a lowermost leaf-child concept, for example “external ear basal
cell carcinoma”, can act as an instance of its directly connected parent class
“external ear carcinoma”. Subsequently, it can also be connected as an instance
of all the subsequent parent classes in the hierarchy (red underlined). Furthermore, some
class axioms are connected with these newly-formed instances (Figure 4.2). These class
axioms (SubClass axioms) are resolved by connecting the newly-formed instance, for
example “external ear basal cell carcinoma”, with the mentioned class names, which in
this case are “basal cell carcinoma” and “external ear carcinoma”. This process is
executed for all the newly-formed instances. Finally, we have a total of 10085 concepts
and 7875 instances, giving a total of 42511 concept-instance pairs.
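The leaf-to-instance conversion described above can be sketched as follows. This is a simplified illustration of the pre-processing idea, assuming the hierarchy is available as a class-to-subclasses map; the real procedure works over the OWL ontology itself.

```python
def leaves_to_instances(subclasses_of):
    """Turn each leaf class (one with no subclasses of its own) into an
    instance of its direct parent. `subclasses_of` maps a class to the
    list of its direct subclasses."""
    instances = {}
    for cls, subs in subclasses_of.items():
        for sub in subs:
            if not subclasses_of.get(sub):      # leaf: no children of its own
                instances.setdefault(cls, []).append(sub)
    return instances

# Fragment of the hierarchy from Figure 4.1 (illustrative):
hierarchy = {"external ear carcinoma": ["external ear basal cell carcinoma"]}
```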
Figure 4.1: Scenario-1: The Disease Ontology Representation.
Figure 4.2: Scenario-2: Axioms related to the newly-formed instance “external ear basal cell carcinoma”.
4.1.2 Corpus Extraction and Pre-processing
After Ontology Pre-processing, we have all the newly-formed instances. We extract
all the publicly available PubMed abstracts that are based on these instances. While
extracting the articles, the instances and their subsequent parent classes are fed in as
keywords. Using the EntrezPy Python library, we can extract the top 4-5 articles that are
chiefly based on the input keywords. After extraction, we have a total of 739 articles.
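As a sketch of this step: the report uses the EntrezPy library, but the underlying request it issues is an NCBI E-utilities `esearch` call against the `pubmed` database. The snippet below only builds that query URL from the instance and its parent classes; fetching and parsing the abstracts (`efetch`) is omitted.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(instance, parent_classes, retmax=5):
    """Build an NCBI E-utilities esearch URL that looks for abstracts
    matching the instance and its parent classes, used as keywords."""
    term = " OR ".join([instance] + parent_classes)
    return ESEARCH + "?" + urlencode(
        {"db": "pubmed", "term": term, "retmax": retmax})

url = pubmed_query_url("external ear basal cell carcinoma",
                       ["external ear carcinoma", "carcinoma"])
```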
4.2 Evaluation of Entity Linking Module
After executing the Ontology Pre-processing, we can extract all the concept-instance pairs
and store them in a dictionary whose key is a string literal comprising a hyphenated
concept-instance pair and whose value is ‘1’. For example, since there is a connection
between “external ear carcinoma” and “external ear basal cell carcinoma”, it is added to
the dictionary as:
d[“external ear carcinoma-external ear basal cell carcinoma”] = 1
Consider this dictionary as PrePairs, holding all the pairs that actually existed in the
Ontology before running the Entity Linking module. After running the module, new
connections are found between the existing concepts and instances. Consider PostPairs as
a separate dictionary of the pairs found between the existing concepts and instances after
running the module. Comparing the pairs in both dictionaries, I obtained the following
metrics: Recall of 47.8 and Precision of 9.58, giving a final F1-score of 15.97.
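The comparison of PrePairs and PostPairs can be sketched as a standard precision/recall computation over the hyphenated keys. The example pairs below are illustrative.

```python
def pair_metrics(pre_pairs, post_pairs):
    """Precision, Recall and F1 (in %) over hyphenated 'concept-instance'
    keys. pre_pairs holds the gold pairs from the ontology, post_pairs
    the pairs found by the Entity Linking module."""
    true_positives = len(set(pre_pairs) & set(post_pairs))
    precision = 100 * true_positives / len(post_pairs)
    recall = 100 * true_positives / len(pre_pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if true_positives else 0.0)
    return precision, recall, f1

pre = {"external ear carcinoma-external ear basal cell carcinoma": 1}
post = {"external ear carcinoma-external ear basal cell carcinoma": 1,
        "carcinoma-external ear basal cell carcinoma": 1}  # a false positive
precision, recall, f1 = pair_metrics(pre, post)
```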
As you can see, the F1-score is heavily impacted by the low Precision. The reason is that
a huge number of new connections (False Positives) were found between the existing concepts
and instances. Since we do not have a dataset in which an ontology and a text corpus are
closely related, many spurious connections were found in the ontology based on the
extracted abstracts. This resulted in a very low F1-score.
4.3 Evaluation of Complete Model
To evaluate the complete model, we simply compare the axioms that initially existed in the
ontology with the newly-formed axioms, obtained after running the recursive algorithm on
the generated concept-instance pairs. The SubClass axioms that initially existed in the
ontology number 10085 intersection axioms and 323 union axioms.
Consider two dictionaries, PreUnion and PreIntersection, of the axioms that originally
existed in the Disease Ontology. After running the complete model and computing the
axioms in its final module, the newly-formed axioms are stored in the corresponding
dictionaries PostUnion and PostIntersection. Some extracted examples are shown in
Table 4.1.
Axiom Type: SubClass Intersection
  Actual:    Ampulla of Vater Carcinoma subClassOf Ampulla of Vater Cancer and Carcinoma and Disease and Cancer
  Extracted: Ampulla of Vater Carcinoma subClassOf Ampulla of Vater Cancer and Ampulla of Vater Neoplasm and Carcinoma and Biliary Tract Benign Neoplasm and Benign Neoplasm and Disease and Cancer

Axiom Type: SubClass Intersection
  Actual:    Duodenal Obstruction subClassOf Duodenal Disease and Disease
  Extracted: Duodenal Obstruction subClassOf Duodenal Disease and Disease

Axiom Type: SubClass Intersection
  Actual:    Histiocytosis subClassOf Lymphatic System Disease and Disease
  Extracted: Histiocytosis subClassOf Lymphatic System Disease and Disease

Axiom Type: SubClass Intersection
  Actual:    Lymph Node Cancer subClassOf Lymph Node Disease and Lymphatic System Cancer and Disease and Cancer
  Extracted: Lymph Node Cancer subClassOf Lymph Node Disease and Lymphedema and Lymphatic System Cancer and Lymphatic System Disease and Disease and Cancer

Axiom Type: SubClass Union
  Actual:    Prader-Willi Syndrome subClassOf Paternal variant or Chromosomal deletion or Loss of function variant or Chromosomal translocation or Maternal uniparental disomy
  Extracted: Prader-Willi Syndrome subClassOf Paternal variant or Chromosomal deletion or Loss of function variant or Chromosomal translocation or Chromosomal transposition or Chromosomal structure variation or Maternal uniparental disomy

Table 4.1: Examples of actual SubClass axioms and the corresponding extracted axioms.
As the table shows, simpler axioms were extracted accurately, but as the complexity of an
axiom increases, more new concepts become connected as part of the SubClass axioms.
Finally, comparing both pairs of dictionaries, (PreIntersection, PostIntersection) and
(PreUnion, PostUnion), we obtained the corresponding F1-scores: SubClass Intersection
= 19.08 and SubClass Union = 14.2. Considering the metrics obtained from the
evaluation of the Entity Linking module, a lot of false-positive pairs were generated, and
hence many new axioms appeared in the case of SubClass union axioms. As for the SubClass
intersection axioms, many new axioms were formed on the basis of the existing
EquivalentClass intersection axioms. Hence, the Precision and the corresponding F1-scores
decreased.
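The axiom-level comparison can be sketched for the intersection dictionaries as below. Since each right-hand side is a hyphenated string, the conjuncts are compared as unordered sets, so “X-Y” and “Y-X” count as the same axiom. This is an illustrative sketch of the scoring, not the exact evaluation code.

```python
def intersection_axiom_f1(pre, post):
    """F1 (in %) between two intersection-axiom dictionaries. Each value
    is a list of hyphenated right-hand sides; conjunct order is ignored."""
    def normalize(d):
        return {(lhs, frozenset(rhs.split("-")))
                for lhs, rhs_list in d.items() for rhs in rhs_list}
    gold, found = normalize(pre), normalize(post)
    true_positives = len(gold & found)
    if not true_positives:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)

pre = {"W": ["X-Y"]}
post = {"W": ["Y-X", "P-Z"]}   # same axiom in a different order, plus one extra
score = intersection_axiom_f1(pre, post)
```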
Chapter 5
Challenges
At the beginning of this semester, we started by discussing the approach for the general
domain. While working in the general domain, I initially tried to extend the work of [29],
using the Stanford NLP Parser [24] to parse the dependency structure and transformation
rules to create the DL axioms, but, as mentioned earlier, this only works for simple
sentences that give a definition or a functional aspect of some concept. I tried to capture
the context within long paragraphs, but after applying the transformation rules, the
generated axioms were meaningless long sentences.
While working in the general domain, I also researched Entity Linking methods, which try
to link the entity mentions in a text corpus with a curated list of concepts. These
methods use existing word-embedding models such as GloVe [27] and Word2Vec [3] for the
semantic similarity between the entity mentions and the concepts. In combination with
these models, I tried to compare an entity and a concept based on their relational
connections, i.e., what types of relations the concept and entity exhibit. This required
relation extraction from text, which again required a very rich text corpus and ontology,
neither of which I could find in the general domain.
Finally, after shifting to the medical domain, as shown in the earlier sections, there does
not exist any ontology with instances connected to the concepts and axioms over those
concepts. Furthermore, while I was able to process the Disease Ontology according to my
requirements, another component was missing: a text corpus based on the Disease Ontology.
That is why I was not able to obtain sufficiently strong results with my proposed
architecture.
Chapter 6
Future Scope
As mentioned earlier, the module with the most room for improvement in the above model is
the Entity Linking module. For this report, I have used real-valued word embeddings and
graph-based algorithms to find the link between a concept and an entity mention. Instead,
we could also use neural models such as ALBERT [20] or BioBERT [21], which have shown very
promising results when normalizing entity mentions using character-level embeddings. I was
not able to incorporate such a model because of time constraints and limited knowledge of
fine-tuning such models.
Apart from Entity Linking approaches, other approaches such as pattern matching could also
replace the Entity Linking module. As of now, we have heuristic rules for extracting
hyponyms and hypernyms, but we do not have such rules to extract union and intersection
relations between the same entity mentions in the text. Some of these rules could involve
phrases like “as well as”, “can also be”, etc. But, again, this needs a deep understanding
of, and research into, the syntactic rules of the English language.
Lastly, the work of [29] can also be extended by concentrating more on the transformation
rules, which create DL axioms based on the syntactic structure of the dependency tree.
Examining the dependency structure of more complex English sentences recursively and
applying the transformation rules appropriately may give way to more complex DL axioms.
Bibliography
[1] Dome, http://dome.sourceforge.net/.
[2] Metathesaurus. National Library of Medicine (US), Sep 2009.
[3] Google Code Archive - Long-term storage for Google Code Project Hosting., Jun 2020.
[Online; accessed 6. Jun. 2020].
[4] Aronson, A. R., and Lang, F.-M. An overview of MetaMap: historical perspective and
recent advances. J. Am. Med. Inform. Assoc. 17, 3 (May 2010), 229.
[5] Asim, M. N., Wasim, M., Khan, M. U. G., Mahmood, W., and Abbasi, H. M. A
survey of ontology learning techniques and applications. Database 2018 (10 2018). bay101.
[6] Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical
terminology. Nucleic Acids Res. 32, Database (Jan 2004), D267.
[7] Buhmann, L., Lehmann, J., and Westphal, P. Dl-learner - a framework for inductive
learning on the semantic web. Web Semantics: Science, Services and Agents on the World
Wide Web 39 (2016), 15 – 24.
[8] Chiu, B., Crichton, G., Korhonen, A., and Pyysalo, S. How to train good word
embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Nat-
ural Language Processing (Berlin, Germany, Aug. 2016), Association for Computational
Linguistics, pp. 166–174.
[9] Cimiano, P., and Volker, J. Text2onto. In Natural Language Processing and Infor-
mation Systems (Berlin, Heidelberg, 2005), A. Montoyo, R. Munoz, and E. Metais, Eds.,
Springer Berlin Heidelberg, pp. 227–238.
[10] Drymonas, E., Zervanou, K., and Petrakis, E. G. M. Unsupervised ontology acqui-
sition from plain texts: The ontogain system. In Natural Language Processing and Infor-
mation Systems (Berlin, Heidelberg, 2010), C. J. Hopfe, Y. Rezgui, E. Metais, A. Preece,
and H. Li, Eds., Springer Berlin Heidelberg, pp. 277–287.
[11] Faure, D. Acquisition of semantic knowledge using machine learning methods: The
system “Asium”.
[12] Fukuta, N., Yamaguchi, T., Morita, T., and Izumi, N. Doddle-owl: Interactive
domain ontology development with open source software in java. IEICE Transactions on
Information and Systems E91D (04 2008).
[13] Ghosh, M., Naja, H., Abdulrab, H., and Khalil, M. Ontology learning process as a
bottom-up strategy for building domain-specific ontology from legal texts. pp. 473–480.
[14] Guarino, N., Oberle, D., and Staab, S. What is an ontology? In Handbook on
Ontologies, S. Staab and R. Studer, Eds., International Handbooks on Information Systems.
Springer, 2009, pp. 1–17.
[15] Hahn, U., and Romacker, M. The syndikate text knowledge base generator. In Pro-
ceedings of the First International Conference on Human Language Technology Research
(2001).
[16] Hatala, M., Gašević, D., Siadaty, M., Jovanović, J., and Torniai, C. Ontol-
ogy extraction tools: An empirical study with educators. IEEE Transactions on Learning
Technologies 5, 03 (Jul 2012), 275–289.
[17] Jiang, X., and Tan, A.-H. Crctol: A semantic-based domain ontology learning system.
Journal of the American Society for Information Science and Technology 61, 1 (2010),
150–168.
[18] Karadeniz, I., and Ozgur, A. Linking entities through an ontology using word embed-
dings and syntactic re-ranking. BMC Bioinf. 20, 1 (Dec 2019), 1–12.
[19] Kibbe, W. A., Arze, C., Felix, V., Mitraka, E., Bolton, E., Fu, G., Mungall,
C. J., Binder, J. X., Malone, J., Vasant, D., Parkinson, H., and Schriml, L. M.
Disease Ontology 2015 Update: An Expanded and Updated Database of Human Diseases
for Linking Biomedical Knowledge Through Disease Data. Nucleic Acids Res. 43, Database
issue (Jan 2015).
[20] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv
(Sep 2019).
[21] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. BioBERT:
a pre-trained biomedical language representation model for biomedical text mining. arXiv
(Jan 2019).
[22] Lenci, A., Montemagni, S., Pirrelli, V., and Venturi, G. Ontology learning from
italian legal texts. In Proceedings of the 2009 Conference on Law, Ontologies and the
Semantic Web: Channelling the Legal Information Flood (Amsterdam, The Netherlands,
The Netherlands, 2009), IOS Press, pp. 75–94.
[23] Maedche, A., Maedche, E., and Staab, S. The text-to-onto ontology learning envi-
ronment. In Software Demonstration at ICCS-2000 - Eight International Conference on
Conceptual Structures (2000).
[24] Marneffe, M.-C., MacCartney, B., and Manning, C. Generating typed dependency
parses from phrase structure parses. vol. 6.
[25] Miller, G. A. Wordnet: A lexical database for english. Commun. ACM 38, 11 (Nov.
1995), 39–41.
[26] Oliveira, A., Pereira, F. C., and Cardoso, A. Automatic reading and learning from
text, 2001.
[27] Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word
representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (Doha, Qatar, Oct. 2014), Association for Computational
Linguistics, pp. 1532–1543.
[28] Velardi, P., Faralli, S., and Navigli, R. Ontolearn reloaded: A graph-based algo-
rithm for taxonomy induction. Computational Linguistics 39 (09 2013), 665–707.
[29] Volker, J., and Rudolph, S. Lexico-logical acquisition of owl dl axioms: an integrated
approach to ontology refinement. pp. 62–77.
[30] Wachter, T., Fabian, G., and Schroeder, M. DOG4DAG: Semi-automated ontology
generation in OBO-Edit and Protégé. In Proceedings of the 4th International Workshop on
Semantic Web Applications and Tools for the Life Sciences (New York, NY, USA, 2012),
SWAT4LS ’11, ACM, pp. 119–120.
[31] Wong, W., Liu, W., and Bennamoun, M. Ontology learning from text: A look back
and into the future. ACM Comput. Surv. 44 (2012), 20:1–20:36.
[32] Zhang, Y., Chen, Q., Yang, Z., Lin, H., and Lu, Z. BioWordVec, improving biomed-
ical word embeddings with subword information and MeSH. Sci. Data 6, 52 (May 2019),
1–9.