

IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 11, Issue 2 (May-Jun. 2013), PP 101-117
www.iosrjournals.org

Tools for Ontology Building from Texts: Analysis and Improvement of the Results of Text2Onto

Sonam Mittal¹, Nupur Mittal²

¹Computer Science, B.K. Birla Institute of Engineering & Technology, Pilani, Rajasthan, India
²Computer Science, École Polytechnique de l'Université de Nantes, France

Abstract: Building ontologies from texts is a difficult and time-consuming process. Several tools have been developed to facilitate this process; however, they are not yet mature enough to automate all the tasks needed to build a good ontology without human intervention. Among these tools, Text2Onto is one for learning ontologies from textual data. This case study aims at understanding the architecture and working principles of Text2Onto, analyzing the errors it can produce, and finding ways both to reduce human intervention and to improve its results. Three texts of different lengths were used in the experiment. The quality of Text2Onto's results was assessed by comparing the entities it extracted with those extracted manually, and some causes of the errors produced by Text2Onto were identified. As an attempt to improve the results, the change discovery feature of Text2Onto was used: a meta-model of the given text was fed to Text2Onto to obtain a POM, on top of which an ontology was built for the existing text. The meta-model ontology was intended to identify all the core concepts and relations found in the manual ontology, with the ultimate objective of improving the hierarchy of the ontology; the use of a meta-model should help to classify the concepts under the various core concepts more accurately.

Keywords: Ontology, Text2Onto

I. Introduction
In the current scenario, the use of domain ontologies has been increasing. The general method for building such domain ontologies is to extract them from textual resources. This involves processing a huge amount of text, which makes it a difficult and time-consuming task. To expedite the process and support ontologists in the different phases of ontology building, several tools based on linguistic or statistical techniques have been developed. However, these tools are not fully automated yet: human intervention is required at some phases to validate their results so as to produce a good ontology. Such human intervention is not only time-consuming but also error-prone. Therefore, minimizing the human effort spent on error correction is key to enhancing these tools.
Text2Onto is a framework for learning ontologies from textual data. It can extract different ontology components such as concepts, relations, instances and hierarchies from documents. It also provides statistical values that help gauge the importance of those components in the text. However, users have to verify its results. We therefore studied this tool to assess how relevant its results are and to check whether they can be improved. For this purpose, the architecture and working principles of Text2Onto were studied first, and then some experiments were performed. To assess the results, we mainly considered concepts, instances and relations; we also observed the taxonomy, but the detailed study revolved around these three components.

II. Literature Review
This section gives a brief overview of ontology and ontology building processes, and sums up the papers [1], [3], [4], [5], [6], [7].

2.1 Ontology

An ontology is an explicit, formal (i.e. machine-readable) specification of a shared (accepted by a group or community) conceptualization of a domain of interest [2]. It should be restricted to a given domain of interest and therefore model the concepts and relations that are relevant to a particular task or application domain. Ontologies are built to be reused or shared anytime, anywhere and independently of the behavior and domain of the application that uses them. The process of instantiating a knowledge base is referred to as ontology population, whereas automatic support in ontology development is usually referred to as ontology learning. Ontology learning is concerned with knowledge acquisition.


2.2 Ontology life cycle

The ontology development process refers to the activities carried out to build ontologies from scratch [1]. To start the process, the activities to be carried out and the resources to be used must be planned; thus an ontology specification document is prepared, recording the requirements and specifications of the ontology development process. Ontology building proper starts with the conceptualization of the acquired knowledge in a conceptual model, which describes the problem and its solution with the help of some intermediate representations. Next, the conceptual models are formalized into formal or semi-formal models using frame-oriented or Description Logic (DL) representation systems. The next step is to integrate the current ontology with existing ontologies; though this step is optional, reusing existing ontologies should be considered in order to avoid duplicating the effort of building them. After this, the ontology is implemented in a formal language such as OWL or RDF. Once implemented, it is evaluated to make a technical judgment with respect to a frame of reference. The ontology should then be documented to the best possible extent, and finally effort is put into maintaining and updating it.
These activities can be organized in various ways to develop the ontology; the most common are the waterfall life cycle and the incremental life cycle.

III. Methontology
Methontology [1] is a well-structured methodology for building ontologies from scratch. It follows a number of well-defined steps to guide the ontology development process, in the order of specification, knowledge acquisition, conceptualization, implementation, evaluation and documentation. It also identifies management activities such as scheduling, control and quality assurance, and support activities such as integration and evaluation.

3.1 Specification

The first phase according to Methontology is specification, in which an ontology specification document is written: a formal or semi-formal document in natural language (NL) containing information such as the purpose of the ontology, the level of formality to be implemented, the scope of the ontology and the sources of knowledge. A well-designed specification document is one in which every term is relevant, which exhibits partial completeness, and which ensures the consistency of all the terms.

3.2 Knowledge Acquisition

The specification is followed by knowledge acquisition, an independent activity performed using techniques such as brainstorming, structured and non-structured interviews, formal questions, informal and formal text analysis, and knowledge acquisition tools.

3.3 Conceptualization

The next step is structuring the domain knowledge in a conceptual model. In this conceptualization step, a glossary of terms is built, relations are identified, the taxonomy is defined, the data dictionary is implemented, and tables of rules and formulas are made. The data dictionary describes and gathers all the useful and potentially usable domain concepts, their meanings, attributes, instances, etc. The table of instance attributes provides information about each attribute and about its values at the instance level. The result of this phase of Methontology is thus a conceptual model expressed as a set of well-defined deliverables, which allow one to assess the usefulness of the ontology and to compare its scope and completeness with those of other ontologies.

3.4 Integration

Integration is an optional step used to accelerate ontology building by merging various already existing related ontologies. This leads to an inspection of the meta-ontologies in order to find the libraries best suited to providing term definitions. As a result, Methontology produces an integration document summarizing the meta-ontology, the names of the terms to be used from the conceptual model, and the names of the ontologies from which the corresponding definitions are taken. Methontology highly recommends the reuse of already existing ontologies.

3.5 Implementation

The ontology is implemented using a formal language and an ontology development environment that incorporates a lexical and syntactic analyzer, so as to avoid lexical and syntactic errors.


3.6 Evaluation

Once the ontology has been implemented, it is judged technically, resulting in a short evaluation document describing the methods used to evaluate it.

3.7 Documentation

Documentation should be carried out during all the above steps. It sums up the steps, procedures and results of each step in a written document.

IV. Ontology Learning Layers
The different aspects of Ontology Learning (OL) have been presented in the form of a stack in [6]. OL involves processing the different layers of this stack, in the order of identifying terms (the linguistic realizations of domain-specific concepts), finding their synonyms, categorizing them as concepts, defining concept hierarchies and relations, and describing rules that restrict the concepts. The different ontology components and the methods for extracting them are explained in detail in the following sections.

V. Ontology Modeling Components
Methontology conceptualizes ontologies with tabular and graphical intermediate representations (IRs). The components of such IRs are: concepts, relations between the concepts of the domain, instances (specializations of concepts), constants, attributes (properties of concepts in general and of instances in particular), and formal axioms and rules specified in formal or semi-formal notation using DL. These components are used to conceptualize the ontologies by performing certain tasks, as proposed by Methontology.

5.1 Term

Terms are linguistic realizations of domain-specific concepts. Term extraction is a mandatory step for all aspects of ontology learning from text. The methods for term extraction are based on information retrieval, NLP research and term indexing. The state of the art is mostly to run a part-of-speech tagger over the domain corpus and then to verify the terms manually, constructing ad-hoc patterns. In order to automatically identify only relevant terms, a statistical processing step can be used that compares the distribution of terms between corpora.

5.2 Synonym
Finding synonyms allows the acquisition of semantic term variants within and between languages and hence helps in term translation. The main implementation integrates WordNet to obtain English synonyms; this requires word sense disambiguation algorithms to identify the synonyms according to the meaning of the word in the phrase. Clustering and related techniques are an alternative for dynamic acquisition. The two main approaches [6] are:
1. Harris' distributional hypothesis: terms are similar in meaning to the extent that they share syntactic contexts.
2. Statistical information measures defined over the web.
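A minimal sketch of the first approach: each term is represented by counts of the words seen in its contexts, and terms are compared by cosine similarity. The terms and counts below are invented for illustration; real systems derive them from a parsed corpus.

```python
import math
from collections import Counter

# Harris' distributional hypothesis, sketched: terms sharing many syntactic
# contexts get a high cosine similarity and become synonym candidates.
def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)   # Counter returns 0 for missing keys
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

# Invented context-word counts for three terms.
contexts = {
    "car":        Counter({"drive": 4, "road": 3, "engine": 2}),
    "automobile": Counter({"drive": 3, "road": 2, "engine": 2}),
    "ontology":   Counter({"concept": 5, "domain": 3}),
}
print(round(cosine(contexts["car"], contexts["automobile"]), 3))  # high: near-synonyms
print(cosine(contexts["car"], contexts["ontology"]))              # 0.0: no shared contexts
```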

5.3 Concept

The identification of a concept should provide:
1. A definition of the concept.
2. A set of concept instances, i.e. its extension.
3. A set of linguistic realizations of the concept.
Intensional concept learning includes the extraction of formal and informal definitions. An informal definition can be a textual description, whereas a formal definition includes the extraction of concept properties and relations with other concepts. The OntoLearn system can be used for this purpose.

5.4 Taxonomy

The three main factors exploited to induce taxonomies are:
1. The application of lexico-syntactic patterns to detect hyponymy relations.
2. Synonym extraction in context and term clustering, mainly using hierarchical clustering.
3. A document-based notion of term subsumption.
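The first factor can be illustrated with a single lexico-syntactic (Hearst) pattern, "X such as Y1, Y2 and Y3". The regex below is a toy stand-in for proper noun-phrase chunking, and the function name is ours, not from any of the tools discussed here.

```python
import re

# Toy Hearst pattern for hyponymy: "X such as A, B and C" suggests that
# A, B and C are hyponyms (subclasses) of X. \w+ stands in for NP chunks.
def hyponyms_such_as(sentence):
    m = re.search(r"(\w+) such as ((?:\w+(?:, | and ))*\w+)", sentence)
    if not m:
        return []
    hypernym = m.group(1)
    hyponyms = re.split(r", | and ", m.group(2))
    return [(h, hypernym) for h in hyponyms]

pairs = hyponyms_such_as("formats such as OWL, RDFS and XML")
print(pairs)  # -> [('OWL', 'formats'), ('RDFS', 'formats'), ('XML', 'formats')]
```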

5.5 Relation

Relations represent a type of association between concepts of the domain. Text mining using statistical analysis, with more or less complex levels of linguistic analysis, is used for extracting relations.


Relation extraction is similar to the problem of acquiring selection restrictions for verb arguments in NLP; an automatic content extractor is one kind of program used for this purpose.

5.6 Rule

Rules are used to infer knowledge in the ontology. The important factor for rule extraction is learning lexical entailment for application in question answering systems.

5.7 Formal Axiom

Formal axioms are logical expressions that are always true and are used as constraints in the ontology. The ontologist must identify the formal axioms needed in the ontology and describe them precisely. For each formal axiom, information such as its name, a natural language description and its logical expression should be identified.

5.8 Instance
Relevant instances must be identified from the concept dictionary and recorded in an instance table. An NL tagger can be used to identify proper nouns and hence instances.

5.9 Constant

Constants are numeric values that do not change over time.

5.10 Attribute

Attributes describe the properties of instances and concepts; accordingly, they can be instance attributes or class attributes. Ontology development tools usually provide predefined, domain-independent class attributes for all concepts.

VI. Ontology Tools and Frameworks
Several tools and frameworks have been developed to aid the ontologist in the different steps of ontology building. Different tools are available for extracting ontology components from different kinds of sources, such as text, semi-structured text and dictionaries. The scope of these tools varies from basic linguistic processing, such as term extraction and tagging, to guiding the whole ontology building process. Some of these tools and frameworks are discussed in the following sections. As the scope of this study is limited to Text2Onto, we discuss it in detail; other tools are presented briefly.

VII. Text2Onto
Text2Onto [7] is a framework for learning ontologies from textual data. It is a redesign of TextToOnto and is based on the Probabilistic Ontology Model (POM), which stores the learned primitives independently of a specific Knowledge Representation (KR) language. It calculates a confidence for each learned object for better user interaction. It also updates the learned knowledge each time the corpus changes, avoiding reprocessing it from scratch, and it allows for the easy combination and execution of algorithms as well as the writing of new ones.

7.1 Architecture and Workflow

The main components of Text2Onto are the algorithms, an algorithm controller and the POM. The learning algorithms are initialized by the controller, which triggers the linguistic preprocessing of the data. Text2Onto depends on the output of Gate; during preprocessing, it calls Gate applications to
i. tokenize the document (identifying words, spaces, tabs, punctuation marks, etc.),
ii. split sentences,
iii. tag parts of speech, and
iv. match JAPE patterns to find noun/verb phrases.
The algorithms then use the results of these applications.
Gate stores the results in an object called an Annotation Set, which is a set of Annotation objects. An Annotation object stores the following information:
a. id - a unique id assigned to the token/element
b. type - the type of the element (Token, SpaceToken, Sentence, Noun, Verb, etc.)
c. features - a map of various information, such as whether the element is a stopword and its category (or tag), e.g. NN


d. start offset - the starting position of the element
e. end offset - the ending position of the element
Text2Onto uses the 'type' property to filter the required entities and then uses the start and end offsets to find the actual word. For example, suppose our corpus begins with the following line:
Ontology evaluation is a critical task. . .
Then the information for the word 'task' is stored in an Annotation object with type 'Token', category 'NN', start offset 34 and end offset 38; Text2Onto uses the offset values to recover the exact word.
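The offset mechanism can be mirrored in a few lines. The Annotation class below is a hypothetical simplification of Gate's annotation objects, not Text2Onto's actual code; it only reproduces the fields described above.

```python
from dataclasses import dataclass, field

# Simplified stand-in for a Gate Annotation: an id, a type, a feature map,
# and character offsets into the raw text.
@dataclass
class Annotation:
    id: int
    type: str                 # e.g. "Token", "Sentence", "Noun"
    features: dict = field(default_factory=dict)
    start: int = 0            # start offset in the source text
    end: int = 0              # end offset (exclusive)

corpus = "Ontology evaluation is a critical task."

# The word "task" annotated as in the paper's example: type "Token",
# category "NN", start offset 34, end offset 38.
ann = Annotation(id=7, type="Token", features={"category": "NN"}, start=34, end=38)

# The exact surface form is recovered by slicing with the offsets.
word = corpus[ann.start:ann.end]
print(word)  # -> task
```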

After preprocessing the corpus, the controller executes the ontology learning algorithms in the appropriate order and applies the algorithms' change requests to the POM.
The execution of the algorithms takes place in three phases: a notification phase, a computation phase and a result generation phase. In the first phase, the algorithm learns about recent changes to the corpus. In the second phase, these changes are mapped to changes with respect to the reference repository. Finally, requests for POM changes are generated from the updated content of the reference repository. Text2Onto includes a Modeling Primitive Library (MPL) which makes the modeling primitives independent of the ontology language.

7.2 POM

The POM (Probabilistic Ontology Model, also called the Preliminary Ontology Model) is the basic building block of Text2Onto. It is an extensible collection of modeling primitives for different types of ontology elements or axioms, and it uses confidence and relevance annotations to capture uncertainty. It is KR-language independent and can thus be transformed into any reasonably expressive knowledge representation language, such as OWL, RDFS or F-logic. The modeling primitives used in Text2Onto are as follows:

i. concepts (CLASS)
ii. concept inheritance (SUBCLASS-OF)
iii. concept instantiation (INSTANCE-OF)
iv. properties/relations (RELATION)
v. domain and range restrictions (DOMAIN/RANGE)
vi. mereological relations
vii. equivalence

The POM is traceable: for each object, it also stores a pointer to the parts of the document from which the object was derived. It also allows multiple modeling alternatives to be maintained in parallel. Adding new primitives does not imply changing the underlying framework, which makes the POM flexible and extensible.
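A minimal sketch of such a structure, using invented class names rather than Text2Onto's actual API: each primitive carries a confidence and pointers back to the text spans it was derived from, so alternatives can coexist and be traced.

```python
from dataclasses import dataclass

# Illustrative POM sketch: primitives annotated with confidence and provenance.
@dataclass(frozen=True)
class Primitive:
    kind: str     # e.g. "CLASS", "SUBCLASS-OF", "INSTANCE-OF", "RELATION"
    args: tuple   # the ontology elements involved

class POM:
    def __init__(self):
        self.confidence = {}   # Primitive -> probability in [0, 1]
        self.provenance = {}   # Primitive -> list of (doc_id, start, end)

    def add(self, primitive, confidence, source):
        # Keep the best confidence seen so far and remember every source span,
        # so alternatives stay in parallel and each object remains traceable.
        old = self.confidence.get(primitive, 0.0)
        self.confidence[primitive] = max(old, confidence)
        self.provenance.setdefault(primitive, []).append(source)

pom = POM()
pom.add(Primitive("CLASS", ("ontology",)), 0.83, ("doc1", 0, 8))
pom.add(Primitive("SUBCLASS-OF", ("domain ontology", "ontology")), 0.61, ("doc1", 40, 55))
print(len(pom.confidence))  # -> 2
```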

7.3 Data-driven Change Discovery

An important feature of Text2Onto is data-driven change discovery, which prevents the whole corpus from being processed from scratch each time it changes. When the corpus changes, Text2Onto detects the changes and calculates POM deltas with respect to them; as the POM is extensible, it is modified without being recalculated for the whole document collection. The benefits of this feature are that document reprocessing time is saved and that the evolution of the ontology can be traced.
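The underlying idea can be sketched with per-document caches; this is illustrative only, since Text2Onto's actual machinery operates on POM change requests rather than raw counts.

```python
from collections import Counter

# Sketch of data-driven change discovery: cache per-document statistics and,
# when a document changes, apply only the delta (new minus old counts)
# instead of reprocessing the whole collection.
class IncrementalCounts:
    def __init__(self):
        self.per_doc = {}        # doc_id -> Counter of terms
        self.total = Counter()   # corpus-wide term counts

    def update(self, doc_id, text):
        new = Counter(text.lower().split())
        old = self.per_doc.get(doc_id, Counter())
        self.total += new        # apply the new counts...
        self.total -= old        # ...and retract the old ones for this doc only
        self.per_doc[doc_id] = new

idx = IncrementalCounts()
idx.update("d1", "ontology learning from text")
idx.update("d2", "ontology evaluation")
idx.update("d1", "ontology learning from corpora")   # only d1 is reprocessed
print(idx.total["ontology"], idx.total["text"])      # -> 2 0
```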

7.4 Ontology Learning Algorithms/Methods

Text2Onto combines machine learning approaches with basic linguistic approaches for learning ontologies. The different modeling primitives in the POM are instantiated and populated by different algorithms. Before the POM is populated, the text documents undergo linguistic preprocessing, initiated by the algorithm controller. Basic linguistic preprocessing involves tokenization, sentence splitting, syntactic tagging of all tokens by a POS tagger, and lemmatizing by a morphological analyzer or stemming by a stemmer. The output of these steps is an annotated corpus, which is then fed to a JAPE transducer to match the particular patterns required by the ontology learning algorithms. The algorithms use certain criteria to evaluate the confidence of the extracted entities. The following section presents the techniques and criteria used by these algorithms to extract the different ontology components.

7.4.1 Concepts
Text2Onto comes with three algorithms for extracting concepts: EntropyConceptExtraction, RTFConceptExtraction and TFIDFConceptExtraction. Each looks for the type 'Concept' in the Gate results; all of them filter the same type, and the only difference is the criterion they use for the probability/relevance calculation. These algorithms use statistical measures such as TFIDF (Term Frequency Inverted Document Frequency), entropy, C-value, NC-value and RTF (Relative Term Frequency). For each term, the values of these measures are normalized to [0, 1] and used as the corresponding probability in the POM.

1. RTFConceptExtraction
This algorithm calculates the relative term frequency: the absolute term frequency of a term t in a document d (the number of times t appears in d) divided by the maximum absolute term frequency in d (the number of occurrences of the most frequent term in d):

tf(t, d) = (absolute term frequency of t in d) / (maximum absolute term frequency in d)
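The definition can be computed directly; this is a sketch of the measure, not Text2Onto's implementation.

```python
from collections import Counter

# Relative term frequency: the term's absolute frequency in a document
# divided by the highest absolute frequency of any term in that document.
def rtf(term, document_tokens):
    counts = Counter(document_tokens)
    return counts[term] / max(counts.values())

doc = "ontology learning builds ontology models from ontology texts".split()
print(rtf("ontology", doc))  # -> 1.0  (the most frequent term)
print(rtf("learning", doc))  # -> 0.3333333333333333  (1 occurrence / 3)
```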

2. TFIDFConceptExtraction
This algorithm calculates the term frequency-inverse document frequency, which is the product of TF (term frequency) and IDF (inverse document frequency). IDF is obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of that quotient:

tf-idf(t, d, D) = tf(t, d) × idf(t, D)

where

idf(t, D) = log(|D| / df(t))

|D| = the number of all documents
df(t) = the number of documents containing the term t
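A direct sketch of these formulas, using the relative term frequency defined above as tf; the corpus and function names are illustrative.

```python
import math
from collections import Counter

def tf(term, doc):
    # Relative term frequency within one document.
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, docs):
    # log of (number of documents / number of documents containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "ontology learning from text".split(),
    "ontology evaluation methods".split(),
    "statistical text mining".split(),
]
# "learning" occurs in 1 of 3 documents, so idf = log(3) ~ 1.0986
print(round(tfidf("learning", docs[0], docs), 4))  # -> 1.0986
```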

3. EntropyConceptExtraction
This algorithm computes a measure combining the C-value (an indicator of termhood) and the NC-value (a contextual indicator of termhood).

The C-value is a frequency-based measure sensitive to multi-word terms:

C-value(a) = log2 |a| · f(a), if a is not nested
C-value(a) = log2 |a| · ( f(a) − (1/|Ta|) Σ_{b∈Ta} f(b) ), otherwise

where f(a) is the frequency of a, |a| its length in words, and Ta the set of candidate terms that contain a.

The NC-value incorporates information from context words indicating termhood, weighting each context word w as

weight(w) = t(w) / n

where t(w) is the number of times w appears in the context of a term and n is the total number of terms considered.
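The C-value case split can be sketched as follows. The term frequencies are invented, and the `a in t` substring check is a crude stand-in for proper term nesting.

```python
import math

# C-value as given above: |a| is the term's length in words, f(a) its
# frequency, and Ta the set of candidate terms that contain a.
def c_value(a, freq):
    nested_in = [t for t in freq if t != a and a in t]   # Ta (substring test)
    if not nested_in:
        return math.log2(len(a.split())) * freq[a]
    nested_sum = sum(freq[b] for b in nested_in)
    return math.log2(len(a.split())) * (freq[a] - nested_sum / len(nested_in))

freq = {
    "ontology learning": 6,
    "ontology learning tool": 2,
    "statistical method": 4,
}
# "statistical method" is not nested: log2(2) * 4 = 4.0
print(c_value("statistical method", freq))  # -> 4.0
# "ontology learning" is nested in one longer term: log2(2) * (6 - 2/1) = 4.0
print(c_value("ontology learning", freq))   # -> 4.0
```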

7.4.2 Instances

An algorithm called TFIDFInstanceExtraction is available in Text2Onto for the extraction of instances. It filters the 'Instance' type from the Gate results and computes TFIDF as in TFIDFConceptExtraction.

7.4.3 General relations
General relations are identified using a linguistic approach. The SubcatRelationExtraction algorithm filters the types 'TransitiveVerbPhrase', 'IntransitivePPVerbPhrase' and 'TransitivePPVerbPhrase' in the Gate results, which are obtained by shallow parsing, to identify the following syntactic frames:
• transitive, e.g. love(subj, obj)
• intransitive + PP-complement, e.g. walk(subj, pp(to))
• transitive + PP-complement, e.g. hit(subj, obj, pp(with))
For each verb phrase, it finds the subject, object and associated preposition (by filtering the nouns and verbs of the sentence), stems them, and prepares the relation.
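The transitive frame can be sketched over POS-tagged tokens. This toy scan and its one-character stemmer only mimic the idea; Text2Onto itself obtains these frames from Gate's shallow parsing and JAPE patterns.

```python
# Illustrative extraction of the transitive frame verb(subj, obj): scan
# POS-tagged tokens for a noun-verb-noun sequence and crudely stem the verb.
def transitive_frames(tagged):
    frames = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if t1.startswith("NN") and t2.startswith("VB") and t3.startswith("NN"):
            stem = w2[:-1] if w2.endswith("s") else w2   # crude stemming
            frames.append((stem, w1, w3))                # verb(subj, obj)
    return frames

tagged = [("Text2Onto", "NNP"), ("extracts", "VBZ"), ("concepts", "NNS")]
print(transitive_frames(tagged))  # -> [('extract', 'Text2Onto', 'concepts')]
```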

7.4.4 Subclass-of relations

The identification of subclass-of relations involves several algorithms, which use the hypernym structure of


WordNet, match Hearst patterns and apply linguistic heuristics. The results of these algorithms are combined through combination strategies, and they depend on the results of the concept extraction algorithms. The relevance calculation of one of the algorithms is presented below.
1. WordNetClassificationExtraction
This algorithm extracts subclass-of relations among the extracted concepts by identifying the hypernym structure of the concepts in WordNet. If a is a subclass of b, then the relevance is calculated as

relevance = (number of synonyms of a for which b is a hypernym) / (number of synonyms of a)
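The relevance formula can be illustrated with a hand-made synonym/hypernym map standing in for the real WordNet; all entries below are invented for the example.

```python
# Toy stand-in for WordNet lookups: synonyms of a term and the hypernyms
# of each synonym.
synonyms = {"dog": ["dog", "domestic_dog", "canis_familiaris"]}
hypernyms = {
    "dog": ["animal", "canine"],
    "domestic_dog": ["animal"],
    "canis_familiaris": ["canine"],
}

# Relevance of the candidate subclass-of(a, b), per the formula above.
def relevance(a, b):
    syns = synonyms[a]
    supporting = sum(1 for s in syns if b in hypernyms.get(s, []))
    return supporting / len(syns)

# 2 of the 3 synonyms of "dog" have "animal" as a hypernym.
print(relevance("dog", "animal"))  # -> 0.6666666666666666
```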

7.4.5 Instance-of relations

Lexical patterns and context similarity are taken into account for instance classification. A pattern-matching algorithm similar to the one used for discovering mereological relations is also used for the extraction of instance-of relations.

7.4.6 Equivalence and equality

The algorithm calculates the similarity between terms on the basis of contextual features

extracted from the corpus.

7.4.7 Disjointness

A heuristic approach based on lexico-syntactic patterns is implemented to learn disjointness. The algorithm learns disjointness from enumeration patterns of the form NounPhrase1, NounPhrase2, ... (and/or) NounPhrasen.
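A toy version of this enumeration heuristic, with a bare regex standing in for the JAPE noun-phrase patterns (the function name is ours):

```python
import re

# Enumerations like "NP1, NP2 ... and/or NPn" yield candidate pairs of
# disjoint classes. \w+ stands in for real noun-phrase chunks.
def disjointness_candidates(sentence):
    m = re.search(r"\w+(?:, \w+)+ (?:and|or) \w+", sentence)
    if not m:
        return []
    items = re.split(r", | and | or ", m.group(0))
    # Every pair in the enumeration is a disjointness candidate.
    return [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]

print(disjointness_candidates("Animals such as cats, dogs and horses"))
# -> [('cats', 'dogs'), ('cats', 'horses'), ('dogs', 'horses')]
```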

7.4.8 Subtopic-of relations

Subtopic-of relations are discovered using a method for building concept hierarchies. There is also an algorithm for extracting this kind of relationship from previously identified subclass-of relations.

7.5 NeOn Toolkit

The NeOn Toolkit is an open-source, multi-platform ontology engineering environment that provides comprehensive support for the ontology engineering life cycle. It is based on the Eclipse platform and provides various plug-ins for different activities in ontology building. The following plug-ins are within the scope of this case study:

7.5.1 Text2Onto plug-in

This is a graphical front-end for Text2Onto available in the NeOn Toolkit. It enables the integration of Text2Onto into a process of semi-automatic ontology engineering.

7.5.2 LeDA Plugin

LeDA, an open-source framework for the automatic generation of disjointness axioms, has been implemented in this plug-in, which was developed to support both the enrichment and the evaluation of the acquired ontologies. The plug-in facilitates the customized generation of disjointness axioms for various domains by supporting both the training and the classification phase.

7.6 OntoCase
OntoCase is an approach that uses ontology patterns throughout an iterative ontology construction and evolution framework. In OntoCase the patterns constitute the backbone of the reusable solutions, because they can be utilized directly as solutions to specific modeling problems. The central repository consists of a pattern catalogue, the ontology architecture and other reusable assets. The OntoCase cycle consists of four phases: retrieval, reuse, evaluation and revision, and discovery of new pattern candidates. The first phase corresponds to input analysis and pattern retrieval: the input is analyzed and the derived input representation is matched against the pattern base to select appropriate patterns. The second phase includes pattern specialization, adaptation and composition, i.e. the process of reusing the retrieved patterns and constructing an improved ontology. The third phase concerns the evaluation and revision of the ontology to improve its fit to the input and its quality. The final phase includes the discovery of new pattern candidates or other reusable components, as well as the storing of pattern feedback.


VIII. Learning disjointness axioms (LeDA)

LeDA is an open-source framework for learning disjointness [3] based on a machine-learning classifier (Naive Bayes). The classifier is trained on a vector of feature values and manually created disjointness axioms (i.e. pairs of classes labeled 'disjoint' or 'not disjoint'). The following features are used in this framework:

Taxonomic overlap: Taxonomic overlap is the set of common individuals.

Semantic distance: The semantic distance between two classes c1 and c2 is the minimum length of a

path consisting of subsumption relationships between atomic classes that connects c1 and c2.

Object properties: This feature encodes the semantic relatedness of two classes, c1 and c2, based on

the number of object properties they share.

Label similarity: This feature gives the semantic similarity between two classes based on a common prefix or suffix shared by them. Levenshtein edit distance, Q-grams, and Jaro-Winkler distance are taken into account to calculate label similarity in LeDA.

WordNet similarity: LeDA uses a WordNet-based similarity measure that computes the cosine similarity between vector-based representations of the glosses associated with the two synsets.

Features based on the learned ontology: From the already acquired knowledge, such as terminological overlap, classes, individuals, subsumption, and class-membership axioms, further features are calculated, viz. subsumption, taxonomic overlap of subclasses and instances, and lexical context similarity.
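Two of the simpler features above can be sketched in Python (the function names and the Jaccard normalization are ours, not LeDA's API):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two labels."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def taxonomic_overlap(individuals1: set, individuals2: set) -> float:
    """Overlap of the individuals asserted for two classes, here
    normalized as a Jaccard ratio."""
    if not individuals1 and not individuals2:
        return 0.0
    return len(individuals1 & individuals2) / len(individuals1 | individuals2)
```

A high taxonomic overlap or a small edit distance between labels would count as evidence against disjointness.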

IX. LExO for Learning Class Descriptions

LExO (Learning Expressive Ontologies) [3] automatically generates DL axioms from natural language sentences. It analyzes the syntactic structure of the input sentence and generates a dependency tree, which is then transformed into an XML-based format and finally into DL axioms by means of manually engineered transformation rules. However, this automated DL generation needs human intervention to verify that all axioms are correct.

X. RELExO

RELExO (Relational Exploration for Learning Expressive Ontologies) is a tool for the difficult and time-consuming phase of ontology refinement [4]. It not only supports the user in a stepwise refinement of the ontology but also helps to ensure the compatibility of a logical axiomatization with the user's conceptualization. It combines a method for learning complex class descriptions from textual definitions with the Formal Concept Analysis (FCA)-based technique of relational exploration. Its LExO component assists the ontologist in axiomatizing atomic classes; the exploration part helps to integrate newly acquired entities into the ontology. It also helps the user detect inconsistencies or mismatches between the ontology and her conceptualization and hence provides a stepwise approximation of the user's domain knowledge.

XI. Alignment to Top-Level Ontologies

This is a special case of ontology matching where the goal is primarily to find correspondences between the more general concepts or relations of a top-level ontology and the more specific concepts and relations of the engineered ontology. Aligning an ontology to a top-level ontology can also be compared to automatically specializing or extending the top-level ontology. Methods like lexical substitution may be used to find clues as to whether a more general concept is related to a more specific one in the other ontology. The alignment may also exploit ontology engineering patterns: determining that a pattern can be applied, and applying it, then provides a connection to the top-level ontology.

XII. Experiment

In order to evaluate the results of Text2Onto and improve them, some experiments were carried out. The objectives of the experiments were:

• To analyze the various algorithms and criteria used by Text2Onto for extracting different ontology components.
• To analyze the results produced by Text2Onto.
• To compare the components extracted by Text2Onto with the ones extracted manually.
• To analyze errors found in the ontology built by Text2Onto and identify their origin.
• To analyze Text2Onto outcomes when adding a meta-model of the ontology as an additional input.

Details on the experimental data and the experiment protocol are presented in the following sections.


XIII. Experimental Data

The experiments were conducted on three individual texts. The first text, which we will call 'Abstract' from here on, was a compilation of the abstracts of four different papers. The remaining texts will be referred to as 'Text1' and 'Text2'. All of these texts were related to ontology building and ontology learning tools. Ontologies were built from these texts both manually and with Text2Onto.

XIV. Experimental Protocol

The experiments were performed in five phases. The first phase involved building ontologies manually from the three texts. The second phase was concerned with developing the ontologies using Text2Onto. In the third phase, the ontology built by Text2Onto was compared with the manual one. In the next phase, the meta-models of the texts were fed to Text2Onto and the corresponding ontologies were built again. Finally, the results were compared with the earlier ontologies. These phases are described in detail in the following section:

14.1 Experimental Work-flow

The following steps were carried out for each text:

1. Building ontology manually

Methontology was followed to build ontologies from the three texts manually. All the steps, such as glossary building, meta-model, and taxonomy, were followed while building the ontologies from Abstract and Text2, whereas the ontology of Text1 was provided to us. The ontology was conceptualized in the following way:

1. POS tagging of all the terms in the document.
2. Identifying the concepts and relations from the validated terms.
3. Making the meta-model. The aim is to subsume all the accepted concepts under some of the core concepts.
4. Identifying the accepted terms (concepts), their related core concepts, and finding their synonyms.
5. Defining the is-a hierarchy for the concepts and the identified core concepts.
6. Identifying other binary relations.
7. Validating the meta-model.

2. Building ontology using Text2Onto

This step involved the use of Text2Onto to build the same ontology automatically.

3. Analysis of Text2Onto results

The analysis phase was itself done in two phases. First, the results of the different algorithms of Text2Onto were compared with each other in order to find interesting criteria for the extraction of the different components. This was done for concept, instance, relation, and hierarchy extraction. The main criterion for the comparison was the relevance value.

Secondly, a comparison and study of the differences between the results of the tasks performed in the previous two phases was carried out to assess the quality of the ontology built by the tool. The comparison was very detailed, in the sense that all concepts, instances, relations, and hierarchies extracted by the two methods were compared. It was followed by the identification of the causes of the differences and of the errors and shortcomings in the tool's performance.

4. Adding Meta-model to the ontology using Text2Onto

The idea was to observe whether Text2Onto gives better results when the ontology is built on top of its meta-model. For this, the meta-model built manually in the first phase was introduced into Text2Onto and the ontologies were built upon their corresponding meta-models. This process involved the following steps:

(a) Conversion of the meta model into text

In order to get a POM of the meta-model, we converted the meta-model into text from which Text2Onto can extract the core concepts and the relations between them. Details about the conversion process are given in Section XVI (Conversion of Meta-Model to Text).

(b) Obtaining meta model POM

The meta model text was fed to Text2Onto to obtain a meta model POM which contained all core concepts and relations between them.

(c) Improving the ontology using meta-model

Once the POM had been obtained from Text2Onto, the original text was added to it to build a new ontology combined with the meta-model.


5. Comparison of the ontologies built with and without the meta-model

In this phase, the ontology built in the second phase was compared with the one built using the meta-model. Relevance values, the identification of new components, and hierarchies were considered during the comparison.

XV. Results And Observations

15.1 Comparison of Algorithms and Criteria of Text2Onto

The algorithms and criteria used by Text2Onto for extracting ontology components were studied in detail so as to compare their performance. The comparison was done based on the relevance values computed by these algorithms.

15.1.1 Observations

Though the relevance values in the case of entropy differ from those of the other algorithms, they preserve the same relations and relative values for the concepts. The same holds for combinations of one or more such evaluation algorithms. It was observed that the order of the extracted components is independent of the algorithm/criterion used, so we cannot say that one algorithm or criterion is superior to the others. We observed the same behavior in all three texts.
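The ranking comparison can be sketched as follows (the relevance values below are illustrative, not actual Text2Onto output): the absolute scores differ between measures, but the ordering of the concepts is the same.

```python
def ranking(scores: dict) -> list:
    """Concepts ordered by decreasing relevance value."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical relevance values from two different measures.
rtf     = {"ontology": 0.90, "tool": 0.50, "text": 0.20}
entropy = {"ontology": 0.31, "tool": 0.22, "text": 0.11}

same_order = ranking(rtf) == ranking(entropy)  # orderings agree
```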

XVI. Conversion Of Meta-Model To Text

In order to try to improve the ontology built by Text2Onto, the meta-model is used and is translated into text. As all concepts and relations of the meta-model should be identified when it is processed by the tool, the first attempt was to write a paragraph about the meta-model. This worked fine for most of the concepts, but very few relationships could be identified, some concepts were left out, and some extra concepts were included (those used in the paragraph to structure the meta-model translation). The next attempt was to write simple sentences consisting of two nouns (the concepts) related by a verb (the relation between the two concepts). We tried to use only the core concepts and relations from the text as much as possible. However, this too could not identify all the relations properly. Finally, a new algorithm was proposed to achieve the desired goal and enhance the results of Text2Onto. Below are the translations of the meta-models for the various experimental data used.

16.1 Abstract Text

The meta-model of this text is given in Figure 1. For this meta-model, we used the following lines to construct the meta-model POM in Text2Onto.

A system is composed of methods. A method has method components.

A tool implements methods.

An algorithm is used by methods.

An expert participates in ontology building step.

Ontology building step uses resources.

A resource is stored in data repository.

A term is included in resources.

Ontology building step is composed of ontology building process.

Ontology has ontology components.

A user community uses ontologies.

Ontology describes domain.


Figure 1: Abstract-Text Meta Model

16.2 Text1

The meta-model of this text is given in Figure 2.

Figure 2: Text1 Meta Model

16.3 Text2

The meta-model of this text is given in Figure 3, and the corresponding meta-model text is given below.

Domain has ontology.

Ontology is composed by ontology components. Ontology is built by methodology.

Tool builds ontology.

Activity is guided by methodology.

Activity produces model.

Representation is resulted by model.

Tool supports activity.

Organization develops tool.

Methodology is developed by organization.

Tool uses language.


Person uses tool.

Person creates ontology.

Figure 3: Text2 Meta Model

16.4 Comparison of Manual and Automated Ontologies

This section compares the two methods of ontology building, i.e. manual and automated with the tool Text2Onto. The aim of the comparison is to evaluate the process of ontology building by the tool and then analyze the results to suggest improvements to the tool.

16.4.1 Manual Ontology - Abstract

Abstract text was the shortest of all texts. It had 536 terms in total out of which 34 terms were

accepted as concepts and 9 as instances.

16.4.2 Automated Ontology - Abstract

The same text was fed to Text2Onto to automate the process of ontology building. As the relative importance of ontology components based on relevance values was found to be independent of the algorithms used, we could choose any algorithm from the available list. As we were extracting an ontology from a single document, the algorithms that use TFIDF criteria were not interesting for us, so we did not choose them during the analysis. The evaluation algorithms used in Text2Onto assigned relevance values to the concepts and other identified components.

Text2Onto did not support writing the results to a separate file, and hence we added another method that saves the results in a separate Excel file for each execution of Text2Onto. This was also necessary for the later comparison phases.
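The reason TFIDF is uninformative here is easy to see: with a one-document corpus, every term present has a document frequency equal to the corpus size, so the IDF factor is zero for all terms. A sketch (the function is illustrative, not the Text2Onto code):

```python
import math

def tfidf(term: str, doc: list, corpus: list) -> float:
    """Standard TFIDF: term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

doc = ["ontology", "tool", "ontology"]
score = tfidf("ontology", doc, [doc])  # single-document corpus: idf = log(1) = 0
```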

Text2Onto extracted 85 concepts, 14 individuals, and 3 general relations.

16.4.3 Comparison of manual and automated ontology - Abstract

The two ontologies were compared mainly on the identified concepts, instances, and relations. Out of 34 concepts extracted manually, only 26 matched the ones extracted by Text2Onto. Only 7 instances were common to both ontologies, and none of the relations were common. We observed that the manual ontology was better at identifying concepts, because the ontology made by Text2Onto also included some irrelevant concepts. Another major problem was the identification of composite concepts: unlike in the manual ontology, composite concepts (consisting of more than one atomic word) were largely not identified. The relations were not at all satisfactory.

The possible reasons attributed for these differences are as follows:


1. The text was not consistent as a whole.
The text was basically a summarization of different texts and hence lacked cohesion between its different paragraphs. Thus there was a need to try another, longer and better text in order to conclude anything significant.
2. The frequency of most of the terms (concepts and relations) was very low.

16.4.4 Manual ontology - Text1

For this ontology, there were 4807 terms after tokenization, of which 472 were nouns and 226 were verbs. After performing stemming, the number of nouns was reduced to 357, close to a 25% reduction from the original count.

16.4.5 Automated ontology - Text1

Text1 was fed to Text2Onto to build the ontology automatically. Text2Onto extracted 406 concepts, 94 instances, and 16 relations.

16.4.6 Comparison of manual and automated ontologies - Text1

Compared to the 357 terms of the manual ontology, Text2Onto extracted 406 terms. Among them, only 87 concepts were common to both. Some highly irrelevant terms were also included in the Text2Onto results on the basis of their high relevance values. On the other hand, some important composite terms were missing from the automated ontology.

16.4.7 Manual ontology - Text2

Following the same procedure as above for building the manual ontology, there were 4761 terms in the knowledge base. Finally, 667 valid terms were refined from this knowledge base, of which ultimately 200 terms were accepted as concepts of the ontology.

16.4.8 Automated ontology - Text2

350 terms (concepts) were extracted from this text when it was run with Text2Onto. A lot of

concepts were insignificant and had to be rejected when the comparison was made.

16.4.9 Comparison of manual and automated ontologies - Text2

This automated ontology was better than the earlier ones, as it could identify many relations, and its is-a hierarchy was better than the others.

16.4.10 Observations

Relevance Values and their roles

In order to assess the results of Text2Onto and the possibility of automating the process of ontology building, we examined the role of the relevance values for concepts in Text2Onto. The following observations were made:

• Most of the terms extracted by Text2Onto as concepts can be accepted based on their relevance values.
• The core concepts generally have very high relevance.
• Most of the terms with high relevance values are accepted.
• There are concepts which are always rejected despite their very high values. After studying many papers and previous works in this field, we found no general rule that can be applied to automatically reject these terms, but some corpus-specific rules can be written.
• There are concepts which are accepted despite their low values. In order to automate the third and fourth cases, we tried to find out more about these kinds of concepts. We observed that the high-relevance terms that are generally rejected occur in the same kind of pattern. For example, the concept 'order' generally appears as "in order to". Thus, predefining many such patterns to exclude can be one solution for rejecting some terms despite their high relevance values.
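The proposed exclusion-pattern idea can be sketched as follows (the pattern list and function are ours, for illustration): a term is rejected when every occurrence of it falls inside an excluded fixed phrase.

```python
import re

# Hypothetical exclusion list: term -> fixed phrases it commonly appears in.
EXCLUDE_PATTERNS = {"order": [re.compile(r"\bin order to\b", re.I)]}

def keep_concept(term: str, text: str) -> bool:
    """Keep the term only if it occurs outside all excluded phrases."""
    occurrences = len(re.findall(r"\b%s\b" % re.escape(term), text, re.I))
    inside = sum(len(p.findall(text)) for p in EXCLUDE_PATTERNS.get(term, []))
    return occurrences > inside

rejected = keep_concept("order", "In order to build an ontology.")        # False
accepted = keep_concept("order", "We order the concepts alphabetically.")  # True
```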

16.5 Analysis of errors

16.5.1 Identification of errors

The following errors were identified while comparing the ontologies built manually and the ones built using Text2Onto:

1. Some concepts were also identified as instances by Text2Onto, e.g. ontology, WSD.
2. Acronyms were not identified by Text2Onto, e.g. SSI, POM.


3. Synonyms were not identified properly.
4. Very few relations were identified by Text2Onto, most of which were not appropriate (interesting) at all.
5. The instance-of algorithm did not give the instances that are given by the instance algorithm.
6. Some verbs like extract and inspect, which we had considered as relations, were identified as concepts by Text2Onto.

16.5.2 Identification of causes of errors

After an in-depth study of the algorithms of Text2Onto, the following causes of errors were observed:

1. The POS tagger used by GATE tags some words incorrectly, e.g. the verb extract was tagged as a noun.
2. Errors may also be due to grammatical mistakes in the corpus file.
3. In the case of the Abstract text, errors may also be due to its length and content. The text contained 4 paragraphs from different papers and hence had few common terminologies.
4. The algorithms to extract concepts and instances work independently. Thus, the identification of a term as both concept and instance is not handled in Text2Onto.

5. The SubcatRelationExtraction algorithm can extract relations from simple sentences only.
The patterns it can identify are:

Subject + transitive verb + object
Subject + transitive verb + object + preposition + object
Subject + intransitive verb + preposition + object

It identifies as relations only those verbs which come with a singular subject (concept). For example, it can extract the relation build from "A tool builds ontology" but not from "Tools build ontology".
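The singular-subject limitation can be illustrated with a toy matcher for the first pattern (a sketch with Penn Treebank tags, not the Text2Onto code): a plural subject (NNS) simply fails to match.

```python
def extract_svo(tagged: list):
    """Return (subject, verb, object) for an 'NN VBZ NN'-shaped sentence,
    i.e. singular noun + 3rd-person-singular verb + singular noun."""
    if [tag for _, tag in tagged] == ["NN", "VBZ", "NN"]:
        return tagged[0][0], tagged[1][0], tagged[2][0]
    return None

# "tool builds ontology" (singular subject) matches the pattern...
svo = extract_svo([("tool", "NN"), ("builds", "VBZ"), ("ontology", "NN")])
# ...but "tools build ontology" (plural subject) does not.
missed = extract_svo([("tools", "NNS"), ("build", "VBP"), ("ontology", "NN")])
```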

XVII. Improvement Of Text2Onto Results

As the results of Text2Onto were not good compared to the manual ontology, we did two things to improve them. First, we added an algorithm to improve the relation extraction of Text2Onto. Second, we performed some experiments on Text2Onto, adding the meta-model to the ontologies built above. The following sections describe the added algorithm and the results and observations from the experiments.

17.1 Algorithm to improve Text2Onto results

The relations extracted by Text2Onto were not interesting at all. Moreover, we found it difficult to make Text2Onto extract all the relations from the meta-model text. So we decided to add an algorithm to improve the result of relation extraction in Text2Onto. To extract more relations in order to make a better meta-model, we added two JAPE rules along with an algorithm to process them. The added JAPE rules identify sentences in passive voice and sentences with more than one verb (an auxiliary verb followed by a main verb) with a preposition, i.e. the following syntactic patterns:

• Subject + be-verb + main verb + "by" + object, e.g. "Ontology is built by experts"
• Subject + auxiliary verb + main verb + preposition + object, e.g. "Ontology is composed of components"

Though these patterns are similar to each other, we added two patterns instead of one in order to identify these grammatically significant patterns separately. The new algorithm can find these patterns in both the meta-model and the ontology text. As a result, we could obtain relations that were not identified in the text earlier. The added JAPE expressions are given below:

Rule: PassivePhrase
(
  ({NounPhrase} | {ProperNounPhrase}): object
  {SpaceToken.kind == space}
  ({Token.category == VBZ} | {Token.string == "is"}): auxverb
  {SpaceToken.kind == space}
  ({Token.category == VBN} | {Token.category == VBD}): verb
  {SpaceToken.kind == space}
  ({Token.string == "by"}): prep
  {SpaceToken.kind == space}
  ({NounPhrase} | {ProperNounPhrase}): subject
): passive
-->
:passive.PassivePhrase = {rule = "PassivePhrase"},
:verb.Verb = {rule = "PassivePhrase"},
:subject.Subject = {rule = "PassivePhrase"},
:object.Object = {rule = "PassivePhrase"},
:prep.Preposition = {rule = "PassivePhrase"}

Rule: MultiVerbsWithPrep
(
  ({NounPhrase} | {ProperNounPhrase}): subject
  {SpaceToken.kind == space}
  ({Token.category == VBZ} | {Token.category == VB}): auxverb
  {SpaceToken.kind == space}
  ({Token.category == VBN} | {Token.category == VBD}): verb
  {SpaceToken.kind == space}
  ({Token.category == IN}): prep
  {SpaceToken.kind == space}
  ({NounPhrase} | {ProperNounPhrase}): object
): mvwp
-->
:mvwp.MultiVerbsWithPrep = {rule = "MultiVerbsWithPrep"},
:verb.Verb = {rule = "MultiVerbsWithPrep"},
:subject.Subject = {rule = "MultiVerbsWithPrep"},
:object.Object = {rule = "MultiVerbsWithPrep"},
:prep.Preposition = {rule = "MultiVerbsWithPrep"}

These JAPE expressions are used by the GATE application to match the syntactic patterns. Using the new algorithm, we could extract more relations from the original text.
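For intuition, the passive-voice pattern can be approximated outside GATE with a plain regular expression (this sketch is ours; the actual implementation uses the JAPE rules above). It covers regular -ed participles plus the irregular "built", and swaps the grammatical roles back to semantic subject and object:

```python
import re

# Object + "is" + past participle + "by" + Subject, as in the JAPE rule.
PASSIVE = re.compile(r"^(?P<obj>\w+) is (?P<verb>\w+ed|built) by (?P<subj>\w+)")

def passive_relation(sentence: str):
    """Return (subject, verb, object) with semantic roles restored."""
    m = PASSIVE.match(sentence)
    return (m.group("subj"), m.group("verb"), m.group("obj")) if m else None

rel = passive_relation("Ontology is built by experts")
```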

17.2 Enhancement of Ontology using Meta-Model

The main idea was to improve the results of Text2Onto so that the process of building the ontology can be automated. First, the text was fed to Text2Onto and the shortcomings were identified. Then, in order to overcome them, we fed the meta-model to the tool so as to obtain a better extraction of concepts, relations, and taxonomy. The experiment was carried out for the three text documents. The results obtained from the text alone were compared with the results obtained from the meta-model plus the text to assess the improvement of the Text2Onto results.

17.2.1 Observations

The following observations were made when the meta-model and the text were used on the same POM to make the ontology:

1. All the core concepts were identified and their relevance increased (the core concepts had been identified earlier as well).
2. The core concepts not present in the text had greater values.
3. The relations from the meta-model were identified and included in the ontology. Due to the addition of more patterns, some further relations were identified from the text. However, the useful relations are limited to the core concepts.


4. The hierarchy does not seem to be improved by the algorithms VerticalRelationsConceptClassification and PatternConceptClassification. Rather, core concepts consisting of composite terms are further classified by these algorithms; e.g. Ontology component was classified under Component. We have not yet checked this with the WordnetConceptClassification algorithm, as it gives lots of irrelevant subclass-of relations.

From these behaviors, we can suggest the following ideas for making the meta-model:

• We can build the meta-model with terms not present in the text (point 2).
• If terms present in the text are used for making the meta-model, we can try to increase the frequency of the core concepts in the meta-model itself (point 1).
• We can avoid composite terms in the meta-model as much as possible (point 4).

XVIII. Conclusion

We studied the architecture and working of Text2Onto, a tool that extracts ontologies from textual input, and analyzed its results by conducting experiments with three texts. As part of the experiments, ontologies were built both manually and using the tool, and they were compared with each other. After a detailed analysis of the results, we reached the following conclusions:

1. The relevance measure cannot be a general criterion to accept or reject all terms.
In the automated ontology, there are several terms that have high relevance values and are still rejected by the experts because they are not important for the ontology. There are also terms which, despite having significantly low relevance values, are accepted; this is especially common with the core concepts. Hence the idea of directly using relevance values for accepting or rejecting concepts needs further refinement.

2. The meta-model could not improve the ontology in terms of its is-a hierarchy.
Though the meta-model increased the relevance values of the core concepts, the is-a hierarchy was not improved. Even with more extracted relations and properly identified core concepts from the meta-model, the hierarchy could not be made better: identifying the relations and concepts has no effect on the results of the subclass-of algorithms. As stated above, a few refinements can be made to address this; they are suggested in the next section.

XIX. Future Work

From the study of Text2Onto and the outcome of the analysis of its results, we suggest the following future work and enhancements to Text2Onto.

1. Enhance the use of the meta-model to modify the is-a hierarchy of the ontology. After adding the corpus to the upper ontology (built from the meta-model), we should increase the relevance values of the concepts that were identified only in the upper ontology, because those core concepts may not be frequent or very relevant.

2. We can try to manually include the following kind of hierarchy in the ontology.
Text2Onto uses the following idea while extracting relations: if A <is related to> B and C <is related to> D, then A <is related to> D and C <is related to> B as well. This kind of relation structure can be exploited to improve the hierarchy of concepts: if A <related to> B and C <related to> D, then C and D can be considered subclasses of A and B, respectively. Though this idea may not be applicable to all relations, we can enhance the meta-model significantly for relations with the same name.
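The suggested heuristic can be sketched as follows (the encoding and function name are ours): when two relation triples share the same relation name, the second pair's arguments become subclasses of the first pair's, respectively.

```python
def infer_subclasses(triples: list) -> list:
    """triples: (subject, relation_name, object); returns (sub, super) pairs."""
    first_pair, subclass_of = {}, []
    for s, r, o in triples:
        if r in first_pair:
            a, b = first_pair[r]
            subclass_of += [(s, a), (o, b)]  # C subclass-of A, D subclass-of B
        else:
            first_pair[r] = (s, o)
    return subclass_of

hierarchy = infer_subclasses([("tool", "builds", "ontology"),
                              ("Text2Onto", "builds", "POM")])
```

Here the shared relation name "builds" would place Text2Onto under tool and POM under ontology, which is the kind of is-a link the meta-model alone could not produce.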

3. Another algorithm can be added where some "unwanted" domain concepts are predefined and hence prevented from being included in the ontology. This task requires human interaction before starting to build the ontology, because the "interestingness" of the concepts depends significantly on the domain. A similar approach can be followed for the "infrequent" but "significant" concepts of a particular domain. These two approaches could let us use the relevance measure as a significant criterion to accept or reject a term, and hence overcome the difference in concepts between the manual and automated ontologies.


4. As the algorithms are executed separately, some terms are identified as both concepts and instances.
A feature (or post-processing step) can be added so that each term is listed either as a concept or as an individual, but not both. Post-processing is also required to remove unnecessary or irrelevant subsumption relations. Synonyms can be taken into account to improve the result of the subsumption algorithm.
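Such a post-processing step might look as follows (a sketch; resolving the conflict by the higher relevance value is our assumption, not a Text2Onto feature):

```python
def resolve_conflicts(concepts: dict, instances: dict):
    """A term appearing in both lists is kept only where its relevance
    value is higher; the other entry is removed."""
    for term in set(concepts) & set(instances):
        if concepts[term] >= instances[term]:
            del instances[term]
        else:
            del concepts[term]
    return concepts, instances

concepts, instances = resolve_conflicts(
    {"ontology": 0.9, "tool": 0.4}, {"ontology": 0.3, "WSD": 0.5})
```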

5. A module can be added to identify acronyms. For example, from the text, "POM" and "probabilistic ontology model" should be identified as one term.
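A minimal acronym matcher along these lines (the initials heuristic is ours) could be:

```python
def is_acronym(short: str, expansion: str) -> bool:
    """An acronym matches an expansion when it equals the initials of the
    expansion's words, so both can be merged into a single term."""
    initials = "".join(word[0] for word in expansion.split())
    return short.lower() == initials.lower()

match = is_acronym("POM", "probabilistic ontology model")
```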

References
[1] Mariano Fernandez, Asuncion Gomez-Perez, and Natalia Juristo. Methontology: From ontological art towards ontological engineering. 1997.
[2] Tom Gruber. What is ontology? 1992. http://www-ksl.stanford.edu/kst/what-is-an-ntology.html.
[3] Volker J. Prototype for learning networked ontologies. Deliverable D3.8.1 of the NeOn project. 2009.
[4] Volker Johanna and Blomqvist Eva. Evaluation of methods for contextualized learning of networked ontologies. Deliverable D3.8.2 of the NeOn project. 2008.
[5] Corcho O., Fernandez-Lopez M., Perez A. G., and Lopez-Cima A. Building legal ontologies with Methontology and WebODE. Pages 142-157, 2003.
[6] Buitelaar P., Cimiano P., and Magnini B. Ontology learning from text: an overview. In Ontology Learning from Text: Methods, Applications and Evaluation, pages 3-12, 2005.
[7] Cimiano P. and Volker J. Text2Onto - a framework for ontology learning and data-driven change discovery. 2005.