Use of ontologies in natural language processing

1

Use of Ontologies in Natural Language Processing

Athman Hajhamou

Computer and Modeling Laboratory – USMBA – FSDM – Fès

2

Summary

Limitations of classical approaches

Use of Ontology

State of the Art

3

Limitations of classical approaches

The huge number of documents available on the Web makes finding relevant ones a challenging task. Full-text search, which is still the most popular form of search and is provided by the most widely used services such as Google, is very useful for retrieving documents, but it is normally not suitable for finding relevant documents on a specific topic that the user has not yet seen.

4

Limitations of classical approaches

The major reasons why purely text-based search fails to find some of the relevant documents are the following:

Vagueness of natural language: synonyms, homographs and inflection of words can all fool algorithms that see search terms only as sequences of characters.

High-level, vague concepts: high-level, vaguely defined abstract concepts such as the Kosovo conflict, the Industrial Revolution or the Iraq War are often not mentioned explicitly in relevant documents, so present search engines cannot find those documents.

5

Limitations of classical approaches

Semantic relations: relations such as partOf cannot be exploited. For example, users who search for the Great Maghreb will not find relevant documents that mention only Rabat or Morocco.

Time dimension: keyword matching is not adequate for handling time specifications. If we search for documents about the “XX century” using exactly this phrase, relevant resources that contain only character sequences such as 1945 or 1956 will not be found by simple keyword matching.
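
The partOf and time limitations can be made concrete with a small sketch. The documents, the `PART_OF` table and the year range are illustrative assumptions, not data from the presentation, and plain substring matching stands in for a full-text engine.

```python
import re

# Toy documents (illustrative only).
DOCS = {
    "d1": "Rabat hosted the summit in 1956.",
    "d2": "A history of the Great Maghreb.",
}

# Hypothetical partOf facts: child -> parent.
PART_OF = {"Rabat": "Morocco", "Morocco": "Great Maghreb"}


def keyword_search(phrase, docs):
    """Plain substring matching, as a purely text-based engine would do."""
    return sorted(d for d, text in docs.items() if phrase.lower() in text.lower())


def parts_of(region):
    """Return the region plus everything that is transitively partOf it."""
    found = {region}
    changed = True
    while changed:
        changed = False
        for child, parent in PART_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def part_of_search(region, docs):
    """Expand the query with partOf knowledge before matching."""
    terms = parts_of(region)
    return sorted(d for d, text in docs.items()
                  if any(t.lower() in text.lower() for t in terms))


def interval_search(first_year, last_year, docs):
    """Match documents that mention any four-digit year inside the interval."""
    hits = []
    for d, text in docs.items():
        years = [int(y) for y in re.findall(r"\b\d{4}\b", text)]
        if any(first_year <= y <= last_year for y in years):
            hits.append(d)
    return sorted(hits)
```

Here `keyword_search("Great Maghreb", DOCS)` misses the document about Rabat, while `part_of_search` reaches it through the partOf chain; likewise the literal phrase "XX century" matches nothing, while `interval_search(1901, 2000, DOCS)` finds the document that mentions 1956.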

6

Limitations of classical approaches

Although most present systems can successfully handle the various inflected forms of words using stemming algorithms, it seems that the many heuristics and ranking formulas based on text statistics that classical IR research has developed over the last decades cannot master the other issues mentioned. One reason is that term co-occurrence, which most statistical methods use to measure the strength of the semantic relation between words, is not valid from a linguistic-semantic point of view.

8

Limitations of classical approaches

Besides statistics based on term co-occurrence, another way to improve search effectiveness is to incorporate background knowledge into the search process. So far, the IR community has concentrated on background knowledge expressed in the form of thesauri. A thesaurus defines a set of standard terms that can be used to index and search a document collection (a controlled vocabulary), together with a set of linguistic relations between those terms. Thesauri thus promise a solution for the vagueness of natural language and, partially, for the problem of high-level concepts.

9

Limitations of classical approaches

While intuitively one would expect to see significant gains in retrieval effectiveness with the use of thesauri, experience shows that this is usually not the case.

One major cause is the “noise” in the relations between thesaurus terms. Linguistic relations such as synonymy are normally valid only between specific meanings of two words, but thesauri represent those relations on a syntactic level.

Another big problem is that creating thesauri manually and annotating documents with thesaurus terms is very expensive. As a result, annotations are often incomplete or erroneous, which decreases search performance.

10

Use of Ontology

Ontologies form the basic infrastructure of the Semantic Web.

By ontology we mean any formalism with a well-defined mathematical interpretation that is capable of representing at least a subconcept taxonomy, concept instances and user-defined relations between concepts.
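
As a minimal sketch of the three ingredients named in this definition (a subconcept taxonomy, concept instances and user-defined relations), the toy structure below may help. The class name `Ontology`, its fields and the sample facts are assumptions made for illustration; real ontology languages offer far richer semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology: taxonomy, instances and user-defined relations."""
    parents: dict = field(default_factory=dict)      # concept -> superconcept
    instance_of: dict = field(default_factory=dict)  # instance -> concept
    relations: set = field(default_factory=set)      # (subject, name, object)

    def is_subconcept(self, concept, ancestor):
        """True if `concept` equals `ancestor` or lies below it in the taxonomy."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parents.get(concept)
        return False

    def instances(self, concept):
        """All instances of `concept`, including those of its subconcepts."""
        return sorted(i for i, c in self.instance_of.items()
                      if self.is_subconcept(c, concept))

    def related(self, subject, name):
        """Objects linked to `subject` by the relation called `name`."""
        return sorted(o for s, n, o in self.relations if s == subject and n == name)


# Hypothetical sample content.
onto = Ontology()
onto.parents.update({"City": "Place", "Country": "Place"})
onto.instance_of.update({"Rabat": "City", "Morocco": "Country"})
onto.relations.add(("Rabat", "partOf", "Morocco"))
```

With this content, `onto.instances("Place")` returns both Rabat and Morocco because the taxonomy is followed upwards, and `onto.related("Rabat", "partOf")` returns Morocco.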

11

Use of Ontology

Such formalisms allow a much more sophisticated representation of background knowledge than classical thesauri. They represent knowledge on the semantic level, i.e., they contain semantic entities (concepts, relations and instances) instead of simple words, which eliminates the noise in the relations mentioned above.

They also make it possible to specify custom semantic relations between entities and to store well-known facts and axioms about a knowledge domain (including temporal information).

12

Use of Ontology

On this basis, ontologies theoretically solve all of the problems of full-text search mentioned above. Unfortunately, ontologies and the semantic annotations that use them are hardly ever perfect, for the same reasons that were described for thesauri. Indeed, good-quality ontologies and semantic annotations are at present a very scarce resource.

13

State Of the Art

Ontologies as Background Knowledge to Explore Document Collections

Nathalie Aussenac-Gilles & Josiane Mothe

Institut de Recherche en Informatique de Toulouse

14

Ontologies as Background Knowledge to Explore Document Collections

An alternative way to go beyond bags of words could be to organise indexing terms into a more complex structure than "bags", such as a hierarchy or an ontology. Texts would be indexed by concepts that reflect their meaning rather than by words treated as character strings, with all the ambiguity that they convey.

Aussenac-Gilles and Mothe promote an approach in which information search and exploration take place in a domain-dependent semantic context. This context is described through its controlled vocabulary, organized along hierarchies that are all extracted from a single, unifying domain ontology. Each hierarchy reveals a given point of view on the domain, that is to say a dimension.

15


In this approach, the ontology and the derived hierarchies provide the query language for users. Users can browse the concept hierarchies and select the terms they want to add to their query, and they can also explore the information space from different points of view, through the domain vocabulary and its structure.

Given a domain, a user defines their own information space. It is composed of a selection of hierarchies, or dimensions, from the set of possible ones. This selection reflects the user's focus of interest and leads to the identification of the associated documents.

16


Dimensions and their visualization define a novel way to provide users with global views and knowledge of the document collection. A key component of this approach is that the domain ontology makes it possible to define a visual presentation of the entire collection, or of a sub-collection, based on multi-dimensional analysis, as is done in OLAP systems.
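
A minimal sketch of such a multi-dimensional view, assuming each document is indexed by one concept per dimension: documents are rolled up one level in each hierarchy and counted per pair of concepts. The hierarchies, the index and the function names are hypothetical and only illustrate the OLAP-style idea.

```python
from collections import Counter

# Two hypothetical hierarchies (dimensions): concept -> parent concept.
HIERARCHIES = {
    "place": {"Rabat": "Morocco", "Fes": "Morocco", "Madrid": "Spain"},
    "topic": {"Stemming": "NLP", "Parsing": "NLP", "Ranking": "IR"},
}

# Hypothetical index: document -> the concept it carries in each dimension.
DOC_INDEX = {
    "d1": {"place": "Rabat", "topic": "Stemming"},
    "d2": {"place": "Fes", "topic": "Parsing"},
    "d3": {"place": "Madrid", "topic": "Ranking"},
    "d4": {"place": "Rabat", "topic": "Ranking"},
}


def roll_up(dimension, concept):
    """Replace a concept by its parent in the given hierarchy, if it has one."""
    return HIERARCHIES[dimension].get(concept, concept)


def cross_tab(dim_a, dim_b, index):
    """Count documents per pair of rolled-up concepts, OLAP style."""
    counts = Counter()
    for concepts in index.values():
        cell = (roll_up(dim_a, concepts[dim_a]), roll_up(dim_b, concepts[dim_b]))
        counts[cell] += 1
    return counts


table = cross_tab("place", "topic", DOC_INDEX)
```

Each cell of `table` is the size of a sub-collection seen from two points of view at once, which is the kind of global overview a visual interface could display.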

19


Strengths:

With the help of the ontology, users should be able to express their needs more easily.

Documents can be seen under many dimensions (or points of view), which can be used to extract knowledge from their content.

For the document categorization task, a concept from an ontology can be viewed as a category.

Weaknesses:

Building an ontology is a complex and time-consuming task; experts (domain and ontology experts) often do it manually.

The evolution of domain knowledge is problematic: for example, new terms appear and other terms fall out of use.

20

State Of the Art

Ontological Profiles as Semantic Domain Representations

Geir Solskinnsbakk & Jon Atle Gulla

Norwegian University of Science and Technology

21


Ontologies for query disambiguation or reformulation seem more promising, though there is a fundamental problem with comparing ontology concepts with query or document terms. Concepts are abstract notions that are not necessarily linked to a particular term. Sometimes a number of terms refer to the same concept, and sometimes a specific term may be a realization of different concepts depending on the context.

Using conceptual structures to index or retrieve document text requires that there is something bridging the conceptual and real world.

Research indicates that ontologies are of little use if they are not aligned with the documents indexed by the search application.

22


Solskinnsbakk and Gulla present an ontology enrichment approach that both bridges the conceptual and real world and ensures that the ontology is well adapted to the documents at hand.

The idea is to provide contextual concept characterizations that reveal how the concepts are referred to semantically in the document collection.

23


An ontological profile is an extension of a domain ontology. The ontology is extended with semantically related terms. These terms are added as vectors for each of the concepts of the ontology.

This means that in the ontological profile each concept is associated with a vector of semantically related terms (concept vector). The terms are given weights to reflect the importance of the semantic relation between the concept and the terms.

The concept vectors typically contain terms that are synonyms to the concept.

25


The construction of these ontological profiles is based on three different aspects of the content of the documents used.

The first is that we apply statistical techniques, counting the frequency of the terms in the documents. Terms that co-occur with a concept more frequently are hypothesized to be more relevant to that concept than terms that co-occur with it less frequently.

The second is that we apply linguistic techniques, i.e. stemming, to collapse certain terms into a single form.

The third aspect is that we use a proximity analysis of the text. The assumption behind the proximity analysis is that the closer together terms are found in the text, the more semantically related they are.

27


We give the highest weight to terms found in the same sentence as the concept name phrase (the highest semantic coherence). Terms found in the same paragraph as the concept are given a lower weight than sentence terms, but a higher weight than terms found only in the same document.

The basis for the weight calculation is the term frequency for each term found in the relevant documents.

Applying the familiar tf*idf score to the frequencies we get closer to the final representation of the vectors. The idf factor gives more importance to terms that are found in few documents across the document collection.

28


$$vf_{i,j} = a \sum_{k \in D} f_{i,k} + b \sum_{k \in P} f_{i,k} + c \sum_{k \in S} f_{i,k}$$

where $vf_{i,j}$ is the term frequency for term $i$ in concept vector $j$, $f_{i,k}$ is the term frequency for term $i$ in document vector $k$, $D$, $P$, and $S$ are the possibly empty sets of relevant documents, paragraph documents and sentence documents assigned to $j$, and $a = 0.1$, $b = 1.0$, and $c = 10.0$ are the constant modifiers for documents, paragraph documents, and sentence documents, respectively.

29


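$$tfidf_{i,j} = \frac{vf_{i,j}}{\max_{l} vf_{l,j}} \cdot \log \frac{N}{n_i}$$

where $tfidf_{i,j}$ is the tfidf score for term $i$ in concept vector $j$, $vf_{i,j}$ is the term frequency for term $i$ in concept vector $j$, $\max_{l} vf_{l,j}$ is the frequency of the most frequently occurring term in concept vector $j$, $N$ is the number of concept vectors, and $n_i$ is the number of concept vectors containing term $i$.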
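
The two steps can be sketched in code: first a weighted sum of term frequencies over the documents, paragraphs and sentences assigned to a concept, then a tf*idf score normalised by the most frequent term of the vector. The function names, the toy tokens, the natural logarithm and the default modifiers (0.1, 1.0 and 10.0, an assumed reading of the constants on the slide) are all assumptions of this sketch.

```python
import math
from collections import Counter


def concept_term_frequency(doc_hits, par_hits, sent_hits, a=0.1, b=1.0, c=10.0):
    """Weighted term frequencies for one concept vector.

    Each argument is a list of token lists: the documents, paragraphs and
    sentences assigned to the concept. The modifiers a, b, c are assumed values.
    """
    vf = Counter()
    for weight, units in ((a, doc_hits), (b, par_hits), (c, sent_hits)):
        for tokens in units:
            for term, freq in Counter(tokens).items():
                vf[term] += weight * freq
    return vf


def tfidf_vectors(concept_vfs):
    """Turn {concept: vf Counter} into {concept: {term: tf*idf score}}."""
    n_vectors = len(concept_vfs)
    # n[term] = number of concept vectors that contain the term.
    n = Counter(term for vf in concept_vfs.values() for term in vf)
    result = {}
    for concept, vf in concept_vfs.items():
        top = max(vf.values(), default=0.0)
        result[concept] = {
            term: (freq / top) * math.log(n_vectors / n[term])
            for term, freq in vf.items()
        }
    return result


# Hypothetical concepts and tokens.
vfs = {
    "Well": concept_term_frequency(
        doc_hits=[["well", "drill", "rig"]],
        par_hits=[["well", "drill"]],
        sent_hits=[["drill"]],
    ),
    "Pump": concept_term_frequency(
        doc_hits=[["pump", "rig"]], par_hits=[], sent_hits=[["pump"]]
    ),
}
profiles = tfidf_vectors(vfs)
```

In this toy run, "rig" occurs in both concept vectors, so its idf factor is zero and it drops out of both profiles, while "drill", seen at sentence level, dominates the "Well" vector.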

30


Strengths:

The ontological profile approach is used as a tool for the semantic reformulation of queries on top of a standard vector-space search engine (Apache Lucene), with the reformulated query submitted to the index. This lets the system hide from the user the fact that an ontology is used; the user only has to enter familiar keyword queries.

Weaknesses:

In this approach, the concept name is used as a phrase query against the three indexes, and all documents containing the phrase are assigned to the concept as relevant. This poses a challenge: some concept names are artificial in their construction, or do not occur in the text in the form given in the ontology. As a result, many concepts are not found when documents are assigned to concepts.

31

State Of the Art

An Ontology-Based Information Retrieval Model

David Vallet, Miriam Fernández & Pablo Castells

Universidad Autónoma de Madrid

32


Vallet, Fernández and Castells propose an ontology-based retrieval model meant for the exploitation of full-fledged domain ontologies and knowledge bases, to support semantic search in document repositories. In contrast to Boolean semantic search systems, this approach returns full documents, rather than specific ontology values from a KB, in response to user information needs. The search system takes advantage of both the detailed instance-level knowledge available in the KB and topic taxonomies for classification.

This approach includes an ontology-based scheme for the semi-automatic annotation of documents, and a retrieval system. The retrieval model is based on an adaptation of the classic vector-space model, including an annotation weighting algorithm, and a ranking algorithm.

34


The system requires that the knowledge base be constructed from three main base classes: DomainConcept, Taxonomy, and Document.

DomainConcept should be the root of all domain classes that can be used (directly or after subclassing) to create instances that describe specific entities referred to in the documents.

Document is used to create instances that act as proxies of documents from the information source to be searched.

Taxonomy is the root for class hierarchies that are merely used as classification schemes, and are never instantiated. These taxonomies are expected to be used as a terminology to annotate documents and concept classes, using them as values of dedicated properties.

35


The predefined base ontology classes described above are complemented with an annotation ontology that provides the basis for the semantic indexing of documents with non-embedded annotations.

Documents are annotated with concept instances from the KB by creating instances of the Annotation class, provided for this purpose. Annotation has two relational properties, instance and document, by which concepts and documents are related together. Reciprocally, DomainConcept and Document have a multivalued annotation property.

Annotations can be created manually by a domain expert, or semi-automatically. The subclasses ManualAnnotation and AutomaticAnnotation are used for these two cases, respectively.

36


DomainConcept instances use a label property to store the most usual text form of the concept class or instance. This property is multivalued, since instances may have several textual lexical variants.

Whenever the label of an instance is found in a document, an annotation is created between the instance and the document. Documents can also be annotated with classes, by assigning labels to concept classes.

The annotations are used by the retrieval and ranking module.
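
The label-matching step can be sketched as follows. The `Annotation` dataclass mirrors the instance and document properties described above, but the `occurrences` field, the `LABELS` table and the case-insensitive substring matching are assumptions of this sketch; substring matching is deliberately naive and would also match labels inside longer words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Links one KB instance to one document (non-embedded annotation)."""
    instance: str
    document: str
    occurrences: int


# Hypothetical KB: instance -> its label values (lexical variants).
LABELS = {
    "inst:Rabat": ["Rabat"],
    "inst:USMBA": ["USMBA", "Sidi Mohamed Ben Abdellah University"],
}


def annotate(doc_id, text, labels=LABELS):
    """Create an Annotation for every instance whose label occurs in the text."""
    lowered = text.lower()
    annotations = []
    for instance, variants in labels.items():
        # Count the occurrences of all lexical variants of the instance.
        count = sum(lowered.count(v.lower()) for v in variants)
        if count:
            annotations.append(Annotation(instance, doc_id, count))
    return annotations
```

For example, `annotate("d1", "USMBA is in Fes. USMBA has labs.")` yields a single annotation for `inst:USMBA` with two occurrences, a count that a weighting step can use afterwards.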

37


In the classic vector-space model, keywords appearing in a document are assigned weights reflecting that some words are better at discriminating between documents than others.

Similarly, in this approach, annotations are assigned a weight that reflects how relevant the instance is considered to be to the meaning of the document.

Weights are computed automatically by an adaptation of the TF-IDF algorithm based on the frequency of occurrence of the instances in each document.

38


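$$w_{i,j} = \frac{freq_{i,j}}{\max_{k} freq_{k,j}} \cdot \log \frac{N}{n_i}$$

where $w_{i,j}$ is the weight of instance $I_i$ for document $D_j$, $freq_{i,j}$ is the number of occurrences of $I_i$ in $D_j$, $\max_{k} freq_{k,j}$ is the frequency of the most repeated instance in $D_j$, $n_i$ is the number of documents annotated with $I_i$, and $N$ is the total number of documents in the search space.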
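
A minimal sketch of this weighting, assuming the usual TF-IDF shape: the occurrences of an instance in a document are divided by those of the most repeated instance in that document, and the result is multiplied by the logarithm of N over the number of documents annotated with the instance. The function name, the input layout and the natural logarithm are assumptions.

```python
import math
from collections import Counter


def annotation_weights(doc_instances):
    """Compute weights[instance][document] from instance occurrences.

    `doc_instances` maps each document id to the list of instance
    occurrences found in it (one list entry per occurrence).
    """
    n_docs = len(doc_instances)
    freqs = {doc: Counter(insts) for doc, insts in doc_instances.items()}
    # n[inst] = number of documents annotated with the instance.
    n = Counter(inst for counts in freqs.values() for inst in counts)
    weights = {}
    for doc, counts in freqs.items():
        top = max(counts.values(), default=0)
        for inst, freq in counts.items():
            idf = math.log(n_docs / n[inst])
            weights.setdefault(inst, {})[doc] = (freq / top) * idf
    return weights


# Hypothetical occurrences.
weights = annotation_weights({
    "d1": ["Rabat", "Rabat", "Morocco"],
    "d2": ["Morocco"],
    "d3": ["Fes"],
})
```

In this toy collection, "Morocco" annotates two of the three documents and therefore gets a lower weight than "Rabat", which annotates only one.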

39


The system takes as input a formal RDQL query. This query could be generated from a keyword query, a natural language query, a form-based interface where the user can explicitly select ontology classes and enter property values, or more sophisticated search interfaces.

The RDQL query is executed against the knowledge base, which returns a list of instance tuples that satisfy the query. The documents annotated with these instances are then retrieved, ranked, and presented to the user.
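
The retrieval step after query execution can be sketched as follows, starting from the instances returned by the knowledge base (the RDQL execution itself is not shown). Scoring a document by the sum of its matching annotation weights is an assumed simplification for illustration, not the ranking algorithm of the model; the `WEIGHTS` table is hypothetical.

```python
def rank_documents(result_instances, weights):
    """Rank documents annotated with any instance returned by the query.

    `weights` maps instance -> {document: annotation weight}. Summing the
    matching weights is an assumed, simplified scoring rule.
    """
    scores = {}
    for inst in result_instances:
        for doc, w in weights.get(inst, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical annotation weights.
WEIGHTS = {
    "Rabat": {"d1": 1.1, "d4": 0.4},
    "Morocco": {"d1": 0.2, "d2": 0.4},
}
ranking = rank_documents(["Rabat", "Morocco"], WEIGHTS)
```

Only documents that carry an annotation for at least one returned instance appear in `ranking`, which is why the quality of the annotations bounds the quality of the results.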

41


Strengths:

Better recall when querying for class instances and when using class hierarchies and rules.

Better precision through query weights and structured semantic queries.

Weaknesses:

The degree of improvement of this semantic retrieval model depends on the completeness and quality of the ontology, the KB, and the concept labels.

42

State Of the Art

Improving information retrieval effectiveness by using domain knowledge stored in ontologies

Gabor Nagypal

University of Karlsruhe, Germany

43


The quality of the results that traditional full-text search engines provide is still not optimal for many types of user queries. In particular, the vagueness of natural language, abstract concepts, semantic relations and temporal issues are handled inadequately by full-text search. Ontologies and semantic metadata can provide a solution for these problems.

The goal of this thesis is to examine and validate whether and how ontologies can help improve retrieval effectiveness in information systems, considering the inherent imperfection of ontology-based domain models and annotations.

This work examines how ontologies can be optimally exploited during the information retrieval process, and proposes a general framework which is based on ontology-supported semantic metadata generation and ontology-based query expansion.

44


This research evaluates the following hypotheses :

Ontologies allow domain knowledge to be stored in a much more sophisticated form than thesauri. We therefore assume that using ontologies in IR systems yields a measurable, significant gain in retrieval effectiveness.

The better (more precise) an ontology models the application domain, the more gain is achieved in retrieval effectiveness.

It is possible to diminish the negative effect of ontology imperfection on search results by combining different ontology-based heuristics during the search process.

It is well known that there is a trade-off between algorithm complexity and performance. This also holds for ontologies. Still, the assumption of this approach is that by combining ontologies with traditional IR methods, it is possible to provide results with acceptable performance.

45


Background knowledge stored in the form of ontologies can be used at practically every step of the IR process.

In this work, solutions are therefore provided for ontology-based query extension, ontology-supported query formulation and ontology-supported metadata generation (indexing).

This leads to a conceptual system architecture in which the Ontology Manager component plays a central role and is used extensively by the Indexer, Search Engine and GUI components.

47


The information model defines how documents and the user query are represented in the system. The model used in this work represents the content of a resource as a weighted set of instances (a bag of ontology instances) from a suitable domain ontology (the conceptual part), together with a weighted set of temporal intervals (the temporal part).

The representation of the conceptual part is practically identical with the information model used by classical IR engines built on the vector space model, with the difference that vector terms are ontology instances instead of words in a natural language.

48


Time, as a continuous phenomenon, has different characteristics from the discrete conceptual part of the information model. The first question regarding time is how to define similarity between weighted sets of time intervals.

A possible solution under consideration is the temporal vector space model. Its main idea is that if we choose a discrete time representation, the lowest-level granules can be viewed as terms, so the vector space model also applies to the time dimension.
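
The idea of treating time granules as vector terms can be sketched as follows. Years are used as granules here purely to keep the example small, and cosine similarity is assumed as the vector-space comparison; the intervals and function names are illustrative.

```python
import math
from collections import Counter


def granules(intervals):
    """Expand weighted (start, end, weight) year intervals into per-year weights."""
    vector = Counter()
    for start, end, weight in intervals:
        for year in range(start, end + 1):
            vector[year] += weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = granules([(1901, 2000, 1.0)])  # a query about the "XX century"
doc_a = granules([(1945, 1956, 1.0)])  # interval inside the query interval
doc_b = granules([(1789, 1799, 1.0)])  # interval with no overlap
```

Here `cosine(query, doc_a)` is positive because the two intervals share twelve year granules, while `cosine(query, doc_b)` is zero. The number of entries in each vector grows with the length of the interval and the fineness of the granule.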

49


During query formulation, the ontology is used only to disambiguate queries specified in textual form. Classical full-text search is run on the ontology labels, so users only have to choose the proper interpretation of each term.

The query process applies various ontology-based heuristics one by one to create separate queries, which are executed independently using a traditional full-text engine. The ranked results are then combined to form the final ranked result list. The combination is based on the belief network model, which allows various pieces of evidence to be combined using Bayesian inference.
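
As a rough illustration of combining evidence from several independently executed queries, the sketch below merges result lists with a noisy-OR rule. This is a simple stand-in chosen for the example, not the belief network model itself; it assumes each heuristic returns scores in [0, 1], and the two result lists are hypothetical.

```python
def combine_noisy_or(result_lists):
    """Merge several {document: score in [0, 1]} results with a noisy-OR.

    A document's combined score is 1 minus the product of (1 - score)
    over all result lists; a missing document counts as score 0.
    """
    docs = set().union(*result_lists) if result_lists else set()
    combined = {}
    for doc in docs:
        miss = 1.0
        for results in result_lists:
            miss *= 1.0 - results.get(doc, 0.0)
        combined[doc] = 1.0 - miss
    # Highest combined score first; ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results of two heuristics.
exact_match = {"d1": 0.5, "d2": 0.2}
expanded_query = {"d1": 0.5, "d3": 0.4}
final = combine_noisy_or([exact_match, expanded_query])
```

A document supported by both heuristics (d1) ends up above documents supported by only one, which is the qualitative behaviour expected when independent pieces of evidence are combined.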

51


Strengths:

This work validates that the proposed solution significantly improves the retrieval effectiveness of information systems, and thus provides strong motivation for developing ontologies and semantic metadata.

The gradual approach described allows a smooth transition from classical text-based systems to ontology-based ones.

Weaknesses:

A problem with the temporal vector space approach is the potentially huge number of time granules generated for long time intervals. For example, representing the existence time of a concept such as the Middle Ages potentially requires many tens of thousands of terms if days are used as granules.
