Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

21
Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation Panos Alexopoulos, Carlos Ruiz, Jose Manuel Gomez Perez 1st Semantic Web and Information Extraction Workshop, Galway, Ireland, October 9th, 2012

description

 

Transcript of Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

Page 1: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

Scenario-Driven Selection and Exploitation ofSemantic Data for Optimal Named Entity

Disambiguation

Panos Alexopoulos, Carlos Ruiz, Jose Manuel Gomez Perez

1st Semantic Web and Information Extraction Workshop,

Galway, Ireland, October 9th, 2012

Page 2: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

2

Introduction Problem Definition and Paper

Focus Approach Overview and Rationale

Proposed Disambiguation Framework Disambiguation Evidence Model Entity Disambiguation Process

Framework Evaluation Evaluation Process Evaluation Results

Conclusions and Future Work

Agenda

Page 3: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

3

Problem Definition

Introduction

●Named entity resolution involves detecting mentions of named entities (e.g. people, organizations or locations) within texts and mapping them to their corresponding entities in a given knowledge source.

●One important challenge in this task is the correct disambiguation of the detected entities.

●For example:

● “Siege of Tripolitsa took place in Tripoli with Theodoros Kolokotronis being the leader of the Greeks. This event marked an early victory for the fight for independence from Turkey but it was also a massacre against the Muslim and Jewish population of the city”.

● The term “Tripoli” here refers to http://dbpedia.org/resource/Tripoli,_Greece but it can be mistaken, for example, with Tripoli in Libya or that in Lebanon.

Entity Resolution & Disambiguation

Page 4: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

4

Disambiguation Approaches

Introduction

●The majority of disambiguation approaches rely on the strong contextual hypothesis that terms with similar meanings are often used in similar contexts.

●The role of these contexts is typically played by already annotated documents (e.g. wikipedia articles) which are used to train term classifiers.

●These classifiers link a term to its correct meaning entity, based on the similarity between the term’s textual context and the contexts of its potential entities.

●Some more recent approaches utilize semantic structures in order to determine this similarity in a semantic way.

●The effectiveness of these latter approaches is highly dependent on:

● The availability of comprehensive semantic information.

● The degree of alignment between the content of the texts to be disambiguated and the semantic data to be used.

Page 5: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

5

Alignment and its Importance

Introduction

●Alignment means that the ontology’s elements should cover the domain(s) of the texts to be disambiguated but should not contain other additional elements that: ●Do not belong to the domain.●Do belong to it but do not appear in the texts.

●For example assume the text “Ronaldo scored two goals for Real Madrid“ from a contemporary football match description.

●To disambiguate the term “Ronaldo” using an ontology, the only contextual evidence that can be used is the entity “Real Madrid”.

●Yet there are two players with that name that are semantically related to Real:● Cristiano Ronaldo (current player) ● Ronaldo Luis Nazario de Lima (former player).

●This means that if both relations are considered then the term will not be disambiguated.

Page 6: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

6

Towards Better Alignment

Introduction

● In the previous example the fact that the text describes a contemporary football match suggests that, in general, the relation between a team and its former players is not expected to appear in it.

●Thus, for such texts, it would make sense to ignore this relation in order to achieve more accurate disambiguation.

●Based on this observation, we make two claims:

● That there are certain scenarios where there is available a priori knowledge about what entities and relations are expected to be present in the text.

● That this knowledge can be exploited for better alignment between semantic information and content leading to more effective disambiguation.

●To verify these claims we define an entity disambiguation framework that can perform better disambiguation in such scenarios.

Page 7: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

7

Approach

Proposed Framework

●We target the task of entity disambiguation based on the intuition that a given ontological entity is more likely to represent the meaning of an ambiguous term when there are many ontologically related to it entities in the text.

●E.g. in the example text the entities “Siege of Tripolitsa” and “Theodoros Kolokotronis” indicate that the term “Tripoli” refers to the city of Greece.

●These evidential entities are derived from one or more domain ontologies.

●However, which entities and to what extent may serve as evidence in a given application scenario depends on the domain and expected content of the texts.

●For that, the key ability our framework provides to its users is to construct, in a semi-automatic manner, semantic evidence models for specific disambiguation scenarios and use them to perform entity disambiguation within them.

Page 8: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

8

Framework Components

Proposed Framework

●A Disambiguation Evidence Model that contains the semantic entities that may serve as disambiguation evidence for the scenario’s target entities in the given scenario.

● Each pair of a target entity and an evidential one is accompanied by a degree that quantifies the latter’s evidential power for the given target entity.

●A Disambiguation Evidence Model Construction Process that builds, in a semi-automatic manner, a disambiguation evidence model for a given scenario.

●An Entity Disambiguation Process that uses the evidence model to detect and extract from a given text terms that refer to the scenario’s target entities.

● Each term is linked to one or more possible entity uris along with a confidence score calculated for each of them.

● The entity with the highest confidence should be the one the term actually refers to.

Page 9: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

9

Disambiguation Evidence Model

Proposed Framework

●Defines for each ontology entity which other instances and to what extent should be used as evidence towards its correct meaning interpretation.

●It consists of entity pairs where a particular entity provides quantified evidence for a another one.

Page 10: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

10

Evidence Model Construction

Proposed Framework

●Construction of the evidence model depends on the characteristics of the domain and the texts.

●The first step of the construction is manual and involves:

● The identification of the concepts whose instances we wish to disambiguate (e.g. locations)

● The determination, for each of these concepts, of the related to them concepts whose instances may serve as contextual disambiguation evidence/

●For example, in texts that describe historical events, some concepts whose instances may act as location evidence are related locations, historical events, and historical groups and persons.

● The identification, for each pair of evidence and target concept, of the relation paths that links them.

Page 11: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

11

Evidence Model Construction

Proposed Framework

●The result of this first step is a table like the following ones:

Page 12: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

12

Evidence Model Construction

Proposed Framework

●Based on these tables, the second step of the construction is automatic and involves the generation of the target-evidence entity pairs along with a disambiguation evidential strength.

●This strength is inversely proportional to the number of different same-name target entities a given evidential entity provides evidence for.

●For example, “Getafe” provides evidence for “Pedro Leon” to a strength of 0.5 because it has another player called Pedro.

Page 13: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

13

Entity Resolution Process

Proposed Framework

●Step 1: We extract from the text the terms that possibly refer to the target entities as well as those that refer to their respective evidential entities.

● Extraction is performed with Knowledge Tagger, an in-house tool based on GATE.

●Step 2: Using the evidential entities we compute for each extracted target entity term the confidence that it refers to a particular target entity.

● The target entity with the highest confidence is expected to be the correct one.

Page 14: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

14

Disambiguation Example

Proposed Framework

Correct disambiguation of the term Atletico

Page 15: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

15

Evaluation Process

Framework Evaluation

●Two disambiguation scenarios:● Football match descriptions.● Texts describing military conflicts.

●DBPedia as a source of semantic information in both cases.

●Disambiguation effectiveness measured through precision and recall.

●Evaluation results were compared to those achieved by two publicly available semantic annotation and disambiguation systems;

● DBPedia Spotlight● AIDA

●The two systems:● Use also DBPedia as a knowledge source.● Provide the users the capability to select the classes whose instances are to

be included in the process.

Description

Page 16: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

16

Evaluation Results

Framework Evaluation

●50 texts describing football matches.

●E.g. “It's the 70th minute of the game and after a magnificent pass by Pedro, Messi managed to beat Claudio Bravo. Barcelona now leads 1-0 against Real."

Football Match Descriptions Scenario

Disambiguation Results

Page 17: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

17

Evaluation Results

Framework Evaluation

●50 historical texts describing military conflicts.

●E.g. “The Siege of Augusta was a significant battle of the American Revolution. Fought for control of Fort Cornwallis, a British fort near Augusta, the battle was a major victory for the Patriot forces of Lighthorse Harry Lee and a stunning reverse to the British and Loyalist forces in the South”.

Military Conflict Texts Scenario

Disambiguation Results

Page 18: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

18

Key Points

Conclusions and Future Work

●We proposed a novel framework for optimizing named entity disambiguation in well-defined and adequately constrained scenarios through the customized selection and exploitation of semantic data.

●Our purpose was not to build another generic disambiguation system but rather a reusable framework that can:

● Be relatively easily adapted to the particular characteristics of the domain and application scenario at hand.

● Exploit these characteristics to increase the effectiveness of the disambiguation process.

●The key aspect of the framework is the semi-automatic process it defines for selecting the optimal evidence model for the scenario at hand.

Page 19: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

19

Key Points

Conclusions and Future Work

●Comparative evaluation in two specific scenarios verified the framework’s superiority over existing approaches that are designed to work in open domains and unconstrained scenarios.

●This verified our hypothesis that the scenario adaptation capabilities of such generic disambiguation systems can be inadequate in certain scenarios.

●Of course, the framework’s usability and effectiveness is directly proportional to the content specificity of the texts to be disambiguated and the availability and quality of a priori semantic knowledge about their content.

● The greater these two parameters are, the more applicable is our approach and the more effective the disambiguation is expected to be.

● The opposite is true as the texts become more generic and the information we have out about them more scarce.

Page 20: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

20

Framework Extensions

Conclusions and Future Work

●Fully automated construction of the disambiguation evidence model.

● Challenge here is how to automatically identify the text’s domain/topic.

●Combination with statistical methods for cases where available domain semantic information is incomplete.

● Challenge here is how to select the optimal ratio of ontological evidence v.s. statistical one.

●Development of tool to enable users to dynamically build such models out of existing semantic data and use them for disambiguation purposes

Page 21: Scenario-Driven Selection and Exploitation of Semantic Data for Optimal Named Entity Disambiguation

21

Contact iSOCO

Thank you!

Questions?

BarcelonaTel +34 935 677 200Edificio Testa AC/ Alcalde Barnils, 64-68 St. Cugat del Vallès08174 Barcelona

ValenciaTel +34 963 467 143Oficina 107C/ Prof. Beltrán Báguena, 446009 Valencia

PamplonaTel +34 948 102 408Parque Tomás Caballero, 2, 6º-4ª31006 Pamplona

Dr. Panos AlexopoulosSenior Researcher

[email protected]

MadridTel +34 913 349 797Av. del Partenón, 16-18, 1º7ªCampo de las Naciones28042 Madrid