Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia
André Freitas, Danilo Carvalho, J. C. P. da Silva, Sean O’Riain, Edward Curry
Outline
- Motivation
- Representation
  - Requirements
  - Semantic Best-effort Representation
- Extraction
  - Graphia Extractor
  - Preliminary Evaluation
  - Extraction Examples
- Conclusion
Motivation
Linked Data
- Terminological and structural regularity
- Shared semantic agreement between data consumers
Natural language texts
- No terminological or structural regularity
- Highly contextualized
- Complex semantic dependency relations
- Ambiguity
- Information selection/normalization
Fewer vocabulary constraints + entity-centric + pay-as-you-go data semantics = semantic best-effort
Motivation
Vocabulary-independent (schema-free) queries
- How to abstract users from knowing the data representation?
- Semantic matching
Schemaless databases, in the limit, demand vocabulary independence
How is information extraction reshaped in this scenario?
Motivational Scenario
What is the relationship between Barack Obama and Indonesia?
Sentence: From age six to ten, Obama attended local schools in Jakarta, including Besuki Public School and St Francis Assisi School.
Semantic Best-effort Extraction
Entity-centric text representation
Representation
Computational Linguistics Perspective
What is already there to represent natural language (NL)?
- Discourse Representation Theory (DRT)
- Semantic Role Labeling (SRL)
Discourse Representation Theory (DRT)
“The key idea behind (...) Discourse Representation Theory is that each new sentence of a discourse is interpreted in the context provided by the sentences preceding it.” (van Eijck and Kamp)
- Models propositions in discourse (multiple sentences).
- Discourse representation structures (DRS).
Example: John enters a card. Every card is green.
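To make the idea of interpreting each sentence in the context of the preceding ones concrete, here is a minimal sketch of a DRS as a pair of discourse referents and conditions, applied to the two example sentences. The class names (`DRS`, `Implication`), the `interpret` merge step, and the box notation printed by `render` are illustrative assumptions, not part of the slides or of any DRT toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class DRS:
    """A discourse representation structure: referents plus conditions."""
    referents: List[str] = field(default_factory=list)
    conditions: List["Condition"] = field(default_factory=list)


@dataclass
class Implication:
    """A complex condition, used here for the universal 'every'."""
    antecedent: DRS
    consequent: DRS


# A condition is either a predicate tuple such as ("card", "y") or an implication.
Condition = Union[Tuple[str, ...], Implication]


def interpret(context: DRS, sentence: DRS) -> DRS:
    """Merge a sentence DRS into the context built from preceding sentences."""
    return DRS(context.referents + sentence.referents,
               context.conditions + sentence.conditions)


def render(drs: DRS) -> str:
    """Linear box notation: [referents | conditions]."""
    parts = []
    for cond in drs.conditions:
        if isinstance(cond, Implication):
            parts.append(f"{render(cond.antecedent)} => {render(cond.consequent)}")
        else:
            parts.append(f"{cond[0]}({', '.join(cond[1:])})")
    return f"[{' '.join(drs.referents)} | {', '.join(parts)}]"


# "John enters a card."
s1 = DRS(["x", "y"], [("John", "x"), ("card", "y"), ("enter", "x", "y")])
# "Every card is green."
s2 = DRS([], [Implication(DRS(["z"], [("card", "z")]),
                          DRS([], [("green", "z")]))])

context = interpret(interpret(DRS(), s1), s2)
print(render(context))
```

The second sentence adds no top-level referents; it only contributes an implication between two nested boxes, which is how the universal quantifier is usually drawn in DRS diagrams.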
Semantic Role Labeling (SRL)
- Shallow semantic parsing.
- Detection of arguments associated with a predicate.
- Associates semantic types with arguments.
Example: Bill cut his hair with a razor.
[Agent Bill] cut [Patient his hair] [Instrument with a razor].
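The bracketed example can be held in a very small data structure. The sketch below stores the sentence as a list of (role, text) spans, where the predicate carries no role, and regenerates the bracket notation from it. The span layout and the function names `render` and `arguments` are assumptions made for illustration; the role labels are hand-assigned, not produced by an SRL system.

```python
from typing import Dict, List, Optional, Tuple

# (role, text); role is None for the predicate itself.
Span = Tuple[Optional[str], str]

labeled: List[Span] = [
    ("Agent", "Bill"),
    (None, "cut"),
    ("Patient", "his hair"),
    ("Instrument", "with a razor"),
]


def render(spans: List[Span]) -> str:
    """Rebuild the bracket notation used on the slide."""
    return " ".join(f"[{role} {text}]" if role else text for role, text in spans)


def arguments(spans: List[Span]) -> Dict[str, str]:
    """Map each semantic role to the argument text it labels."""
    return {role: text for role, text in spans if role}


print(render(labeled))
print(arguments(labeled))
```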
Semantic Best-Effort
Objectives:
- Entity-centric & standardized: easier to integrate with other resources
- Remove the formal constraints and the ‘baggage’ from existing approaches
- Representation robust to extraction limitations/errors
Semantic Best-Effort Requirements
- Text segmentation into (s,p,o)s
- Context representation
- Conceptual model independence
- Resolve co-references (pay-as-you-go)
- Represent recurrent discourse structures
- Standardized representation (RDF(S))
- Principled interpretation (compositionality)
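As a hand-written illustration of the first two requirements (segmentation into (s,p,o)s and context representation), the sketch below segments the sentence from the motivational scenario and attaches its qualifiers to the main statement through reification. This is not output of the extraction tool: the `Statement` class, the choice of predicates, and the blank-node naming (`_:stmt0`) are assumptions, and plain tuples stand in for RDF(S) terms.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Statement:
    """One (s,p,o) segment plus the context that qualifies it."""
    subject: str
    predicate: str
    obj: str
    context: List[Tuple[str, str]] = field(default_factory=list)


# "From age six to ten, Obama attended local schools in Jakarta,
#  including Besuki Public School and St Francis Assisi School."
statements = [
    Statement("Obama", "attended", "local schools",
              [("in", "Jakarta"), ("from age", "six"), ("to age", "ten")]),
    Statement("local schools", "including", "Besuki Public School"),
    Statement("local schools", "including", "St Francis Assisi School"),
]


def reify(stmts: List[Statement]) -> List[Triple]:
    """Give each statement its own node so context can point at the whole triple."""
    triples: List[Triple] = []
    for i, s in enumerate(stmts):
        node = f"_:stmt{i}"
        triples.append((node, "subject", s.subject))
        triples.append((node, "predicate", s.predicate))
        triples.append((node, "object", s.obj))
        triples.extend((node, p, o) for p, o in s.context)
    return triples


for triple in reify(statements):
    print(triple)
```

Keeping the qualifiers on the statement node, rather than on "Obama" or "local schools", is what lets "in Jakarta" and the age range apply to the act of attending.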
Examples
- Text segmentation into (s,p,o)s
- Context representation
- Resolve co-references (pay-as-you-go)
- Conceptual model independence
- Represent recurrent discourse structures
SDG Elements
- Named, non-named entities and properties
- Quantifiers & operators
- Triple Trees
- Context elements
- Co-Referential elements
- Resolved & normalized entities
Graph Patterns
[[Interpretation]]
Graph traversal – deref sequence
Extraction
SBE Graph Extraction Tool
Extraction Pipeline Architecture
- Subject
- Predicate
- Object
- Prepositional phrase & Noun complement
- Reification
- Time
Preliminary Evaluation
- 1033 relations (triples) extracted from 150 sentences in 5 randomly selected Wikipedia articles
- The graphs were manually classified by error category and accuracy
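The sample sizes above imply two simple averages, worked out in the sketch below. It also includes a small helper showing how accuracy and per-category error counts could be tallied from manual labels. The helper name `summarize` and the label `"correct"` are hypothetical; the slides do not list the error categories or the resulting figures, so no label data is filled in here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

RELATIONS, SENTENCES, ARTICLES = 1033, 150, 5

triples_per_sentence = RELATIONS / SENTENCES    # average yield per sentence
sentences_per_article = SENTENCES / ARTICLES    # average sample per article


def summarize(labels: Iterable[str]) -> Tuple[float, Dict[str, int]]:
    """Return (accuracy, error counts) for manually assigned labels.

    'correct' is a placeholder label; every other label counts as an
    error category.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    accuracy = counts["correct"] / total if total else 0.0
    errors = {name: n for name, n in counts.items() if name != "correct"}
    return accuracy, errors


print(f"{triples_per_sentence:.2f} triples per sentence")
print(f"{sentences_per_article:.0f} sentences per article on average")
```

So the extractor produced roughly 6.89 triples per sentence over an average of 30 sentences per article.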
Other Extraction Examples
Conclusion
- Main direction for improvement is completeness
  - Aligned with the pay-as-you-go scenario
- Still need to define clear criteria for what cannot be extracted
- There is still a long way to go (e.g. complex subordination)
- Investigation using existing n-ary relation patterns
- Context (reification) should be a first-class citizen in the representation of natural language
- Focus on getting the semantic pivots (rigid designators) right
- Worth putting effort into enumerable patterns (timestamps, operators)
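Timestamps are a good example of an enumerable pattern: a small, closed set of surface forms covers many cases. The sketch below matches two such forms, a bare four-digit year and "Month D, YYYY", using a regular expression. The pattern, the function name `find_dates`, and the sample sentence are assumptions for illustration and say nothing about how the extractor itself handles time.

```python
import re
from typing import Dict, List, Optional

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Optional "Month D, " prefix followed by a mandatory four-digit year.
DATE = re.compile(
    r"\b(?:(?P<month>" + "|".join(MONTHS) + r") (?P<day>\d{1,2}), )?"
    r"(?P<year>\d{4})\b"
)


def find_dates(text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per match; month and day are None for bare years."""
    return [m.groupdict() for m in DATE.finditer(text)]


sample = "Obama was born on August 4, 1961 and elected in 2008."
for date in find_dates(sample):
    print(date)
```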