Representing Texts as contextualized Entity Centric Linked Data Graphs

54
, Representing Texts as Contextualized Entity-Centric Linked Data Graphs Andr´ e Freitas 1 Jo˜ ao C. P. da Silva 2 Danilo S. Carvalho 2 Se´ an O’Riain 1 Edward Curry 1 1 DERI - Ireland 2 Universidade Federal Rio de Janeiro - Brazil August 27, 2013 Freitas, Silva, Carvalho, O’Riain, Curry 1/ 54

description

The integration of a small fraction of the information present in the Web of Documents to the Linked Data Web can provide a significant shift on the amount of information available to data consumers. However, information extracted from text does not easily fit into the usually highly normalized structure of ontology-based datasets. While the representation of structured data assumes a high level of regularity, relatively simple and consistent conceptual models, the representation of information extracted from texts need to take into account large terminological variation, complex contextual/dependency patterns, and fuzzy or conflicting semantics. This work focuses on bridging the gap between structured and unstructured data, proposing the representation of text as structured discourse graphs (SDGs), targeting an RDF representation of unstructured data. The representation focuses on a semantic best-effort information extraction scenario, where information from text is extracted under a pay-as-you-go data quality perspective, trading terminological normalization for domain-independency, context capture, wider representation scope and maximization of textual information capture.

Transcript of Representing Texts as contextualized Entity Centric Linked Data Graphs

Page 1: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Representing Texts as Contextualized

Entity-Centric Linked Data Graphs

Andre Freitas1 Joao C. P. da Silva2 Danilo S. Carvalho2

Sean O’Riain1 Edward Curry1

1DERI - Ireland2Universidade Federal Rio de Janeiro - Brazil

August 27, 2013

Freitas, Silva, Carvalho, O’Riain, Curry 1/ 54

Page 2: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Outline

Motivation & Objective

Existing Approaches

Structured Discourse Graphs (SDGs)

Representation RequirementsSemantic Model Elements & Graph PatternsSemantic Model - Formalization

Extraction

Graphia ExtractorPreliminary EvaluationExtraction Examples

Conclusion & Future Work

Freitas, Silva, Carvalho, O’Riain, Curry 2/ 54

Page 3: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Motivation & Objective

Freitas, Silva, Carvalho, O’Riain, Curry 3/ 54

Page 4: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Motivation

Freitas, Silva, Carvalho, O’Riain, Curry 4/ 54

Page 5: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Motivation

Freitas, Silva, Carvalho, O’Riain, Curry 5/ 54

Page 6: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Motivation

Integration of unstructured information into the Linked Data Web

Natural Language Texts

Terminological variation

Complex context patterns

Ambiguous sentences

Linked Data

Terminological and structural regularity

Consensual semantics - agreement between data consumers

Freitas, Silva, Carvalho, O’Riain, Curry 6/ 54

Page 7: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Motivation

Traditional Information Extraction

Triples - Primary concerns

accuracy

consistency

lexical and structural normalization (schema/ontology)

Complemented - Best-Effort Information Extraction

domain-independency

context capture

maximization of the text semantics representation

wider extraction scope

Freitas, Silva, Carvalho, O’Riain, Curry 7/ 54

Page 8: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Motivation

What is the relationship between Barack Obama and Indonesia?

Freitas, Silva, Carvalho, O’Riain, Curry 8/ 54

Page 9: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Objective

Structured Discourse Graph (SDG) Model

Improve semantic integration, representation and interpretation of

unstructured text within the context of Linked Data.

Freitas, Silva, Carvalho, O’Riain, Curry 9/ 54

Page 10: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Objective

Features

Entity-Centric Data Model

Ontology/Vocabulary Agnostic Representation

Context Representation

Graph Extraction

Interpretation (Navegational) Model

Freitas, Silva, Carvalho, O’Riain, Curry 10/ 54

Page 11: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Existing Approaches

Freitas, Silva, Carvalho, O’Riain, Curry 11/ 54

Page 12: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Existing Approaches

Simple Relations (Triples)

Discourse Representation Structure (DRS)

Ontology-based

Freitas, Silva, Carvalho, O’Riain, Curry 12/ 54

Page 13: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Our Work

Focuses on:

Providing a principled description of a representation modelcomplementary to existing approaches

Supports context representation

Defining a graph representation approach for texts which canbe easily integrated to the Linked Data Web

Freitas, Silva, Carvalho, O’Riain, Curry 13/ 54

Page 14: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Structured Discourse Graphs - SDGs

Representation Requirements

Freitas, Silva, Carvalho, O’Riain, Curry 14/ 54

Page 15: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Representation Requirements

Entity-centric graph model

Entity pivot : named entity (subject or object)

Document Centric ⇒ Entity Centric

Maximized representation of text semantics

Information in sentence ⇒ graph representation

Resulting graph should support an algorithmic interpretation ofthe extracted SDG

Conceptual model independency

Ontology agnostic

Freitas, Silva, Carvalho, O’Riain, Curry 15/ 54

Page 16: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Representation Requirements

Context Capture & Representation

Contextual information related to a triple statement should berepresented in the extracted graph

Contextual statements (such as temporality) define the contextin which another statement holds

Dependencies between sentences in the text should also bemade explicit

Freitas, Silva, Carvalho, O’Riain, Curry 16/ 54

Page 17: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Representation Requirements

Pay-as-you-go semantic reference

Unstructured text may contain complex semantic dependencies

Should support the evolution and refinement of the semanticmodel

Standardized representation compatibility

Maximize compatibility with a standards-based datarepresentation format, facilitating

graph integration

interoperability on the Web

Freitas, Silva, Carvalho, O’Riain, Curry 17/ 54

Page 18: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Structured Discourse Graphs - SDGs

Semantic Model Elements & Graph Patterns

Freitas, Silva, Carvalho, O’Riain, Curry 18/ 54

Page 19: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements

In late 1988, Obama entered Harvard Law School.

Text Segmentation: (subject,predicate,object)

Named Entities

Refer to the description of entities for which one or many rigiddesignators stands for the referent

Rigid designators: people, places, organization, biologicalspecies, substances, etc

Freitas, Silva, Carvalho, O’Riain, Curry 19/ 54

Page 20: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements

In late 1988, Obama entered Harvard Law School.

Text Segmentation: (subject,predicate,object)

Named Entities

Refer to the description of entities for which one or many rigiddesignators stands for the referent

Rigid designators: people, places, organization, biologicalspecies, substances, etc

Freitas, Silva, Carvalho, O’Riain, Curry 20/ 54

Page 21: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements

In late 1988, Obama entered Harvard Law School.

Text Segmentation: (subject,predicate,object)

Named Entities

Refer to the description of entities for which one or many rigiddesignators stands for the referent

Rigid designators: people, places, organization, biologicalspecies, substances, etc

Freitas, Silva, Carvalho, O’Riain, Curry 21/ 54

Page 22: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements

In late 1988, Obama entered Harvard Law School.

Co-Referential Elements

Two types co-references: pronominal and non-pronominal

Co-references can refer to either intra or inter sentences

Substituting the co-referent term by the named entity

Can corrupt the semantics of the representation

Freitas, Silva, Carvalho, O’Riain, Curry 22/ 54

Page 23: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements

In late 1988, Obama entered Harvard Law School.

Context Elements

A semantic interpretation may depend on different contexts(temporal context)

The main contextual information is intra-sentence

Intra-sentence context ⇒ reification

Freitas, Silva, Carvalho, O’Riain, Curry 23/ 54

Page 24: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements

In late 1988, Obama entered Harvard Law School.

Quantifiers & Generic Operators

Quantifier: one, two, (cardinal numbers), many (much),some, all, thousands of, one of, several, only, most ofNegation: notModal: could, may, shall, need to, have to, must, maybe,always, possiblyComparative: largest, smallest, most, largest, smallest, thesame, is equal, like, similar to, more than, less than

Freitas, Silva, Carvalho, O’Riain, Curry 24/ 54

Page 25: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

In late 1988, Obama entered Harvard Law School.

Text segmentation: (Obama,entered,Harvard Law School)

Named entities: Obama, Harvard Law School

Resolve co-references: Barack Obama

Context representation: time

Freitas, Silva, Carvalho, O’Riain, Curry 25/ 54

Page 26: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

In late 1988, Obama entered Harvard Law School.

Resolved & Normalized Entities

Resolved entities: a node-substitution in the graph was madefrom a co-reference to a named entity

Normalized entities: entities are transformed to a normalizedform (September 1st of 2010 to 01/09/2010)

Freitas, Silva, Carvalho, O’Riain, Curry 26/ 54

Page 27: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

Later in 2007, Obama sponsored an amendment to theDefense Authorization Act to add safeguards for

personality-disorder military discharges.

Freitas, Silva, Carvalho, O’Riain, Curry 27/ 54

Page 28: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

Freitas, Silva, Carvalho, O’Riain, Curry 28/ 54

Page 29: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

He served three terms representing the 13th District in theIllinois Senate from 1997 to 2004.

Non-Named (Generic) Entities

Non-named entities map to non-rigid designators

Are more subject to vocabulary variation

Have more complex compositional patterns

Freitas, Silva, Carvalho, O’Riain, Curry 29/ 54

Page 30: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

Following high school, Obama moved to Los Angeles in 1979to attend Occidental College.

Triple Trees

Not all facts extracted can be represented in one triple.

Transformation from the syntactic tree to a set of triples

The sentence subject defines the root node

Interpretation: DFS traversal of the tree

Freitas, Silva, Carvalho, O’Riain, Curry 30/ 54

Page 31: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

He won election to the U.S. Senate in Illinois in November2004.

Pronominal Co-Reference

Freitas, Silva, Carvalho, O’Riain, Curry 31/ 54

Page 32: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

On Agust 23, Obama announced his election of DelawareSenator Joe Biden as his vice presidential running mate.

Pronominal Co-Reference

Freitas, Silva, Carvalho, O’Riain, Curry 32/ 54

Page 33: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

As a member of the Senate Foreign Relations Commitee,Obama made official trips to Eastern Europe, the Middle

East, Central Asis and Africa.

Conjunctive Co-Reference

Freitas, Silva, Carvalho, O’Riain, Curry 33/ 54

Page 34: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

SDGs - Semantic Model Elements & Graph Patterns

Freitas, Silva, Carvalho, O’Riain, Curry 34/ 54

Page 35: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Structured Discourse Graphs - SDGs

Semantic Model - Formalization

Freitas, Silva, Carvalho, O’Riain, Curry 35/ 54

Page 36: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Graph pattern: atomic graph structure which maps to adiscourse structure

Named and Non-named Entities: [[ne]], [[∼ ne]] ∈ U,where U is a set of IRIs

Basic Triple: tr = (es , p, eo) where es , eo representnamed/non-named entities associated with the subject (s) andobject (o), and p represents a relation between es and eo ,interpreted by [[tr ]] = ([[es ]], [[p]], [[eo ]]) ∈ U × U × U

Core Triple: trc = (nes , p, neo)

Freitas, Silva, Carvalho, O’Riain, Curry 36/ 54

Page 37: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Reification Triple: trrei = (tr , reilink , reiobj )

tr - basic triplesreilink - relationreiobj - reification object

Temporal Reification: special reification triple

reilink : timereiobj : explicit or implicit data references

Interpretation:[[trrei ]] = ([[tr ]], [[reilink ]], [[reiobj ]]) ∈ U × U × U

Freitas, Silva, Carvalho, O’Riain, Curry 37/ 54

Page 38: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Quantifier Operators & Generic Operators:opt = (eo , oplink , op)

eo - object element in a basic triple trop - specific operator of eo wrt oplink

Interpretation:

[[opt]] = ([[proj3(tr)]], [[oplink ]], [[op]]) = ([[eo ]], [[oplink ]], [[op]])

where proj3(tr) = proj3((es , p, eo)) = eo

Freitas, Silva, Carvalho, O’Riain, Curry 38/ 54

Page 39: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Conjunctive Co-Reference: ccr =⋃n

i=0{(e, conjlinki , nei )}

Interpretation:

[[ccr ]] = {([[e]], [[conjlinki ]], [[nei ]]) ∈ U × U × U such that

[[e]] = [[proj3(tr)]] and (∧ni=0[[nei ]] sameas [[e]])}

Freitas, Silva, Carvalho, O’Riain, Curry 39/ 54

Page 40: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Possessive/Reflexive/Demonstrative Co-Reference:

pcr = {(∼ nei , coreflink , pr), (pr , coreflink , ej )}

Interpretation:

[[pcr ]] ={([[proj1(tr)]], [[coreflink ]], [[pr ]]), ([[pr ]], [[coreflink ]], [[proj3(tr)]]) ∈U × U × U such that tr = (∼ nei , p, ej)}

Freitas, Silva, Carvalho, O’Riain, Curry 40/ 54

Page 41: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Extracted Graph: set of basic and reified triples and generic,quantifier and co-reference operators.

Paths

Basic: sequence of basic triples

Reified: basic paths with some reified triples associated

Operational: basic paths with some operators associated

Complex: contains both reified and operational paths

Freitas, Silva, Carvalho, O’Riain, Curry 41/ 54

Page 42: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Semantic Model - Formalization

Context Triples: (tr , contextlink , ct) which indicates that abasic triple tr can be associated with a specific context ct[[context]] = ([[tr ]], [[contextlink ]], [[ct]]) ∈ U3 × U × U

Multi-Context Graphs: is an extracted graph with morethan one context associated to its triples

if all basic triples in a path belong to an unique (same)context, the path is an unique context (basic, reified,operational or complex) path

otherwise, we call this path a multi-context path

Freitas, Silva, Carvalho, O’Riain, Curry 42/ 54

Page 43: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Extraction

Graphia

Freitas, Silva, Carvalho, O’Riain, Curry 43/ 54

Page 44: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia

http://graphia.dcc.ufrj.br

Graphia is an information extraction pipeline

Takes factual text as input and produces SDGs as output

Graphia’s modules combine state-of-art NLP tools with anefficient set of heuristics

Can build graphs for sentences andentire documents.

Freitas, Silva, Carvalho, O’Riain, Curry 44/ 54

Page 45: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Extraction Pipeline Architecture

Freitas, Silva, Carvalho, O’Riain, Curry 45/ 54

Page 46: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Preliminary Evaluation

1033 relations (triples) from 150 sentences from 5 randomlyselected Wikipedia articles

Manually classified the graphs

Correct Graphs: fully consistent with the semantic model

Complete Graphs: correct graph which maps all theinformation of a sentence

Interpretable Graphs: graph fragment which has the correctsemantics of its basic triple paths

Freitas, Silva, Carvalho, O’Riain, Curry 46/ 54

Page 47: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Extraction Examples

In 2002 GE acquired the wind power assets of Enron during itsbankruptcy proceedings.

Freitas, Silva, Carvalho, O’Riain, Curry 47/ 54

Page 48: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Extraction Examples

In 1935, GE was one of the top 30 companies tradedat the London Stock Exchange.

Freitas, Silva, Carvalho, O’Riain, Curry 48/ 54

Page 49: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Extraction Examples

The Radio Corporation of America (RCA) was founded by GE in1919 to further international radio.

Freitas, Silva, Carvalho, O’Riain, Curry 49/ 54

Page 50: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Extraction Examples

It acquired ScanWind in 2009.

Freitas, Silva, Carvalho, O’Riain, Curry 50/ 54

Page 51: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Graphia - Extraction Examples

The new company, named GXS, is based in Gaithersburg,Maryland.

Freitas, Silva, Carvalho, O’Riain, Curry 51/ 54

Page 52: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Conclusion & Future Work

Freitas, Silva, Carvalho, O’Riain, Curry 52/ 54

Page 53: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Conclusion & Future Work

Conclusion

SDGs provide a discourse representation as a set ofcontextualized relationships

Supports entity-centric integration between graphs

Conceptual Model Independency

Worth putting effort on enumerable patterns (timestamps,operators)

Future Work

More complex sentence patterns

Freitas, Silva, Carvalho, O’Riain, Curry 53/ 54

Page 54: Representing Texts as contextualized Entity Centric Linked Data Graphs

,

Representing Texts as Contextualized

Entity-Centric Linked Data Graphs

Andre Freitas1 Joao C. P. da Silva2 Danilo S. Carvalho2

Sean O’Riain1 Edward Curry1

1DERI - Ireland2Universidade Federal Rio de Janeiro - Brazil

August 27, 2013

Freitas, Silva, Carvalho, O’Riain, Curry 54/ 54