
Undefined 1 (2016) 1–5
IOS Press

From hyperlinks to Semantic Web properties using Open Knowledge Extraction
Editor(s): Stefan Schlobach, Vrije Universiteit Amsterdam, NL and Krzysztof Janowicz, University of California, Santa Barbara, USA
Solicited review(s): Yingjie Hu, University of California, Santa Barbara, USA; Marieke van Erp, Vrije Universiteit Amsterdam, NL; two anonymous reviewers

Valentina Presutti a,*, Andrea Giovanni Nuzzolese a, Sergio Consoli a, Aldo Gangemi a,b and Diego Reforgiato Recupero a

a STLab, Institute of Cognitive Sciences and Technologies, National Research Council, via San Martino della Battaglia 44, 00185, Roma, Italy
E-mail: [email protected]
b LIPN, Université Paris 13 - Sorbonne Cité - CNRS
E-mail: {andrea.nuzzolese, sergio.consoli, diego.reforgiato}@istc.cnr.it, [email protected]

Abstract. Open information extraction approaches are useful but insufficient alone for populating the Web with machine readable information, as their results are not directly linkable to, and immediately reusable from, other Linked Data sources. This work proposes a novel paradigm, named Open Knowledge Extraction, and its implementation (Legalo), which performs unsupervised, open domain, and abstractive knowledge extraction from text for producing directly usable machine readable information. The implemented method is based on the hypothesis that hyperlinks (either created by humans or by knowledge extraction tools) provide a pragmatic trace of semantic relations between two entities, and that such semantic relations, their subjects and objects, can be revealed by processing their linguistic traces (i.e. the sentences that embed the hyperlinks) and formalised as Semantic Web triples and ontology axioms. Experimental evaluations conducted on validated text extracted from Wikipedia pages, with the help of crowdsourcing, confirm this hypothesis, showing high performance. A demo is available at http://wit.istc.cnr.it/stlab-tools/legalo.

Keywords: open knowledge extraction, open information extraction, abstractive summarisation, link semantics, relation extraction.

1. Populating the Semantic Web from natural language text

The vision of the Semantic Web is to populate the Web with machine understandable data so that intelligent agents will be able to automatically interpret its content, just as humans do by inspecting Web content, and assist users in performing a significant number of tasks, relieving them of cognitive overload.

The Linked Data movement [2] realised the first substantiation of this vision by bootstrapping the publication of machine understandable information,

*Corresponding author. E-mail: [email protected]

mainly taken from structured data (typically databases) or semi-structured data (e.g. Wikipedia infoboxes). However, a large part of the Web content consists of natural language text, hence a main challenge is to extract as much relevant knowledge as possible from this content, and publish it in the form of Semantic Web triples. This work aims to solve this problem by extracting relational knowledge that is "hidden" in hyperlinks, which can be either defined manually by humans (e.g. Wikipedia pagelinks) or created automatically by Knowledge Extraction (KE) systems (e.g. a KE system can automatically add links to Wikipedia pages or to local datasets of Semantic Web entities).

Current KE systems address the task of linking pieces of text to Semantic Web entities very well (e.g.



owl:sameAs) by means of named entity linking methods, e.g. NERD1 [41], FOX2, conTEXT3 [24], DBpedia Spotlight4, Stanbol5, TAGME [14], Babelfy [29]. Some of them (e.g. NERD) also perform sense tagging, i.e. adding knowledge about entity types (rdf:type).

Nevertheless, it is desirable to enrich Web content with semantic relations other than owl:sameAs and rdf:type, i.e. factual relations between entities. A pragmatic trace of a factual relation between two entities is the presence of a hyperlink, which is associated with its linguistic trace, i.e. the text surrounding the hyperlink. In fact, when we include a link in a Web page, we usually have in mind a semantic relation between something we are referring to within the page, i.e. the subject, and something referred to by the target page, i.e. the object, and the text where the hyperlink is embedded often provides an explanation of what that relation is. For example, a link to "Usenet" in the Wikipedia page of "John McCarthy" suggests a semantic relation between those two entities, which is explained by the sentence: "McCarthy often commented on world affairs on the Usenet forums"6.

Besides common sense, this hypothesis is also supported by a previous study [33], which describes the extraction of encyclopedic knowledge patterns for DBpedia types, based on links between Wikipedia pages. A user study showed that hyperlinks between Wikipedia pages determine relevant descriptive contexts for DBpedia entities at the type level, which suggests that these links mirror relevant semantic relations between entities.

A hyperlink in a Web page can be produced either by a human or by a KE system (e.g., by linking a piece of text to a Wikipedia page, which in turn refers to a Semantic Web entity, i.e. a DBpedia entity). If a KE system recognises two or more entities in a sentence, there is a possibility that the sentence expresses some relation between them. For example, the following sentence:

The New York Times reported that John McCarthy died. He invented the programming language LISP.

can be automatically enriched using a KE system by linking the text fragments "The New York Times",

1 http://nerd.eurecom.fr
2 http://aksw.org/Projects/FOX.html
3 http://context.aksw.org/app/
4 http://dbpedia-spotlight.github.com/demo
5 http://stanbol.apache.org
6 Cf. http://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)

"John McCarthy", and "LISP" to the Wikipedia pages wikipedia:The_New_York_Times7, wikipedia:John_McCarthy_(computer_scientist), and wikipedia:Lisp_(programming_language) (respectively), resulting in the following:

The New York Times reported that John McCarthy died. He invented the programming language LISP.

In this example, the three hyperlinks identify entities that are relevantly related by factual relations: "John McCarthy" with "The New York Times", and "John McCarthy" with "LISP". By generalising this concept, any recognised named entity in a sentence (even if not associated with an existing Web URI) can be treated as a potential hyperlink target (e.g. to a local knowledge base). In the rest of the paper we use examples with entities that can be resolved to DBpedia, for the sake of simplicity. Revealing the semantics of hyperlinks (either defined by humans or by KE systems) has a high potential impact on the amount of Web knowledge that can be published in machine readable form.

In the Semantic Web era, such factual relations should be expressed as RDF triples where subjects, objects, and predicates have a URI (except for literal objects and blank nodes), and predicates are formalised as RDF/OWL properties, in order to facilitate their reuse and alignment to existing vocabularies, and, for example, to annotate hyperlinks with RDFa within HTML anchor tags.

While subjects and objects can mostly be directly resolved to existing public or local Semantic Web entities, predicates are to be defined by performing "paraphrasing", a summarisation task that abstracts over the text (when needed) in order to design labels that are as close as possible to what a human would design for a Linked Data vocabulary. In this respect, [40] distinguishes between extractive and abstractive summarisation approaches. Extractive methods select pieces of text from the original source in order to define a summary (i.e. they rely only on the available text), while abstractive techniques ideally rely on modeling the text, and then combining it with other resources and language generation techniques to generate a summary. Abstractive methods are usually applied to large documents with the aim of producing a meaningful summary of their content.

This work proposes to apply the guiding principle of abstractive techniques to open information extraction as a novel contribution.

7 wikipedia: stands for http://en.wikipedia.org/wiki/


Subject | Predicate | Object | Approach
John Stigall | received | a Bachelor of arts | extractive
John Stigall | received from | the State University of New York at Cortland | extractive
dbpedia:John_Stigall | myprop:receive_academic_degree | dbpedia:Bachelor_of_arts | abstractive
dbpedia:John_Stigall | myprop:receive_academic_degree_from | dbpedia:State_University_of_New_York | abstractive

Table 1: Comparison between relations resulting from extractive and abstractive approaches for the sentence "John Stigall received a Bachelor of arts from the State University of New York at Cortland".

Open information extraction refers to an open domain and unsupervised extraction paradigm. Existing open information extraction approaches are mainly extractive, hence showing a complementary nature to what we present in this paper. They mostly focus on breaking text into meaningful fragments for building resources of relational patterns (e.g. PATTY [30]8, Wisenet [28]9), in some cases disambiguated against external semantic resources such as WordNet10. Others focus on extracting facts, which are represented as simplified strings between entities (e.g. Open Information Extraction (OIE) [26,12]11) that are not given a Semantic Web identity.

Knowledge extraction for the Semantic Web should instead include an abstractive step, which exploits a formal semantic representation of text, and produces output that is compliant with Semantic Web principles and requirements. The method described in this paper demonstrates this novel approach, called open knowledge extraction (OKE). For example, given the sentence:

John Stigall received a Bachelor of arts from the State University of New York at Cortland.

Table 1 compares the relations extracted by an extractive approach (such as OIE [26]12), in the first two rows, with those from an abstractive approach, in the last two rows. The abstractive results exemplify the expected output of an OKE system. The main difference is that with the abstractive approach, subjects and objects are identified as Semantic Web entities, and the predicate is as close as possible to what a human would define for a Linked Data vocabulary, possibly using terms that are not mentioned in the original text. In addition to what Table 1 shows, the predicate would be formally

8 https://d5gate.ag5.mpi-sb.mpg.de/pattyweb/
9 http://lcl.uniroma1.it/wisenet/
10 http://wordnet.princeton.edu/
11 http://openie.cs.washington.edu/
12 Notice that this is the output of OIE for this sentence.

defined in terms of OWL axioms and possibly aligned with existing Semantic Web vocabularies.

1.1. Contribution

The main contributions of this work are:

– the introduction of Open Knowledge Extraction (OKE), a paradigm based on unsupervised, open domain, and abstractive knowledge extraction from text for producing directly usable machine readable information;

– an implementation of OKE, named Legalo, which, given an English sentence, produces a set of RDF triples representing relevant factual relations expressed in the sentence, whose predicates are formally defined in terms of OWL axioms;

– an evaluation of Legalo performed on a corpus of validated sentences from Wikipedia pages that provide evidence of factual relations. The results have been evaluated with the help of crowdsourcing and the creation of a gold standard, all showing high values of precision, recall, and accuracy;

– a discussion highlighting the current limits of the approach and possible ways of improving it, including an informal comparison of the proposed method with one of the main existing open information extraction tools.

Additionally, the paper includes a brief description of a specific implementation of OKE, specialised for extracting the semantics of Wikipedia pagelinks, which has been evaluated in [38] showing promising results.

The paper is structured as follows: Section 2 introduces a novel paradigm named Open Knowledge Extraction. Sections 3 and 4 describe the implementation of an OKE system, named Legalo, focusing on the method implemented and the pipeline of components, respectively. Legalo has been evaluated with the help of crowdsourcing, as described in Section 5. Section 6 discusses the limits of the method and possible


ways to improve it, and informally compares Legalo with Open Information Extraction (OIE) [26]. Section 7 discusses relevant research work and, finally, Section 8 summarises the contribution of this work and indicates future work.

2. Introducing Open Knowledge Extraction

According to [12], an Open Information Extraction (OIE) system "facilitates domain independent discovery of relations extracted from text and readily scales to the diversity and size of the Web corpus". In other words, OIE revolutionised the information extraction paradigm by introducing unsupervised learning, domain-independence of the extracted relations, and the ability to scale along both the size and heterogeneity dimensions of the Web. The Open Knowledge Extraction (OKE) paradigm focuses on making the extracted relations directly usable in a Semantic Web context.

An Open Knowledge Extraction (OKE) system is expected to perform unsupervised, open domain, and web scale extraction, and to additionally have the following capabilities:

Relation assessment To assess if a natural language sentence provides evidence of a relevant relation between a given pair of entities, which may be identified by hyperlinks; relevant here means that there are enough explicit traces in the sentence to support the existence of a (conceptual) relation;

Label generation To generate a predicate for this relation, with a label that is as close as possible to what a human would define for a Linked Data vocabulary;

Property formalisation To formalise this relation as an OWL object property with TBox axioms (conceptual level), as well as to produce ABox axioms (factual level) using that property.

More formally:

Definition 1. (Relevant relation) Let s be a natural language textual sentence embedding some hyperlinks, and (esubj, eobj) a pair of entities mentioned in s, where esubj and eobj are the target entities referred to by two hyperlinks in s. ϕs(esubj, eobj) is a relevant relation between esubj and eobj, expressed in s, with esubj being the subject of ϕ and eobj being its object. Λ ≡ {λ1, ..., λn} is a set of Linked Data labels generated by humans for ϕs(esubj, eobj). Finally, λ′ is a label generated by an OKE system for ϕs(esubj, eobj).

An OKE system is able to assess the existence of ϕs(esubj, eobj), to generate a label λ′ equal or very similar to some λi ∈ Λ, and to formalise it as a Semantic Web property. Notice that not all relations are binary: for example, events have time and space indexing, and there are relations that naturally take more than two arguments, e.g. "Mary gave a book to John, as a present for his birthday". For this reason an OKE system has to take into account the n-ary nature of relations and cope with expressing them as triples, given the pragmatic constraint of Semantic Web standard languages. This impacts the complexity of assessing the existence of ϕs(esubj, eobj) and of generating an adequate label for it, especially when ϕs is the projection of an n-ary relation.

3. Legalo: an OKE implementation that generates Semantic Web properties from text

One of the main contributions of this paper is the implementation (and evaluation, cf. Section 5) of an OKE system, named Legalo13.

The method implemented by Legalo is based on six main steps:

1. internal formal representation of the sentence (abstractive step);
2. assessment of the existence of a relevant relation between pairs of entities identified in s, according to the content of the sentence;
3. extraction of relevant terms for the predicate (extractive step);
4. generation of the predicate label (abstractive step);
5. formal definition of the predicate within the scope of its linguistic evidence and formal representation (abstractive step);
6. alignment (whenever possible) to existing Semantic Web properties.

3.1. Frame-based formal representation of a sentence

Legalo relies on a set of rules to be applied to a frame-based formal representation G of the sentence s (cf. Definition 2). G is an RDF graph designed following a

13 A demo of Legalo is available at http://wit.istc.cnr.it/stlab-tools/legalo/


Fig. 1.: Frame-based formal representation for the sentence: "The New York Times reported that John McCarthy died."

frame-based approach, where nodes represent entities mentioned in s.

Definition 2. (Frame-based graph) Let s be a natural language text sentence and G = (V, E) an RDF (directed, multi-) graph modelling a frame-based formal representation of s, where V ≡ {v0, ..., vn} is the set of nodes (i.e. subjects and objects from RDF triples) in G, and E ≡ {edge1, ..., edgen} is the set of edges (i.e. RDF triples) in G, where edgei = (vi−1, p, vi) is a triple connecting vi−1 and vi with the RDF property p in G, and vi ∈ V is the node in G representing the entity ei mentioned in s.

Frame Semantics [15] is a formal theory of meaning: its basic idea is that humans can better understand the meaning of a single word by knowing the relational knowledge associated with that word. For example, the sense of the word buy can be clarified in a certain context or task by knowing about the situation of a commercial transfer that involves certain individuals playing specific roles, e.g. a seller, a buyer, goods, money, etc.

In this work, frames are usually expressed by verbs or other linguistic constructions, and their occurrences in a sentence are represented as RDF n-ary relations, all being instances of some type of event or situation (e.g. myont:buy_1 rdf:type myont:Buy), which is in turn represented as a subclass of

dul:Event14. Intuitively, dul:Event is the top category of all frames expressed by verbs. In the context of this paper, the terms frame occurrence and event occurrence are used as synonyms. Entities that are mentioned in s are represented as individuals or classes, depending on their nature, which (ideally) have a type, defined based on the information available in the sentence s. When appropriate, entities are represented as arguments of n-ary relations, according to the role they play in the corresponding frame occurrence. The role of an entity in an event occurrence can be expressed either by a preposition, e.g. "Rico Lebrun taught at the Chouinard Art Institute", or it can be abstracted from the text and represented by reusing the set of thematic roles defined by VerbNet [42], e.g. Rico Lebrun is the agent of the event occurrence "teach" in the above sample sentence.

A formal and detailed discussion of the theory behind the frame-based formal representation of knowledge extracted from text, and used by Legalo, is beyond the scope of this paper. This modeling approach and its founding theories are extensively described in [15,39,32]. However, an example may be useful to convey the intuition behind the theory. Figure 1 shows a frame-based representation of the sentence:

The New York Times reported that John McCarthy died.

14 The prefix dul: stands for http://www.ontologydesignpatterns.org/ont/dul/dul.owl#


The knowledge extracted from the sentence s is formalised as a set of RDF triples G. The figure is derived from the output of FRED [39] (see Section 4), the component providing the frame-based formal representation within Legalo. The prefix fred:

stands for a local configurable namespace. Two entities can be identified in this sentence, i.e. "New York Times" and "John McCarthy", represented in G as the individuals fred:New_York_Times and fred:John_McCarthy, respectively. Two frame occurrences can be identified in the sentence: one expressed by (an inflected form of) the verb report and the other expressed by the verb die. These frame occurrences are represented as n-ary relations, i.e. fred:report_1 and fred:die_1, both being instances of classes (fred:Report and fred:Die respectively) that are of type dul:Event. Let us consider the event occurrence fred:report_1. Its arguments are: (i) fred:New_York_Times, which plays an agentive role in this event occurrence, formally expressed by the predicate vn.role:Agent15, and (ii) fred:John_McCarthy, who plays a passive role, formalised by the predicate vn.role:Theme; both are VerbNet thematic roles.
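For readers who want to experiment with such graphs, the following minimal sketch loads a hand-written Turtle approximation of the graph in Figure 1 with rdflib. The fred: IRI is an assumption (the paper states that fred: is a local configurable namespace), vnrole: stands for the paper's vn.role: prefix, and the exact shape of FRED's output may differ; only the modelling pattern (frame occurrences typed by subclasses of dul:Event, with VerbNet roles) follows the text above.

from rdflib import Graph

# Hand-written Turtle approximating Figure 1 (assumed namespaces).
TTL = """
@prefix fred:   <http://www.ontologydesignpatterns.org/ont/fred/domain.owl#> .
@prefix dul:    <http://www.ontologydesignpatterns.org/ont/dul/dul.owl#> .
@prefix vnrole: <http://www.ontologydesignpatterns.org/ont/vn/abox/role/> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .

fred:report_1 a fred:Report ;
    vnrole:Agent fred:New_York_Times ;
    vnrole:Theme fred:John_McCarthy .
fred:die_1 a fred:Die ;
    vnrole:Theme fred:John_McCarthy .
fred:Report rdfs:subClassOf dul:Event .
fred:Die    rdfs:subClassOf dul:Event .
"""

g = Graph()
g.parse(data=TTL, format="turtle")
print(len(g), "triples loaded")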

3.2. Relevant relation assessment

To assess if a relevant relation ϕs(esubj, eobj) exists in s between a pair of entities (esubj, eobj), Legalo relies on the analysis of the semantic structure of G. Firstly, ϕs(esubj, eobj) is assumed to hold only if there is at least one path in G connecting vsubj and vobj, i.e. the nodes representing esubj and eobj in G, regardless of the edge direction in G. This is formally expressed by Axiom 1, given Definition 3.

Definition 3. (Graph path) G′ = (V, E′) is the undirected version of G = (V, E). A path P(vsubj, vobj) = [v0, edge1, ..., edgen, vn], with v0 = vsubj and vn = vobj, is any sequence alternating nodes and edges in G′ connecting vsubj to vobj, or vice versa. The set Psetsubj,obj ≡ {edge1, v1, ..., edgen} includes all edges and nodes in P(vsubj, vobj), excluding vsubj and vobj.

Axiom 1. (ϕ assessment: necessary condition)

ϕs(esubj, eobj) ⇒ ∃P(vsubj, vobj)

15 Prefix vn.role: stands for http://www.ontologydesignpatterns.org/ont/vn/abox/role/, which defines all VerbNet [42] thematic roles.

If P(vsubj, vobj) exists, Legalo distinguishes whether P(vsubj, vobj) contains an event occurrence or not. If P(vsubj, vobj) does not contain any event occurrence, then the existence of P(vsubj, vobj) is a sufficient condition for the existence of ϕs(esubj, eobj) (cf. Axiom 2).

Axiom 2. (Assessment of ϕ: sufficient condition without event occurrences)
ϕs(esubj, eobj) ⇐ ∃P(vsubj, vobj) such that ∀vi ∈ Psetsubj,obj, ¬dul:Event(vi)

In the other case, i.e. when the path includes an event occurrence, ϕs(esubj, eobj) exists if esubj is the subject of the event verb in the sentence. In the graph G this means that the node vsubj representing esubj participates in the event occurrence with an agentive role. This is formalised by Axiom 3, given Definition 4.

Definition 4. (Agentive roles) Let f be a node of G such that dul:Event(f). Role ≡ {ρ1, ..., ρn} is the set of possible roles participating in f, AgRole ≡ {ρm1, ..., ρmm} is the set of VerbNet agentive roles, with AgRole ⊆ Role, and ρ(f, vsubj) is a role connecting the event occurrence f to its participant vsubj (the node representing esubj) in s.

Axiom 3. (Assessment of ϕ with event occurrences: sufficient condition)
ϕs(esubj, eobj) ⇐ ∃P(vsubj, vobj) and ∃f ∈ Psetsubj,obj such that dul:Event(f) and ρ(f, vsubj) ∈ AgRole
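Operationally, Axioms 1-3 amount to an undirected reachability test followed by a role check on the shortest path. The sketch below (illustrative only: the agentive role set, the treatment of schema-level triples, and the helper names are assumptions, not Legalo's actual code) expresses this over an rdflib graph using networkx.

import networkx as nx
from rdflib import Graph, RDF, RDFS, URIRef

DUL_EVENT = URIRef("http://www.ontologydesignpatterns.org/ont/dul/dul.owl#Event")
VN = "http://www.ontologydesignpatterns.org/ont/vn/abox/role/"
AGENTIVE = {URIRef(VN + r) for r in ("Agent", "Actor1", "Actor2", "Experiencer")}  # assumed AgRole subset

def is_event(g: Graph, node) -> bool:
    # A node is an event occurrence if one of its types is dul:Event or a subclass of it.
    return any(t == DUL_EVENT or (t, RDFS.subClassOf, DUL_EVENT) in g
               for t in g.objects(node, RDF.type))

def assess_phi(g: Graph, v_subj, v_obj) -> bool:
    u = nx.Graph()                              # G' of Definition 3 (undirected)
    for s, p, o in g:
        if p in (RDF.type, RDFS.subClassOf):
            continue                            # keep only the assertional part of G
        u.add_edge(s, o, pred=p)
    if v_subj not in u or v_obj not in u or not nx.has_path(u, v_subj, v_obj):
        return False                            # Axiom 1 (necessary condition) fails
    inner = nx.shortest_path(u, v_subj, v_obj)[1:-1]
    events = [n for n in inner if is_event(g, n)]
    if not events:
        return True                             # Axiom 2: no event occurrence on the path
    # Axiom 3: v_subj must fill an agentive role of an event on the path.
    return any((f, rho, v_subj) in g for f in events for rho in AGENTIVE)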

This axiom is based on results in linguistic typology (e.g. [7]), by which SVO (Subject-Verb-Object) languages such as English almost always have an explicit (or explicitable) subject. This subject is formalised in a frame-based representation of s by means of an agentive role. Based on this observation, our method assumes that ϕs(esubj, eobj) exists if esubj is the subject of a verb in s. This axiom is potentially restrictive with respect to the idea of a relevant relation expressed in a sentence, which may consider any pair of entities as related just because they are mentioned in the same sentence. In fact, this idea is quite difficult to implement, since relations between pairs of entities that play e.g. oblique roles (oblique roles are neither agentive nor passive, e.g. "manner", "location", etc.) in a frame occurrence are hard to paraphrase even for a human. For example, consider the sentence:


After a move to Southern California in 1938, Rico Lebrun taught at the Chouinard Art Institute and then at the Disney Studios.

the frame-based representation of this sentence, depicted in Figure 2, identifies Rico Lebrun as the agent of a "teach" frame occurrence, while Southern California, Chouinard Art Institute, and Disney Studios participate in it with oblique roles. This sentence expresses three relevant relations: one between Rico Lebrun and Chouinard Art Institute, one between Rico Lebrun and Disney Studios, and another between Rico Lebrun and Southern California. All these relations can be summarised and represented as RDF triples by Legalo.

While it is correct to state that Chouinard Art Institute and Disney Studios co-participate in an occurrence of the frame "teach", it is far from straightforward to paraphrase the meaning of this relation. E.g., one might say that Chouinard Art Institute and Disney Studios are both places where Rico Lebrun used to teach, but this paraphrase is not easily reconstructable from the text, and needs a stronger language generation approach, which has not been tackled for the moment. Additionally, such a paraphrase would not be usable for a binary predicate. A way to represent this relation is a generic co-participation relation, which is however too generic to be considered relevant.

For this reason, the investigation of paraphrases of relations between entities co-participating in an event with oblique roles is left to further study. An interesting analysis of this problem, which could suggest new work directions, is discussed in [8].

3.3. Combining extractive and abstractive design for property label generation

As far as the generation of λ′ is concerned (cf. Definition 1), Legalo combines extractive with abstractive techniques [40]: it both reuses the terms in the text (extractive) and generates other terms derived from a semantic analysis of the text (abstractive). To this aim, it uses the semantic information provided by the frame-based representation G of the sentence s, which is further enriched with knowledge retrieved from external semantic resources. Legalo relies on the following knowledge resources:

– DBpedia [3] is the RDF version of Wikipedia and is used for resolving (disambiguating) the nodes {vi} ∈ V that represent the entities {ei} in the sentence s, on Linked Data;

– Schema.org16 is a set of vocabularies for classifying entities on the Web. Schema.org is promoted by the most important search engines (Google, Yahoo!, Bing, and Yandex), making it a reference resource of its kind, which is why we decided to use it in Legalo for typing the recognised entities;

– WiBi [16] is a Wikipedia bitaxonomy: a refined, rich and high quality taxonomy that integrates Wikipedia pages and categories. Legalo uses WiBi as a reference semantic resource for designing the labels of generated properties, in particular for its "paraphrasing" task (i.e. the abstractive step), when the extracted terms are too general to be informative enough. WiBi proved to be the best resource for this task, compared with the DBpedia ontology and YAGO, based on empirical tests conducted on a sample of ∼200 sentences;

– VerbNet [42] is the largest domain-independent hierarchical verb lexicon available for English, which is one of the reasons why we decided to use it. Additionally, Legalo inherits it from FRED, its core component. VerbNet is organised into verb classes, each described by thematic roles, selectional restrictions on the arguments, and frames. From VerbNet, Legalo obtains the thematic roles played by the DBpedia entities participating in a frame occurrence. Additionally, it is used for disambiguating the sense of frame occurrences. A subset of VerbNet thematic roles are mapped to specific prepositions, which are used in the paraphrasing task. The map is provided later in this section (cf. GR 3).

For example, consider the sentence:

In February 2009 Evile began the pre-production process for their second album with Russ Russell.

Figure 3 shows the enriched frame-based formal representation of this sentence. The graph does not show WiBi types, but they are actually retrieved by Legalo for each resolved DBpedia entity. Two entities are resolved on DBpedia, i.e. dbpedia:Evile and dbpedia:Russ_Russell, and two frame occurrences are identified, i.e. fred:begin_1 and fred:process_1. Furthermore, each node is assigned a type that, when possible, is aligned to existing Linked Data vocabularies. For example,

16 http://schema.org/


Fig. 2.: Frame-based formal representation for the sentence: "After a move to Southern California in 1938, Rico Lebrun taught at the Chouinard Art Institute and then at the Disney Studios". Legalo will select the pairs of entities (fred:Rico_lebrun, fred:Chouinard_art_institute), (fred:Rico_lebrun, fred:Disney_studios), and (fred:Chouinard_art_institute, fred:Southern_California).

Fig. 3.: Frame-based formal representation for the sentence: "In February 2009 Evile began the pre-production process for their second album with Russ Russell". The graph is enriched with verb senses to disambiguate frame types, DBpedia entity resolutions, thematic roles played by DBpedia entities participating in frame occurrences, and entity types.

dbpedia:Evile has type schema.org:MusicGroup (prefix schema.org: stands for http://schema.org), and the entity fred:album_1 (representing the album mentioned in the sentence) is typed by the taxonomy fred:SecondAlbum rdf:type fred:Album. Following Axiom 1 and Axiom 3 (cf. Section 3.2), Legalo will select from the graph of Figure 3 the pair of (DBpedia) entities: dbpedia:Evile, dbpedia:Russ_Russell.

The Legalo design strategy for generating predicate labels is based on three main generative rules (GR). The first one concerns the concatenation of the labels that are used in the shortest path connecting the two nodes,

including the labels of the edges and the labels of the node types in the path. This rule is defined by GR 1. It is important to remark that the path used as a reference for generating the predicate label is the one connecting the nodes vsubj and vobj, and not the corresponding resolved DBpedia entities.

GR 1. (Labels concatenation) Given a pair (vsubj, vobj):

– identify the shortest path(s) P(vsubj, vobj) connecting vsubj and vobj;
– extract all labels (matching sentence terms) of the edges in the path;


– extract all labels of the most general types of the nodes that compose the path (if a node is typed by a taxonomy, the most general type in the taxonomy is extracted), except the types of vsubj and vobj;
– concatenate the extracted labels following their alternating sequence in P(vsubj, vobj).

Hence, referring to Figure 3, Legalo will produce a predicate label λ = "begin process for album with" for expressing ϕs(Evile, Russ Russell). Notice that the only labels included in the concatenation are those with prefix fred:, meaning that they are extracted from s.
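A compact sketch of GR 1 follows (illustrative, not Legalo's code). It assumes the path is the alternating node/edge sequence of Definition 3, an edge_label callable that returns the term extracted from s for fred: edges (applying GR 3 to thematic roles), and a type_label callable that returns the label of a node's most general type (empty for vsubj, vobj and untyped nodes).

import re

def local_name(iri: str) -> str:
    """'...#SecondAlbum' -> 'second album'; '...#begin_1' -> 'begin'."""
    frag = iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return " ".join(w.lower() for w in re.findall(r"[A-Z]?[a-z]+", frag))

def gr1_label(path, edge_label, type_label) -> str:
    """path = [v_subj, edge_1, v_1, ..., edge_n, v_obj] (cf. Definition 3)."""
    parts = []
    for i, elem in enumerate(path[1:-1], start=1):
        parts.append(edge_label(elem) if i % 2 == 1 else type_label(elem))
    return " ".join(p for p in parts if p)

On the path of Figure 3 the non-empty labels are "begin", "process", "for", "album", and "with", so the concatenation reproduces the label discussed above.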

The second rule for generating predicate labels takes into account the possible presence of an event occurrence in the path connecting the pair (vsubj, vobj). Intuitively, in this case the path is a tree rooted in an event occurrence, i.e. a node f such that dul:Event(f). The labels in this case are extracted only from the path starting from f and ending in vobj (referred to as the right branch of the tree), including also the label of f's type. The rationale behind this rule is that the right branch of the tree, including the root event (i.e. its type), provides the relevant information expressing the relation between the two nodes, according to an empirical observation conducted on a sample of ∼200 cases.

For example, consider the (excerpt of the) frame-based representation of the sentence "Joey Foster Ellis has published on The New York Times, and The Wall Street Journal." shown in Example 3.1.

Example 3.1. (Path including an event)

fred:publish_1 rdf:type fred:Publish ;
    vn.role:Agent fred:Joey_Foster_Ellis ;
    fred:on fred:New_York_Times ;
    fred:on fred:Wall_Street_Journal .
fred:Publish rdfs:subClassOf dul:Event .
fred:Joey_Foster_Ellis owl:sameAs dbpedia:Joey_Foster_Ellis .
fred:New_York_Times owl:sameAs dbpedia:The_New_York_Times .
fred:Wall_Street_Journal owl:sameAs dbpedia:Wall_Street_Journal .

Following GR 1 and applying this additional rule for the selected pair (dbpedia:Joey_Foster_Ellis, dbpedia:Wall_Street_Journal) leads to a predicate λ = "publish on" for ϕs(Joey Foster Ellis, Wall Street Journal).

Additionally, if the right branch of the tree path is of length 1 and the only edge is a passive role, i.e. vobj participates with a passive role in f, the label of the WiBi type of vobj is concatenated to the predicate label. The rationale behind this rule is that when vsubj and vobj play respectively an agentive and a passive role in an event occurrence, the resulting predicate label following only GR 1 would be too generic; hence adding the WiBi type label makes the property label more specific and informative.

For example, a frame-based representation of the sentence "Elton John plays the piano" is given in Example 3.2:

Example 3.2. (Right branch of tree path with only passive role)

fred:play_1 rdf:type fred:Play ;
    vn.role:Agent fred:Elton_John ;
    vn.role:Theme fred:piano_1 .
fred:Elton_John owl:sameAs dbpedia:Elton_John .
fred:piano_1 rdf:type dbpedia:Piano .
fred:Play rdfs:subClassOf dul:Event .
dbpedia:Piano rdf:type wibi:MusicalInstrument .

If we apply the additional rules described so far to the pair (dbpedia:Elton_John, dbpedia:Piano), we obtain a label λ = "play musical instrument" for ϕs(Elton John, piano), which is more informative than the simple "play" that would result without adding the WiBi type label of dbpedia:Piano. This rule is defined by GR 2.

GR 2. (Path including event occurrences) Given a selected pair (vsubj, vobj) and the shortest path P(vsubj, vobj) connecting them, if P(vsubj, vobj) is a tree rooted in f, such that dul:Event(f):

– extract P(f, vobj) from P(vsubj, vobj);
– extract all edge labels in P(vsubj, vobj) that match terms extracted from s;
– for each vi (including f and excluding vobj) in P(f, vobj), extract the label of its most general type;
– concatenate the extracted labels following their alternating sequence in P(f, vobj);
– if P(f, vobj) has only 1 edge (length = 1), and this edge identifies a VerbNet passive role, then extract the WiBi type of vobj and append it to the label concatenation.
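A sketch of GR 2 in the same style (again illustrative; is_passive_role and wibi_type_label are assumed helpers, while edge_label and type_label are as in the GR 1 sketch) restricts the traversal to the right branch and appends the WiBi type when that branch has length 1.

def gr2_label(path, f_index, edge_label, type_label,
              is_passive_role, wibi_type_label) -> str:
    """path = [v_subj, ..., f, ..., v_obj]; the event root f sits at f_index."""
    right = path[f_index:]                     # P(f, v_obj), the right branch
    parts = [type_label(right[0])]             # include the label of f's event type
    for i, elem in enumerate(right[1:-1], start=1):
        parts.append(edge_label(elem) if i % 2 == 1 else type_label(elem))
    label = " ".join(p for p in parts if p)
    if len(right) == 3 and is_passive_role(right[1]):      # right branch of length 1
        label += " " + wibi_type_label(right[-1])          # e.g. "play musical instrument"
    return label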


The third rule for predicate label generation complements GR 1 and GR 2 by associating VerbNet roles with labels. Such labels have been defined top-down by analysing VerbNet thematic roles and their usage examples. The rule is defined in GR 3.

GR 3. (Thematic role labels) If a path contains a VerbNet thematic role, replace its label with an empty one, unless the role is associated with a non-empty label according to the following scheme:

vn.role:Actor1 -> "with"
vn.role:Actor2 -> "with"
vn.role:Beneficiary -> "for"
vn.role:Instrument -> "with"
vn.role:Destination -> "to"
vn.role:Topic -> "about"
vn.role:Source -> "from"
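The scheme above transcribes directly into a lookup table; a minimal Python rendering (roles not listed map to the empty label, per the rule) is:

VN_ROLE_LABELS = {
    "Actor1": "with",
    "Actor2": "with",
    "Beneficiary": "for",
    "Instrument": "with",
    "Destination": "to",
    "Topic": "about",
    "Source": "from",
}

def role_label(role_local_name: str) -> str:
    """GR 3: map a VerbNet thematic role to a preposition, or to ''."""
    return VN_ROLE_LABELS.get(role_local_name, "")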

For example, consider the (excerpt of the) frame-based representation of the sentence "Lincoln's wife suspects that John Wilkes Booth and Andrew Johnson conspired to kill Lincoln." shown in Example 3.3.

Example 3.3. (Thematic roles associated with labels)

fred:conspire_1 rdf:type fred:Conspire ;
    vn.role:Actor1 fred:Andrew_Johnson ;
    vn.role:Actor2 fred:John_Wilkes_Booth .
fred:Conspire rdfs:subClassOf dul:Event .
fred:Andrew_Johnson owl:sameAs dbpedia:Andrew_Johnson .
fred:John_Wilkes_Booth owl:sameAs dbpedia:John_Wilkes_Booth .
fred:Lincoln owl:sameAs dbpedia:Abraham_Lincoln .

By applying GR 1, 2 and 3 to the path connecting the pair (dbpedia:Andrew_Johnson, dbpedia:John_Wilkes_Booth), Legalo generates a label λ = "conspire with" for ϕs(Andrew Johnson, John Wilkes Booth). The mapping scheme (role <-> label) is an evolving resource, which improves based on the periodic evaluation of Legalo outputs.

3.4. Formalisation of extracted knowledge

Given a textual sentence s and its frame-based formal representation G, by following the generative rules described in Section 3.3 Legalo generates a label λ for each relation ϕs(esubj, eobj) that it is able to identify in s, based on the shortest path P(vsubj, vobj) connecting (vsubj, vobj) in G (cf. Definitions 1, 2, and 3). These labels constitute the basis for automatically generating a set of RDF triples that can be used for semantically annotating the hyperlinks included in s. Additionally, this set of triples provides a (formalised) summary of s.

The aim of the formalisation step is to favour the reuse of the extracted knowledge by representing it as RDF triples, by augmenting it with informative annotations and axiomatisation, and by linking it to existing Semantic Web data. In particular, the formalisation step addresses the following tasks:

– producing an RDF triple (vsubj, pλ, vobj) for each hyperlink in s associated with eobj, such that ϕs(esubj, eobj) exists in s, where pλ is a predicate having label λ, vsubj is the node in G representing esubj, and vobj is the node in G representing eobj;
– formally defining pλ: its domain and range, and possibly other OWL axioms that specify its formal semantics;
– annotating each triple (vsubj, pλ, vobj) with information about its linguistic evidence, i.e. the sentence s;
– annotating each triple and predicate with information about the frame-based formal representation from which they were extracted.

RDF triples can be used for annotating hyperlinks, e.g. with RDFa; OWL axiomatisation supports ontology reuse; and scope annotations (i.e. linguistic evidence and formal representation) support reuse in relation extraction systems, e.g. relation extraction based on distant supervision [27,1].

Locality of produced predicates. Our method works on the assumption that each generated predicate and its associated formalisation are valid in the conceptual scope identified by the sentence s. This means that s identifies the scope of predicate name definitions, i.e. the namespace of a predicate depends on s. Pragmatically, this is implemented in Legalo by including the checksum of s in the predicate namespace. This strong locality constraint may lead to producing a high number of potentially equivalent properties (i.e. having the same intensional meaning) defined as if they were different. This issue is tackled by formalising all predicates with domain and range axioms having values, i.e. classes, from external (open domain) resources, as


Fig. 4.: Frame-based formal representation for the sentence: "The New York Times reported the death of John McCarthy. He invented LISP."

Fig. 5.: Legalo's triples produced from the sentence: "The New York Times reported the death of John McCarthy. He invented LISP."

well as by keeping the binding between a predicate, its linguistic evidence, i.e. s, and its formal representation source, i.e. G. The latter contains information about the disambiguated senses of the verbs, i.e. frame occurrences, used in s. All these features allow, on the one hand, inspecting a specific property to understand its meaning, e.g. in case of manual reuse, and, on the other hand, automatically reconciling predicates by computing a similarity measure based on them. In this paper we focus on the generative part of the problem, i.e. generating usable labels for predicates and producing their formal definition, while we leave the reconciliation task to future work.
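To make the locality convention concrete, the following sketch embeds a checksum of s in the predicate namespace. The hash function and base IRI are assumptions made for illustration; the paper only states that a checksum of the sentence is included in the namespace.

import hashlib

def predicate_namespace(sentence: str,
                        base: str = "http://example.org/legalo/") -> str:
    # Any stable checksum satisfies the locality constraint; md5 is used
    # here purely for illustration.
    return base + hashlib.md5(sentence.encode("utf-8")).hexdigest() + "/"

s = "The New York Times reported the death of John McCarthy. He invented LISP."
print(predicate_namespace(s) + "reportDeathOf")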

RDF factual statements. For each hyperlink in s associated with a true assessment of ϕs(esubj, eobj) (cf. Axioms 1, 2, and 3), Legalo produces at least one RDF triple. As explained in Section 3.3, the nodes vsubj and vobj in G representing esubj and eobj are resolved on DBpedia, when possible, which links the triples to the Linked Data cloud. The predicate is formalised as an OWL object property having λ′ as label and an ID derived by transforming λ′ according to the CamelCase notation17.
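A one-liner suffices for the ID derivation (cf. footnote 17): lower-case the first token of the label and capitalise the remaining ones. A sketch:

def camel_case_id(label: str) -> str:
    head, *tail = label.split()
    return head.lower() + "".join(w.capitalize() for w in tail)

assert camel_case_id("invent programming language") == "inventProgrammingLanguage"
assert camel_case_id("report death of") == "reportDeathOf"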

For example, consider the enriched frame-based formal representation of the sentence

The New York Times reported the death of John McCarthy. He invented the programming language LISP.

depicted in Figure 4. Legalo produces the triples depicted in Figure 5, according to the generative rules GR 1, 2, and 3, where the prefix legalo: is a namespace defined using the checksum of the sentence s. Notice that Figure 5 shows the WiBi types18 for the resolved DBpedia entities.

OWL property formalisation. For each generated property, Legalo produces an additional set of OWL axioms that formally define it. The predicate formalisation states that the predicate is an OWL object property, and includes domain and range axioms, whose values are defined according to the WiBi types assigned to vsubj and vobj. In case of multi-typing of an entity, the value is the union of all types. In case a WiBi type is not available, the default type is owl:Thing. Example 3.4 shows the axioms formalising the domain and range of the properties shown in Figure 5.

Example 3.4. (Domain and range axioms.)

legalo:reportDeathOf a owl:ObjectProperty ;
    rdfs:domain wibi:Newspaper ;
    rdfs:range wibi:Computer_scientist .

legalo:inventProgrammingLanguage a owl:ObjectProperty ;
    rdfs:domain wibi:Computer_scientist ;
    rdfs:range wibi:Programming_language ;
    rdfs:subPropertyOf legalo:invent .

As the reader may notice, an additional rdfs:subPropertyOf axiom is included in the formal definition of legalo:inventProgrammingLanguage. In fact, if a predicate is derived with GR 2, meaning that vsubj and vobj participate in an event with, respectively, an agentive and a passive role, then Legalo also generates a more general property based on the event

17 According to a common Linked Data convention, using the CamelCase notation for OWL object properties makes the first term of the ID start with lower case, e.g. "invent programming language" -> inventProgrammingLanguage.

18 http://www.wibitaxonomy.org/

type, and produces an rdfs:subPropertyOf axiom. Recall that in these cases the rule requires generating a specialised property label by appending the WiBi type of vobj to the label of the event type. Example 3.4 shows one of these cases.

All properties produced by Legalo are derived from a formal representation G of the sentence s, meaning that G provides their formal scope. Based on this principle, Legalo produces an additional set of triples, which formalise the generated properties with reference to G. As stated by GR 1 and 2, there are two main types of paths from which the properties can derive. In the first case, the path connecting vsubj and vobj does not include any event node. In this case, Legalo produces an OWL property chain axiom stating that the generated property is implied by the chain of properties participating in the path, where each property of the path is formalised with domain and range axioms according to the locality of G. The same applies to the case of a path that includes an event node: Legalo likewise produces a property chain axiom. However, in this case the path has two different directions in G. For this type of path we define the concepts of left branch path, i.e. the one connecting the event node with vsubj, and right branch path, i.e. the one connecting the event node with vobj. For example, in Figure 4 the path P connecting fred:John_Mccarthy with fred:Lisp includes an event, i.e. fred:invent_1; hence P is a tree whose root is this event node. The left branch path of P is the one connecting fred:invent_1 with fred:John_Mccarthy, while the right branch path of P is the one connecting fred:invent_1 with fred:Lisp. In order to define a property chain axiom, Legalo needs to define the inverses of all properties in the left branch of P. However, these branch paths may contain properties defined by VerbNet, i.e. thematic roles, which are independent of the event they are associated with in the scope of G, i.e. they are general domain properties. In other words, these properties do not carry any information about the event included in the path, which is relevant as far as the formal semantics of the generated property is concerned. Legalo tackles this issue by defining a local thematic role property for each VerbNet role participating in the event included in the path. For example, let us consider the property legalo:inventProgrammingLanguage in Figure 5. Its reference path includes the two (thematic role) properties vn.role:Agent and vn.role:Product. Legalo generates two new properties, legalo:AgentInvent and legalo:ProductInvent, defined as sub-properties of vn.role:Agent and vn.role:Product, respectively.


Fig. 6.: The grounding vocabulary used for annotating the generated triples and properties with information about their linguistic and formal representation scope.

Given these two new properties, the axioms produced for formalising the generated property legalo:inventProgrammingLanguage are given in Example 3.5.

Example 3.5. (Property chain axiom when the connecting path includes an event.)

legalo:inventProgrammingLanguage
    a owl:ObjectProperty ;
    owl:propertyChainAxiom ( [ owl:inverseOf legalo:AgentInvent ] legalo:ProductInvent ) .

legalo:AgentInvent rdfs:subPropertyOf vn.role:Agent .
legalo:ProductInvent rdfs:subPropertyOf vn.role:Product .

Scope annotations. Finally, Legalo annotates all generated properties and triples with information related to the linguistic and formal representation scopes from which they were derived. To this aim a specific OWL ontology has been defined, named grounding19, depicted in Figure 6. This ontology reuses Earmark20, a vocabulary for annotating textual content, and semiotics21, a content ontology pattern that encodes a basic semiotic theory. Earmark defines the class earmark:Docuverse, which represents any container of strings that may appear in a document. In the context of Legalo this class can be used for representing the sentence s. The semiotics content pattern defines three main classes: Expression, Meaning,

19 The vocabulary can be downloaded from http://ontologydesignpatterns.org/cp/owl/grounding.owl
20 http://www.essepuntato.it/2008/12/earmark
21 http://www.ontologydesignpatterns.org/cp/owl/semiotics.owl, prefix semio:

Reference (the semiotic triangle). The class Expression is also reused for representing the sentence s. As for the annotation of the linguistic scope of an RDF triple, the grounding vocabulary defines the more specific concept of "linguistic evidence". In fact, according to the axioms defined in Section 3.2 and the generative rules defined in Section 3.3, the sentence s provides evidence of the relation ϕs(esubj, eobj), which is formalised by an RDF triple (vsubj, pλ, vobj). The concept of "linguistic evidence" is represented by the class LinguisticEvidence, which specialises both earmark:Docuverse and semiotics:Expression. The OWL property that relates an RDF triple generated by Legalo and its linguistic evidence is hasLinguisticEvidence.

Additionally, the class FrameBasedFormalModel is defined for representing the concept of a frame-based formal representation of a textual sentence, described in detail in Section 3.1. This class is instantiated by the graph G representing s, which provides the formal scope for all generated properties and triples. The property derivedFromFormalRepresentation of the grounding ontology connects a Legalo-generated property, as well as an RDF triple, with the graph G from which they were derived. As an example, let us consider the sentence represented by the graph in Figure 4 and the generated RDF triple of the property legalo:inventProgrammingLanguage depicted in Figure 5. The scope annotations shown in Example 3.6 are generated.

Example 3.6. (Scope annotations of a generated property)

legalo:sentence
    a grounding:LinguisticEvidence ;
    earmark:hasContent "The New York Times reported the death of McCarthy. He invented LISP." .


[] a owl:Axiom ;
    grounding:hasLinguisticEvidence legalo:sentence ;
    owl:annotatedProperty legalo:inventProgrammingLanguage ;
    owl:annotatedSource dbpedia:John_McCarthy_(computer_scientist) ;
    owl:annotatedTarget dbpedia:Lisp_(programming_language) .

legalo:inventProgrammingLanguage
    a owl:ObjectProperty ;
    grounding:derivedFromFormalRepresentation krgraph:52f88ca22 ;
    grounding:definedFromLinguisticEvidence legalo:sentence .

The first two axioms simply create an individual of type LinguisticEvidence for representing the sentence. The second group of axioms annotates the RDF triple for "John McCarthy invented Lisp" with its linguistic evidence. Finally, the legalo:inventProgrammingLanguage property is annotated with its linguistic as well as its formal scope.

3.5. Alignment to Semantic Web vocabularies

This step has the goal of aligning the generated properties to existing Semantic Web ones. The idea is to maximise reuse and linking of extracted knowledge to existing Linked Data. Legalo implements a simple string matching technique based on the Levenshtein distance measure to address this task. The implementation of more sophisticated approaches for aligning generated properties to existing vocabularies is part of future work. Relevant related work includes ontology matching techniques such as [13] (cf. the Ontology Alignment Evaluation Initiative22). A possible strategy is to apply state-of-the-art ontology matching techniques exploiting the information and features provided by the formalisation step (cf. Section 3.4). Legalo uses three semantic resources for identifying possible targets for property alignment:

– Watson23 [9] is a service that provides access to Semantic Web knowledge, in particular ontologies;

22 http://oaei.ontologymatching.org/
23 http://watson.kmi.open.ac.uk/WatsonWUI/

– Linked Open Vocabularies (LOV)24 is an aggregator of Linked Open vocabularies (including DBpedia), and provides services for accessing their data;

– Never-Ending Language Learning (NELL)25 [5] is a machine learning system that extracts structured data from unstructured Web pages and stores it in a knowledge base. It has been running continuously since 2010. From the learnt facts, the NELL team has derived an ontology of categories and properties, which currently includes 548 properties26.

In principle, other resources could be added and selected; we chose these three because they allow us both to cover most public Linked Data vocabularies (i.e. LOV and Watson) and to test against automatically generated resources (i.e. NELL).
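To make the matching step concrete, here is a minimal sketch of Levenshtein-based candidate filtering. The normalisation of the edit distance into a [0, 1] similarity score and all identifiers are illustrative assumptions; the paper only commits to a Levenshtein-based score, with the 0.70 acceptance threshold reported in Section 5.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def aligned(p_new: str, candidates: list[str], d_min: float = 0.70) -> list[str]:
    """Keep candidate property IDs whose normalised similarity exceeds d_min."""
    return [c for c in candidates
            if 1 - levenshtein(p_new, c) / max(len(p_new), len(c)) > d_min]

print(aligned("inventProgrammingLanguage",
              ["inventedProgrammingLanguage", "hasInventor"]))
# -> ['inventedProgrammingLanguage']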

4. Legalo pipeline and components

Legalo is based on a pipeline of components and data sources, executed in the sequence illustrated in Figure 7.

1. FRED: Semantic Web machine reader. The core component of the system is FRED [39], a Semantic Web machine reader able to produce an RDF/OWL frame-based representation of a text. It integrates the output of several NLP tools, and enriches and transforms it by reusing Linguistic Frames [32], Ontology Design Patterns [20], open data, and various vocabularies. FRED detects events, roles, and n-ary relations, and represents them in a RDF/OWL graph. It also represents variable discourse referents, such as the variable in the first-order predication Cat(x) extracted from the sentence “The cat is on the mat”; these are formalised as reified individuals, e.g. cat_1. As far as Legalo is concerned, the most used features of FRED are the frame-based graph representation based on VerbNet verbs and thematic roles, the Named Entity Recognition (NER) and Resolution component, i.e. TAGME [14], and the annotation of text fragments, based on the Earmark vocabulary and annotation method [35].

All figures depicted in Section 2 show examples of FRED outputs: the reader may want to consider Figure 3, which shows the RDF/OWL graph for the sentence “In February 2009 Evile began the pre-production process for their second album with Russ Russell” as a representative output of FRED.

24 http://lov.okfn.org/dataset/lov/
25 http://rtw.ml.cmu.edu/rtw/
26 http://nell-ld.telecom-st-etienne.fr/

Fig. 7: Pipeline implemented by Legalo for generating Semantic Web properties for the semantic annotation of hyperlinks based on their linguistic trace, i.e. the natural language sentence including the hyperlink. Numbers indicate the order of execution of a component in the pipeline. Edges indicate input/output flows. (*) denotes tools developed in this work, which are part of this paper's contribution.

2. Entity pair selection. This component is in charge of detecting the resolved entities and associating them with their lexical surface in s. This is done by querying FRED text span annotations. Another task of this component is, for each pair of detected entities (vsubj, vobj), to assess the existence of ϕs between them. In other words, this component checks the existence of paths between vsubj and vobj (cf. Axiom 1), selects the shortest one, and verifies if there are event nodes in the selected path. If so, it verifies if vsubj participates in the event occurrence with an agentive role (cf. Axiom 3). All selected pairs and associated paths are passed to the next component.
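A minimal sketch of this check follows, under the assumption that the FRED graph has been loaded into a networkx graph whose nodes carry a "kind" attribute ("event" or "entity") and whose edges carry the connecting label; these names, and the fixed set of agentive roles, are illustrative assumptions rather than Legalo's actual data model.

import networkx as nx

AGENTIVE_ROLES = {"vn.role:Agent", "vn.role:Actor", "vn.role:Experiencer"}

def select_pair(graph: nx.Graph, v_subj: str, v_obj: str):
    # Check the existence of a path between v_subj and v_obj (cf. Axiom 1)
    try:
        path = nx.shortest_path(graph, v_subj, v_obj)
    except nx.NetworkXNoPath:
        return None  # no path: no evidence of a relation
    # If the path traverses event nodes, v_subj must participate in the
    # event occurrence with an agentive role (cf. Axiom 3)
    for node in path:
        if graph.nodes[node].get("kind") == "event":
            edge = graph.get_edge_data(node, v_subj, default={})
            if edge.get("label") not in AGENTIVE_ROLES:
                return None
    return path  # the selected path is passed on to the RDF/OWL writer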

3. RDF/OWL writer. This component is in charge of generating a predicate for each pair of entities received in input from the previous component, by applying the generative rules described in Section 3.3 to its associated path. In addition, this component implements two more modules: the “Property matcher” and the “Formaliser”.

The “Property matcher” is in charge of finding alignments between the generated predicate and existing Semantic Web vocabularies. As described in Section 3.5, three main sources are used for retrieving semantic property candidates. For assessing their similarity with the generated predicate, a string matching algorithm was implemented, which computes the Levenshtein distance [31] between the IDs of the two predicates. Of course, this component is not intended to be a contribution to advance the state of the art in ontology matching; its goal is to contribute to a complete implementation of OKE and to provide a possible baseline for comparing results with future improved versions.

Finally, the RDF/OWL writer includes the component “Formaliser”. This component implements the formalisation step of the method (cf. Section 3.4). It is in charge of producing the triples summarising the relation expressed in s, which can be used for annotating the corresponding hyperlink; of generating OWL axioms defining the domain and range of the generated predicates; and, finally, of annotating the produced triples and predicates with scope information.

Legalo for typing Wikipedia pagelinks. A specialised version of Legalo for typing Wikipedia pagelinks (Legalo-Wikipedia)27 was presented in [38]; however, it relied on a previous version of the tool. Legalo-Wikipedia depends on Legalo, hence it evolves with it, and specialises it with two additional features: (i) a sentence extractor specialised for Wikipedia HTML formatting, and (ii) a subject resolver specialised for Wikipedia. A detailed description of this implementation can be found in [38].

Briefly, Legalo-Wikipedia takes as input a DBpedia entity URI and retrieves all its pagelink triples from the DBpedia Pagelinks dataset. For each pagelink triple it extracts all Wikipedia snippets containing a hyperlink corresponding to the triple, by means of a specialised sentence extractor. Then, the subject resolver selects all and only the snippets that contain a lexicalisation of the Wikipedia page subject, relying on the DBpedia Lexicalizations Dataset28.

For example, the wikipage wp:Ron_Cobb includes a link to wp:Sydney in the sentence:

“In 1972, Cobb moved to Sydney, Australia, where his work appeared in alternative magazines such as The Digger.”

This sentence will be selected and stored as it contains the term “Cobb”, which is a lexicalization of dbpedia:Ron_Cobb. The same wikipage includes a link to wp:Los_Angeles_Free_Press in the sentence:

“Edited and published by Art Kunkin, the Los Angeles Free Press was one of the first of the underground newspapers of the 1960s, noted for its radical politics.”

This sentence will be discarded as it does not include any lexicalisation of dbpedia:Ron_Cobb. This procedure is needed for identifying pagelinks that actually convey a semantic factual relation between the Wikipedia page subject and the target of the pagelink. Each snippet is then passed to Legalo as input for generating the Semantic Web property. The version of Legalo-Wikipedia presented in [38] relied on a previous version of Legalo, which supported less general generative rules and did not perform the relevant relation assessment or the formalisation of the generated property.

27 A demo is available at http://wit.istc.cnr.it/stlab-tools/legalo/wikipedia
28 http://wiki.dbpedia.org/Datasets/NLP?v=yqj
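A minimal sketch of this filtering step, assuming the lexicalisations of the page subject have already been fetched from the DBpedia Lexicalizations Dataset into a set of strings (the helper name is illustrative):

def select_snippets(snippets: list[str], lexicalisations: set[str]) -> list[str]:
    """Keep only the snippets that explicitly mention the page subject."""
    return [
        s for s in snippets
        if any(lex.lower() in s.lower() for lex in lexicalisations)
    ]

snippets = [
    "In 1972, Cobb moved to Sydney, Australia, where his work appeared "
    "in alternative magazines such as The Digger.",
    "Edited and published by Art Kunkin, the Los Angeles Free Press was "
    "one of the first of the underground newspapers of the 1960s.",
]
print(select_snippets(snippets, {"Ron Cobb", "Cobb"}))  # keeps the first only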

5. Results and evaluation

Legalo-Wikipedia has been previously evaluated. For the sake of completeness, these results are summarised in Section 5.2 (for additional details, the reader can refer to [38]). With the help of crowdsourcing, an additional, more extensive evaluation of the current implementation of Legalo was performed, which allowed us to better assess its performance and open issues. This section reports the evaluation results in terms of precision, recall, and accuracy.

5.1. Legalo working hypothesis

Legalo is based on two working hypotheses.

Hypothesis 1 (Relevant relation assessment). Legalo is able to assess if, given a sentence s, a relevant relation exists which holds between two entities, according to the content of s:

∃ϕ . ϕs(esubj, eobj)

This means that if s contains evidence of a relevant relation between esubj and eobj, then Legalo returns a true value; otherwise it returns false.

Hypothesis 2 (Usable predicate generation). Legalo is able to generate a usable predicate λ′ for a relevant relation ϕs between two entities, expressed in a sentence s: given λ′, a label generated by Legalo for ϕs, and λi, a label generated by a human for ϕs, the following holds (cf. Definition 1):

λ′ ∼= λi, λi ∈ Λ

which means that the label λ′ generated by Legalo is equal or very similar to a label λi that a human would define in a Linked Data vocabulary for representing ϕs in a particular textual occurrence.

This section reports the evaluation of Legalo based on the validation of Hypothesis 1 and Hypothesis 2.

Evaluation sample. As evaluation data, a corpus Crel−extraction for relation extraction developed at Google Research29 was used. There are five datasets available in this corpus, and each dataset is dedicated to a specific relation: place of birth, attending or graduating from an institution, place of death, date of death, degree of education. Each dataset includes a snippet from Wikipedia, a pair (subject, object) of Freebase entities, and at least five user judgements that indicate if the snippet contains a sentence providing evidence of a referenced relation (e.g., place of death) between the given pair of entities. It is important to remark that the Wikipedia snippets included in the corpus contain more than one sentence, which can be evidence of relations other than the ones for which they were evaluated. Based on this observation, the corpus has also been used for evaluating Legalo on its ability to assess the existence of open-domain relations.

It has to be noticed that Legalo addresses all the capabilities of an OKE system (cf. Section 2); however, by using Crel−extraction for its evaluation, and considering the homogeneous writing style of Wikipedia authors, additional experiments are needed to properly assess Legalo's scalability on the diversity of the Web (e.g. blogs, Twitter, etc.). In other words, Legalo can be used with any input text, but the different styles of diverse Web sources could affect its performance. We leave the investigation of possible bias caused by different writing styles to future development.

The evaluation was performed using a subset of Crel−extraction. More specifically, three evaluation datasets were derived from Crel−extraction and used for performing different experimental tasks.

– Cinstitution: a sample of 130 randomly selected snippets extracted from the file of Crel−extraction dedicated to evidence of relations expressing “attending or graduating from an institution”. Legalo was executed on all 130 snippets, including in its input the pair of Freebase entities associated with the snippet in Cinstitution. For each snippet, Legalo gave an output: either one or more predicates, or “no relation evidence” (i.e. a false value);

29 https://code.google.com/p/relation-extraction-corpus/downloads/list

– Ceducation: a sample of 130 randomly selected snippets extracted from the file of Crel−extraction dedicated to evidence of relations expressing “obtaining a degree of education”. Legalo was executed on all 130 snippets, including in its input the pair of Freebase entities associated with the snippet in Ceducation. For each snippet, Legalo always gave an output: either one or more predicates, or “no relation evidence”;

– Cgeneral: a sample of 60 randomly selected snippets extracted from Crel−extraction, 15 snippets from each file (excluding “date of death”, as Legalo only deals with object properties for the moment). The snippets were broken into single sentences and pre-processed with TAGME [14] in order to enrich them with hyperlinks referring to Wikipedia pages (i.e. DBpedia entities): 186 sentences with at least two recognised DBpedia entities were derived. In total, Legalo produced 867 outputs, of which 262 predicates and 605 “no relation evidence”. Notice that the high number of false values is not surprising, as a single sentence may contain many entities, and Legalo had to assess the existence of ϕ on all possible combinations of pairs (a small illustration follows this list).
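As a small illustration of why a single sentence yields several assessments, Legalo must test ϕ on every pair of recognised entities; the assumption that pairs are ordered (since ϕs distinguishes subject from object) and the entity names are illustrative:

from itertools import permutations

entities = ["dbpedia:John_McCarthy", "dbpedia:Lisp", "dbpedia:Stanford"]
pairs = list(permutations(entities, 2))  # ordered (subject, object) pairs
print(len(pairs))  # -> 6 assessments for a single three-entity sentence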

The resulting triples, predicate formalisations, and scope annotations are accessible via a Virtuoso SPARQL endpoint30.

There are several works demonstrating that crowdsourcing can be successfully used for building and evaluating semantic resources [17,47,34]. Following these experiences, Legalo was evaluated with the help of crowdsourcing. Five different crowdsourced tasks were defined:

1. assessing if a sentence s provides evidence for the referenced relation (i.e. either “institution” or “education”) between two given entities esubj and eobj mentioned in s - based on data from Cinstitution and Ceducation, respectively;

2. assessing if a sentence s provides evidence for any relation between two given entities esubj and eobj mentioned in s - based on data from Cgeneral;

30 Legalo results can be inspected at http://wit.istc.cnr.it:8894/sparql. The reader can submit a pre-defined default query for retrieving an overview of the dataset.


3. judging if a predicate λ′ generated by a machine adequately expresses (i.e. it is a good summarisation of) a specific relation (i.e. either “institution” or “education”) between two given entities esubj and eobj mentioned in s, according to the content of s - based on data from Cinstitution and Ceducation, respectively;

4. judging if a predicate λ′ generated by a machine adequately expresses (i.e. is a good summarisation of) any relation expressed by the content of s, between two given entities esubj and eobj mentioned in s - based on data from Cgeneral;

5. creating a phrase λ that summarises the relation expressed by the content of s, between two given entities esubj and eobj mentioned in s - based on data from Cgeneral.

Tasks 1 and 2 were used for validating Hypothesis 1. The results of these two tasks were then combined with those from Tasks 3 and 4 for validating Hypothesis 2. Finally, Task 5 was used for comparing the similarity between λ values generated by humans and λ′ values generated by Legalo, validating Hypothesis 2 from a different perspective.

It is important to remark that Task 1 duplicates information already available in Crel−extraction: this choice was driven by the need to use smaller datasets (samples of Crel−extraction), as the Legalo evaluation experiments needed to address different evaluation tasks. From an analysis of Crel−extraction it was noticed that some judgements were incorrect, which can be irrelevant on large numbers but can bias the results on smaller sets. Hence, the corpus samples were re-evaluated on the evidence task, in order to ensure a high reliability of the judgements. Moreover, our evaluation also focused on open-domain relations, hence addressing a larger number of relations than the ones originally judged in the corpus.

The CrowdFlower platform31 was used for conducting the crowdsourcing experiments. All tasks included a set of “gold questions” used for computing a trust score t for each worker. Workers had first to perform their job on 7 test questions, and only those reaching t > 0.7 were allowed to continue. The value range of t is [0, 1]; the higher the score, the more reliable the worker. Given the strongly subjective nature of Task 5, for this task only a lower trust score t > 0.6 was considered acceptable. Each run of a job for a worker contained 4 questions, and workers were free to stop contributing at any time. Each question was performed by at least three workers, in order to allow the computation of inter-rater agreement. More precisely, Table 2 shows how many different workers performed each task, also indicating the hypothesis associated with the task. Besides the initial test questions, in order to keep monitoring workers' reliability, each job contained one test question. Results from test questions were excluded from the computation of performance measures (i.e., precision, recall, accuracy, agreement).

31 http://www.crowdflower.com/

For Tasks 1 and 2, judgements were expressed as “yes” or “no” answers. For Tasks 3 and 4, judgements could be assessed on a scale of three values: Agree (corresponding to a value of 1 when computing relevance measures), Partly Agree (corresponding to a value of 0.5), and Disagree (corresponding to a value of 0). Task 5 was completely open. The confidence measure is provided by CrowdFlower; it measures the inter-rater agreement between workers weighted by their trust values, hence indicating both agreement and quality of judgements at the same time. It is computed as described in Definition 5,32 and an example is given in Example 5.1:

Definition 5. (Confidence score)
Given a task unit u, a set of possible judgements {ji}, with i = 1, ..., n, and a set of trust scores {tk}, with k = 1, ..., m, each representing a rater, let tsum = ∑_{k=1}^{m} tk be the sum of the trust scores of the raters giving judgements on u, and trust(ji) the sum of the tk values of the raters that chose judgement ji. The confidence score confidence(ji, u) for judgement ji on the task unit u is computed as follows:

confidence(ji, u) = trust(ji) / tsum

Example 5.1. (Confidence score for evidence judgement)
Table 3 shows the judgements of three raters on the same task unit, where possible judgements are “yes” and “no”.

tsum = 0.95 + 0.89 + 0.98 = 2.82
confidence(“yes”, 582275117) = (0.95 + 0.98) / 2.82 = 0.68
confidence(“no”, 582275117) = 0.89 / 2.82 = 0.31

When aggregating results for a task unit, the judgement with the highest confidence score is selected. Notice that confidence(ji, u) = 1 when all raters give the same judgement.
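A minimal sketch of Definition 5, assuming each rater's contribution arrives as a (judgement, trust) pair for a single task unit:

from collections import defaultdict

def confidence(judgements: list[tuple[str, float]]) -> dict[str, float]:
    """Return confidence(j, u) for every judgement j on one task unit u."""
    t_sum = sum(t for _, t in judgements)   # total trust mass on the unit
    trust = defaultdict(float)
    for j, t in judgements:
        trust[j] += t                       # trust(j_i)
    return {j: t / t_sum for j, t in trust.items()}

# Reproduces Example 5.1: {'yes': 0.684..., 'no': 0.315...}
print(confidence([("yes", 0.95), ("no", 0.89), ("yes", 0.98)]))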

32 http://success.crowdflower.com/customer/portal/articles/1295977


Hypothesis     Task                   #workers
Hypothesis 1   Tasks 1, 2             35
Hypothesis 2   Task 3 (institution)   10
Hypothesis 2   Task 3 (education)     18
Hypothesis 2   Task 4                 19
Hypothesis 2   Task 5                 12

Table 2: Number of different workers that performed the crowdsourced tasks.

Task unit   Judgement   t
582275117   yes         0.95
582275117   no          0.89
582275117   yes         0.98

Table 3: Example of confidence score computation for a task unit.

Task   Relation      Precision   Recall   F-measure   Accuracy   Confidence
2      Any           0.83        0.92     0.87        0.82       0.82
1      Education     0.95        0.91     0.93        0.87       0.96
1      Institution   0.93        0.90     0.91        0.84       0.94

Table 4: Results of Legalo performance in assessing the evidence of relations between entity pairs in a given sentence s. Performance measures are computed on the judgements collected in Tasks 1 and 2 based on data from Cinstitution, Ceducation, and Cgeneral.

Evaluation of Hypothesis 1 Table 4 shows the results of the evaluation of Hypothesis 1, i.e. Legalo's ability to assess if a sentence s provides evidence of a relation ϕs between two entities (esubj, eobj). Task 1 was designed for evaluating this capability on specific relations, while Task 2 was designed for evaluating it on any relation. Each row shows the performance results for a specific run of the task, indicating the type of relation tackled and the crowdsourced task.

Legalo's performance is measured by means of standard metrics: precision, recall, F-measure, and accuracy. With the aim of clarifying how to interpret them, we briefly report an informal definition of true/false positive and true/false negative in the context of Tasks 1 to 4. As for Tasks 1-2, given a sentence s, the crowd would say “yes” if a relevant relation exists between a given subject/object pair, and “no” if it does not. Legalo's output means “true” (the relation exists) whenever it produces a relation, and “false” (the relation does not exist) whenever it does not. Hence: True positive = the number of (true, yes) pairs; False positive = the number of (true, no) pairs; True negative = the number of (false, no) pairs; False negative = the number of (false, yes) pairs.
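A minimal sketch of how the Task 1-2 measures follow from this pairing; the encoding of Legalo's output as a boolean and of the crowd judgement as "yes"/"no" is our illustrative choice:

def metrics(pairs: list[tuple[bool, str]]) -> dict[str, float]:
    """pairs: (Legalo produced a relation?, crowd judgement 'yes'/'no').
    Assumes at least one positive output and one 'yes' judgement."""
    tp = sum(1 for out, j in pairs if out and j == "yes")
    fp = sum(1 for out, j in pairs if out and j == "no")
    tn = sum(1 for out, j in pairs if not out and j == "no")
    fn = sum(1 for out, j in pairs if not out and j == "yes")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / len(pairs),
    }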

The results of the crowdsourced tasks demonstrate that the Legalo method has high performance (average F-measure = 0.92) on the assessment of the existence of ϕs(vsubj, vobj) (cf. Hypothesis 1). These results are particularly satisfactory when compared with the performance of Legalo-Wikipedia [38], where this aspect was not tackled and ϕs existence was partly ensured by the nature of the input data (cf. also Section 5.2).

Evaluation of Hypothesis 2 Table 5 shows the results of the evaluation of Hypothesis 2, i.e. Legalo's ability to generate usable predicates for summarising relations between entities, according to the content of a sentence. Task 3 was designed for evaluating this capability on specific properties, while Task 4 was designed for evaluating it on any property. Each row shows the performance results, indicating the type of relation tackled and the crowdsourced task. The results for the “institution” relation and for “any” relation are computed both on the overall set of results and on a subset that ensures a higher confidence rate (i.e., only results with confidence(ji, u) > 0.65 are included). As far as the evaluation of the “institution” relation is concerned, the subset of results with high confidence is 68% of the whole evaluation dataset, while for “any” relation it is 76%.

For these tasks, positive values (i.e. when Legalo generates a relation, i.e. “true”) can be judged by the crowd with “agree”, “partly agree” and “disagree”. Let A be the number of “agree”, PA the number of “partly agree” and D the number of “disagree”. As for the negative values, the definition is the same as for Tasks 1-2, and we reuse their results, as they are on the same datasets. Hence, we compute: True positive = A + 0.5 · PA; False positive = D; True negative = the number of (false, no) pairs; False negative = the number of (false, yes) pairs.
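A minimal sketch of these graded counts, reusing the negative counts (tn, fn) obtained from Tasks 1-2 as described above; the function name is illustrative:

def graded_counts(judgements: list[str], tn: int, fn: int) -> dict[str, float]:
    """judgements: crowd labels for the Legalo-generated predicates."""
    a = judgements.count("agree")
    pa = judgements.count("partly agree")
    d = judgements.count("disagree")
    tp = a + 0.5 * pa        # partial agreement counts half
    return {"tp": tp, "fp": d, "tn": tn, "fn": fn}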

Finally, Hypothesis 2 was also evaluated by computing a similarity score between human-created predicates and Legalo-generated ones. Task 5 was performed for collecting at least three labels λi for each triple (s, esubj, eobj). As paraphrasing is a highly subjective task, we expected a very low confidence value. Surprisingly, the average confidence value on this task was not that low (0.59). We compared the Legalo predicate λ′ for a triple (s, esubj, eobj) with all λi created by the users for that triple. Two different similarity measures were computed: a string similarity score based on the Jaccard distance measure33, and a semantic similarity measure based on the SimLibrary framework [36]34. The latter is a semantic similarity score that extends string similarity with measures exploiting external semantic resources such as WordNet, MeSH or the Gene Ontology. The average Jaccard similarity score between Legalo labels and human ones is 0.63, while the SimLibrary score is 0.80 (the value interval of both scores is [0, 1]; the higher the score, the more similar the two phrases). Before computing the similarity, a pre-processing step was performed with the aim of transforming all verbs to their base form and removing all auxiliary verbs from human predicates. The Stanford CoreNLP framework35 was used to compute the lemma and POS tag of each term in the phrase. This lemmatisation step was necessary in order to ensure a fair comparison of labels based on string similarity, as currently Legalo uses only base verb forms.

33 Given two strings s1 and s2, where c1 and c2 are the two character sets of s1 and s2, the Jaccard similarity is defined as Jsim(s1, s2) = Jsim(c1, c2) = |c1 ∩ c2| / |c1 ∪ c2|.
34 http://simlibrary.wordpress.com/
35 http://nlp.stanford.edu/software/corenlp.shtml
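For reference, a minimal sketch of the character-set Jaccard similarity of footnote 33; the lemmatisation pre-processing is assumed to have already been applied, and the example labels are illustrative:

def jaccard(s1: str, s2: str) -> float:
    c1, c2 = set(s1), set(s2)            # character sets of the two labels
    return len(c1 & c2) / len(c1 | c2)   # |c1 ∩ c2| / |c1 ∪ c2|

print(jaccard("invent programming language", "invent language"))  # ~0.71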

Also for Hypothesis 2, Legalo shows very satisfactory performance. An impressive result is the high average value of the semantic similarity score (0.80) between user-created predicates and Legalo-generated ones. This result confirms the hypothesis discussed in [38] that the Legalo design strategy is good at producing predicates that are very close to what a human would do when creating a Linked Data vocabulary. In the context of this work, this hypothesis can be extended to the capability to summarise such relations in a way very close to what a generic user would do. This result is very promising from the perspective of evolving Legalo into a summarisation tool, which is one of the envisioned directions of research.

However, by inspecting the different relevance measures, it emerges that while recall is very high on all tasks (0.90 on average), average accuracy is 0.73 and average precision is 0.75. Although these are very satisfactory performances, it is worth identifying the cases that cause the generation of less usable or even bad results. An insight is that lower precision and accuracy are registered especially in the generation of predicates for “institution” relations (accuracy 0.62, precision 0.65) and for “any” relations (accuracy 0.71, precision 0.68), while for “education” relations these measures show significantly higher values (accuracy 0.85, precision 0.92). This turns out to be an important lesson learnt. In fact, the less satisfactory precision seems due to the fact that many “institution” relations between two entities (X, Z) are described in the form “X received his Y from institution Z” (or similar), i.e. a ternary relation, which in a frame-based representation G corresponds to something like:

:receive_1 vn.role:Agent :X ;
           vn.role:Theme :Y ;
           vn.role:Source :Z .

Currently, based on this representation, Legalo would generate a predicate by following the path connecting X to Z, hence without considering the information on Y. The resulting predicate in this case would be “receive from”, while a more informative and usable one would clearly be, e.g., “receive degree from”, assuming that the type of Y is degree. The term degree is an example of a possible type for Y; however, whatever the type of Y, including it in the predicate would make the predicate much more informative and usable. This case can be easily generalised by exploiting the semantic information about the thematic role that Y plays in participating in the event receive_1.


Task                       Relation      Precision   Recall   F-measure   Accuracy   Confidence
3                          Education     0.92        0.91     0.91        0.85       0.80
3                          Institution   0.65        0.91     0.76        0.62       0.59
3 (high confidence only)   Institution   0.74        0.89     0.81        0.68       0.71
4                          Any           0.68        0.90     0.78        0.71       0.64
4 (high confidence only)   Any           0.73        0.87     0.80        0.75       0.76

Table 5: Results of Legalo performance in producing a usable label for relations between entity pairs in a given sentence. Performance measures are computed on the judgements collected in Tasks 3 and 4 based on data from Cinstitution, Ceducation, and Cgeneral.

In fact, a representation pattern can be recognised here: when participating in the event receive_1, X plays an agentive role (as expected from Axiom 3), Y plays a passive role, and Z plays an oblique role. The type of an entity playing a passive role, i.e. Y in this case, is relevant information as far as the relation between an entity playing an agentive role and another playing an oblique role in an event is concerned. This pattern can be generalised to relations other than institution, which explains the similar behaviour of Legalo in the two tasks focusing on assessing the usability of predicates for “institution” and “any” relations. Another example that shows this pattern is given by the sentence

“Hassan Husseini became an organizer for the Communist Party.”

taken from the dataset Cgeneral. In this case, the representation is the following:

:become_1 vn.role:Agent :Hassan_Husseini ;
          vn.role:Patient :Organizer ;
          :for :Communist_Party .

and Legalo would produce the predicate “become for”. By applying the newly suggested generative rule, the generated predicate would instead be the more informative and usable “become organizer for” (a sketch of this refined rule follows). This type of observation leads to the definition of additional generative rules that refine Legalo towards a highly probable improvement in precision and accuracy. New rules are implemented based on the data collected from the evaluation results, hence the Legalo demo is constantly evolving.
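A minimal sketch of the refined generative rule just described, under the assumption that the event's thematic roles and the types of their fillers are available (role names follow VerbNet; everything else is illustrative, not Legalo's actual implementation):

PASSIVE_ROLES = ("Theme", "Patient")

def refined_predicate(verb: str, roles: dict[str, str],
                      oblique_label: str) -> str:
    """Include the type of the passive-role filler in the predicate."""
    for role in PASSIVE_ROLES:
        if role in roles:
            return f"{verb} {roles[role]} {oblique_label}"
    return f"{verb} {oblique_label}"  # fall back to the original rule

print(refined_predicate("receive", {"Theme": "degree"}, "from"))
# -> "receive degree from"
print(refined_predicate("become", {"Patient": "organizer"}, "for"))
# -> "become organizer for"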

Evaluating the alignment with existing Semantic Web vocabularies. The matching process performed against LOV, NELL [5], and Watson [9] returned a number of proposed alignments between predicates generated by Legalo and existing properties in Linked Data vocabularies. In order to accept an alignment and include it in the formalisation of a Legalo property pnew, a threshold dmin = 0.70 on the computed similarity score (i.e., a normalised, difference-percentage-based Levenshtein distance36) was set, i.e. only alignments between properties having d > 0.70 were kept for the evaluation. All alignments satisfying this requirement were included in the formalisation of the properties generated during this study37.

The alignment procedure was executed on 629 Legalo properties pnew. For 250 pnew, it produced at least one alignment to a Semantic Web property psw with d > 0.70. Three raters independently judged the resulting alignments on a scale of three values (Agree, Neutral, Disagree), based on the available metadata of psw, i.e. comments, labels, domain and range. Table 6 shows the results of the user-based evaluation of the alignments between pnew and psw. The three raters independently judged the proposed alignments very accurate (precision 0.84) with a high inter-rater agreement (Kendall's W 0.76). Although it was not possible to compute recall for this evaluation, the low percentage of proposed alignments (only 40%) and the simple method applied suggest that there is considerable room for improvement. This evaluation and the implemented method are to be considered a baseline for future work on this specific task.

5.2. Results and evaluation of Legalo applied toWikipedia pagelinks

A previous study [38] described the evaluation of Legalo-Wikipedia. In this section the results of that evaluation are reported for the sake of completeness. The main difference between Legalo and its Wikipedia-specialised version is that in the latter the subject of the predicate is always given, and there is a high probability that it is correct, based on the design principles that guide Wikipedia page writing.

36 http://bit.ly/1qd45AQ
37 All triples, property formalisations, and alignments can be retrieved at http://wit.istc.cnr.it:8894/sparql.


# pnew with at least one psw   Total # of (pnew, psw)   Levenshtein threshold   Precision   Kendall's W
250                            693                      0.7                     0.84        0.76

Table 6: Evaluation results on the accuracy of the alignment between pnew and psw.

It is worth remarking that the evaluation experiment of Legalo-Wikipedia was performed by Linked Data experts, hence comparing the new results with the previous ones provides insights on the usability of the generated predicates regardless of the expertise of the evaluators.

The evaluation results of Legalo-Wikipedia are published as RDF data and accessible through a SPARQL endpoint38.

The evaluated sample set consisted of 629 pairs (s, hyperlink), each associated with a FRED graph G. Legalo was executed on this corpus and generated 629 predicates (referred to as pnew from now on). The user-based evaluation involved three raters, computer science researchers familiar with Linked Data but not with Legalo. They independently judged the results of Legalo on two separate tasks, using a Likert scale of five values (Strongly Agree, Agree, Partly Agree, Disagree, Strongly Disagree). When computing performance measures, the scale was reduced to three values: Strongly Agree and Agree were associated with a value of 1, Partly Agree with 0.5, and Disagree and Strongly Disagree with 0.

The results of the user-based evaluation of pnew are reported in Table 7. The three raters independently judged that the generated predicates pnew were very well designed and accurate (F-measure 0.83) in capturing the semantics of their associated pagelinks according to the content of the sentence s, with a high inter-rater agreement (Kendall's W 0.73)39.

38 http://isotta.cs.unibo.it:9191/sparql
39 Kendall's W measures the inter-rater agreement. Values range from 0 (complete disagreement) to 1 (complete agreement).

6. Discussion

Dependency on entity linking An aspect that requires improvement is the potential dependency of Legalo's performance on the recognition and linking of DBpedia entities in a sentence: if an entity is not in DBpedia, the relation is not generated. Ideally, this is easily solvable by treating any recognised named entity in a sentence as a potential hyperlink, regardless of whether it has a URI (one can be created locally on the go). A development version of Legalo40 shows this capability; however, besides the need for rigorous experiments for assessing its performance, anecdotal tests show that in some cases this generalisation produces noise in the results. Identifying the causes and handling them is one of our current focuses.

Passive form and skolemised entities Identifying recurrent errors helps us identify new patterns for improving label generation. However, some recurrent mistakes are not easily treatable. One such case can be exemplified by the following sentence:

In March 2008, Evile's track was featured on the Wii, Xbox 360, and PlayStation 3 video game Rock Band as downloadable content.

Currently, Legalo cannot correctly handle this (type of) sentence. There are two main issues behind this limitation: (i) the sentence is expressed in passive mode, i.e. “was featured on” instead of “features”, and the use of the preposition “on” instead of “by” makes the agent “Rock Band” become an oblique role; hence, there is apparently no agentive role in this sentence, leaving Axiom 1 (cf. 3) unsatisfied, which causes Legalo to wrongly assess that there is no relevant relation between “Rock Band” and “Evile”; (ii) even if the passive mode were recognised and handled in order to satisfy Axiom 1, the target of the passive relation would be “Evile's track”, which is a variable, i.e. the entity representing it is skolemised. A way to handle this is to name skolemised entities when they show certain characteristics. For example, in this case there is a relation between a named entity and the variable, and such a relation is genitive, hence it has a specific recognisable characteristic. However, naming skolemised discourse referents should be done at the level of the FRED result, as this operation can be useful in many other applications. For example, it can also be relevant for aspect-based sentiment analysis41.

40 http://wit.istc.cnr.it/kore-dev/legalo

Number of pnew   Precision   Recall   F-measure   Kendall's W
629              0.72        0.97     0.83        0.73

Table 7: Evaluation results on the accuracy of pnew.

Open domain and any kind of text sources. The OKE method is meant to support knowledge extraction from text in the open domain. “Open domain” has a twofold interpretation, both valid in this context: (i) any knowledge area, meaning that the approach must be independent of the topics addressed by a text; in other words, it should not be tailored to specific languages, vocabularies or terminologies; (ii) any text style, considering that natural language on the Web can have many different writing styles (e.g., a text in a Wikipedia page is certainly cleaner than an average blog text, which in turn has a completely different style than Twitter writing). The implementation presented in this paper shows very promising results, as demonstrated by the performance measured after the execution of a set of crowdsourcing tasks. This evaluation was based on texts extracted from Wikipedia pages, focusing on both specific and general domains, hence showing that the tool works well with any knowledge area. Nevertheless, it remains important to investigate how a change of writing style impacts the tool's performance, in order to assess its behaviour when coping with any text source (going beyond the style of Wikipedia text). This investigation is a main action point in the next evolution of this work. It has to be noticed that Legalo's main tasks are relation assessment and label generation, while parsing and role labelling, which are at the base of the frame-based graph representation, are embedded in FRED. In other words, Legalo's performance highly depends on the ability of FRED to produce an accurate frame-based representation of the input sentence. This means that minimising the performance bias due to different writing styles requires intervening on FRED components (especially parsing and role labelling).

Alignment to existing Semantic Web properties. As for the alignment procedure, there is also room for significant improvement, since this task was addressed by computing a simple Levenshtein distance. More sophisticated alignment methods, such as those from the Ontology Alignment Evaluation Initiative42, or other approaches for entity linking such as SILK43 [22], can be investigated for enhancing the alignment results. An interesting result is that our alignment results are good in terms of precision, although all properties that were matched with a distance score > 0.70 came only from Watson [9] and LOV. We observed that almost all properties retrieved from NELL [5] had an editing distance < 0.70, hence almost none of them was judged appropriate. This reinforces the hypothesis that the OKE generative rules simulate very well the results of human property creation, i.e. property names are cognitively well designed. In fact, Watson and LOV are repositories of Semantic Web authored ontologies and vocabularies, while NELL properties result from an artificial concatenation of automatically learnt categories.

41 http://alt.qcri.org/semeval2015/task12/

As for the alignment recall, it was not possible to compute standard recall metrics, because it is impossible to compute false negative results, i.e., all existing Semantic Web properties that would match pnew but that we did not retrieve. The relatively high number of missing properties suggests, on the one hand, that a more sophisticated alignment method is needed. On the other hand, if we combine this result with the high accuracy of pnew and of the proposed alignments between pnew and psw, it is reasonable to hypothesise that many cases reveal a lack of intensional coverage in Semantic Web vocabularies, and that OKE can help fill this gap.

Comparison to Open Information Extraction Extracting, discovering, or summarising relations from text is not an easy task. Natural language is very subtle in providing forms that can express, allude to, or entail relations, and syntax offers complex solutions to relate explicitly named entities, anaphoras to mentioned or alluded entities, concepts, and entire phrases, let alone tacit knowledge. Table 8 shows some kinds of (formalisable) relations that can be derived from text.

42 http://oaei.ontologymatching.org/
43 http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/


sentence | argument#1 | binary relation | argument#2
Mr. Miller, 25, entered North Korea seven months ago. | Mr._Miller | enter | North_Korea
He was charged with unruly behavior. | Mr._Miller | charge_with | x:unruly_behavior
North Korean officials suspected he was trying to get inside one of the country's feared prison camps. | y:North_Korean_officials | suspect | try(he, (get_inside(he, z:Korea's_feared_prison_camp)))

Table 8: Sample sentences involving non-trivial relations, expressed in a generic logical form.

A full-fledged analysis of those texts is possible to a certain extent, especially if associated with background knowledge (as FRED does), but the conciseness and directness of hyperlinking based on binary relations is often lost. Hence the importance of tools like Legalo, which are able to reconstruct binary relations from complex machine reading graphs.

It would be natural to compare the results of Legalo to relation extraction systems, but this would require manipulating their output, which is beyond the scope of this work. An explanation of the difficulties involved follows.

A state-of-the-art tool like Open Information Extraction (OIE, [26]) applies an extractive approach to relation extraction, and solves the problem by extracting segments that can be assimilated to subjects, predicates, and objects of a triplet. As reported in [19], its accuracy was not very high with the version of OIE implemented as the ReVerb tool, but it has improved sensibly recently. However, the segments that are extracted, though useful, are not always intuitively reusable as formal RDF properties or individuals. Table 9 shows one case of a very complex segment #3, i.e. “with a West angry over Russia's actions in Ukraine”, which is a phrase to be further analysed in order to be formalised, typically leading to multiple triples; and another case of a complex segment #2, i.e. “developed a passion for the native flora of the arid West Darling region identifying”, which is not easily transformable into an RDF property.

The research presented here intends to go beyond text segmentation, by using an abstractive approach that selects paths in RDF graphs in order to generate RDF properties. The difference between the two approaches is striking, and leads to results that are difficult to compare. Table 10 shows two of the examples from Table 9 (the third one has no resolvable entity on the object position), but as they are extracted and formalized by Legalo.

For the reasons described above, this work has not attempted a direct comparison in terms of accuracy between OIE and Legalo: it would have required the transformation and formalization of OIE text segments into individuals and properties, and arbitrary choices on how to formalize complex segments. In the end, it is not a measure of their outputs that would be obtained, but a measure of the authors' ability to redesign OIE's output. For those interested in attempts to reuse heterogeneous NLP outputs for formal knowledge extraction, see [19].

7. Related Work

The work presented here can be categorised as formal binary relation discovery and labeling from arbitrary walks in connected fully-labeled multi-digraphs, which means in practice that it is not just relation extraction (relations are extracted by FRED [39], and Legalo reuses them): Legalo discovers complex relations that summarise information encoded in several nodes and edges in the graph (RDF graphs are actually connected, fully-labeled multi-digraphs). It considers certain paths along arbitrary directions of edges, aggregating some of the existing labels and concatenating them in order to provide property names that are typical of Linked Data vocabularies, and finally axiomatizing the properties with domain, range, subproperty, and property chain axioms.

In other words, Legalo tries to answer the following question: what is the relation that links two (possibly distant) entities in a RDF graph?
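As a minimal illustration of the label-aggregation idea, under the assumption that the labels along the selected path have already been collected into a list (the helper is illustrative, not Legalo's actual implementation):

def property_name(path_labels: list[str]) -> str:
    """Concatenate path labels into a camelCase Linked Data property name."""
    words = [w for label in path_labels for w in label.split()]
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])

# e.g. a path aggregating "invent" and "programming language" yields
# "inventProgrammingLanguage", as in the example of Section 3.4
print(property_name(["invent", "programming language"]))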


segment #1 | segment #2 | segment #3 | sentence
Eugene Nickerson | was quarterback of the football team and captain | | At St. Mark's School in Southborough, Massachusetts, Eugene Nickerson was quarterback of the football team and captain of the hockey team.
President Vladimir Putin | faced | with a West angry over Russia's actions in Ukraine | President Vladimir Putin, faced with a West angry over Russia's actions in Ukraine, has been boosting ties to the East.
Florence May Harding | developed a passion for the native flora of the arid West Darling region identifying | plants | Early in life Florence May Harding developed a passion for the native flora of the arid West Darling region, collecting and identifying plants.

Table 9: Some relations extracted by OIE from sample sentences.

rdf:subject | rdf:property | rdf:object | sentence
d:Eugene_Nickerson | l:quarterbackOf | d:American_Football | At St. Mark's School in Southborough, Massachusetts, Eugene Nickerson was quarterback of the football team and captain of the hockey team.
d:Vladimir_Putin | l:faceWithAngryOverActionLocatedIn | d:Ukraine | President Vladimir Putin, faced with a West angry over Russia's actions in Ukraine, has been boosting ties to the East.

Table 10: Two sample extractions by Legalo from the same sentences as in Table 9. For the sake of space we use prefix d: instead of dbpedia:, and l: instead of legalo:.

There is not much in the literature that is directly comparable, but work from two related fields can be contrasted with what Legalo does: relation extraction and automatic summarization.

The term Open Knowledge Extraction was previously introduced in the context of Artificial Intelligence [10]. That work defines OKE as “conversion of arbitrary input sentences into general world knowledge represented in a logical form possibly usable for inference”, hence perfectly compatible with what is defined in this paper. The cited work does not focus on Semantic Web technologies and languages, although it provides further support to our claims and definitions.

The closest works in relation extraction include Open Information Extraction (e.g. [26],[30]), relation extraction exploiting Linked Data [46][24], and question answering on Linked Data [25].

Relation extraction. The main antecedent to Open Information Extraction is probably the 1999 Open Mind Common Sense project [43], which adopted an ante-litteram crowdsourcing and games-with-a-purpose approach to populate a large informal knowledge base of facts expressed in triplet-based natural language. The crowd was left substantially free to express the subject, predicate, and object of a triplet, but during its evolution, forms started stabilizing, or were learnt by machine learning algorithms. Currently Open Mind is being merged with several other repositories in ConceptNet [21].

Open Information Extraction (aka Machine Reading), as it is currently known in the NLP community, performs bootstrapped (i.e. started by learning from a small set of seed examples, and then recursively and incrementally applied to a huge corpus, cf. [11]), open-domain, and unsupervised information extraction. For example, OIE is based on learning frequent triplet patterns from a huge shallow parsing of the Web, in order to create a huge knowledge base of triplets composed of text chunks.

This idea (on a smaller scale) was explored in [6], with the goal of resolving predicates to, or enlarging, a biomedical ontology. On the contrary, OIE extracts binary relations by segmenting texts into triplets. However, there is usually no attempt to resolve the subjects and objects of those triplets, nor to disambiguate or harmonize the predicates used in the triplets. Since predicates are not formally represented, they are hardly reusable, e.g. for annotating links with RDFa tags. See Section 6 for a comparison between OIE and Legalo, showing the difficulty of even designing a comparison test.

Overall, Open Information Extraction looks like a component for extractive summarization (see below). In [30], named entity resolution is used to resolve the subjects and objects, and there is an attempt to build a taxonomy of predicates, which are encoded as lexico-syntactic patterns rather than typical predicates.

Another important Open Information Extraction project is Never-Ending Language Learning (NELL) [5], a learning tool that has been processing the Web since 2010, building an evolving knowledge base of facts, categories and relations. In this case there is a (shallow) attempt to build a structured ontology of recognised entities and predicates from the facts learnt by NELL. In this work, NELL is used in an attempt to align the semantic relations resulting from Legalo to the NELL ontology.

The main difference between approaches such as OIE and NELL, and Legalo, is that the former focus mainly on extracting direct relations between entities, while Legalo focuses on revealing the semantics of relations between entities that can be: a) directly linked, b) implicitly linked, c) suggested by the presence of links in Web pages, d) indirectly linked, i.e. expressed by longer paths or n-ary relations. Legalo's novelty also resides in performing property label generation. From the acquisition perspective, Legalo is not bootstrapped, but it is open-domain and unsupervised.

Relation extraction and question answering targeted at Linked Data are quite different from both Open Information Extraction and Legalo, since they are oriented towards formal knowledge, but they are not bootstrapped, open-domain and unsupervised. They typically use a finite vocabulary of predicates (e.g. from the DBpedia ontology), and use their extensional interpretation in data (e.g. DBpedia) either to link two entities recognized in some text (as in [46][24]), or to find an answer to a question from which some entities have been recognized (as in [25]). The domain is therefore limited to the coverage of the vocabulary, and distant supervision is provided by the background knowledge (e.g. [1]). A growing repository of relationships extracted with this specific-domain, distantly supervised approach is sar-graphs [46].

Automatic summarisation. Automatic summarization deserves a short discussion, since ultimately Legalo's relation discovery can be used as a component for that application task. According to [40], the main goal of a summary is to present the main ideas from one or more documents in less space, typically less than half of one document. Different categorizations of summaries have been proposed: topic-based, indicative, generic, etc., but the most relevant seems to distinguish between “extracts” and “abstracts”. Extracts are summaries created by reusing portions of the input text verbatim, while abstracts are created by reformulating or regenerating the extracted content. An extraction step is needed in any case, but while extracts compress the text by squeezing out unimportant material and fuse the reused portions, abstracts typically model the text by accessing external information, applying frames, deep parsing, etc., eventually generating a summary that in principle could contain no word in common with the original text.

Extractive summarization is now in mass usage, e.g. with the snippets provided by search engines. It has serious limits, because the size and relevance of the extracts can be questionable, and not as accurate as a human would be.

Legalo can be considered closer to abstractive summarization, since it can be used to build frame-based abstractive summaries of texts, consisting in binary relation discovery, which can then be filtered for relevance. The current implementation of Legalo is not designed in view of abstractive summarization, therefore it was not evaluated for that task, but it is appropriate to report at least one relevant example of related work in this area.

Opinosis [18] is the state-of-the-art system for abstractive summarisation. It performs graph-based summarisation, generating concise abstractive summaries of highly redundant opinions. It uses a word graph data structure to represent the text, whereas Legalo uses a semantic graph. As the authors say: “Opinosis is a shallow abstractive summariser as it uses the original text itself to generate summaries. This is unlike a true abstractive summariser that would need a deeper level of natural language understanding”. Legalo is indeed based on FRED [39], which provides such a deeper level of understanding.

In order to be considered an abstractive summariser, Legalo will need to be complemented with more capabilities: to rank discovered relations across an entire text or even multiple texts, to associate them in a way that final users can make sense of, and to evaluate summaries appropriately. Results from both abstractive summarisation (e.g. [49][18][23]) and RDF graph summarisation (e.g. [48][37][4]) can be reused for that purpose.

8. Conclusion and future work

Conclusion. This paper presents a novel approach for Open Knowledge Extraction, and its implementation called Legalo, for uncovering the semantics of hyperlinks based on frame-based formal representation of natural language text and heuristics associated with subgraph patterns. The main novel aspects of the approach are: relevant relation assessment, label generation, and Semantic Web property generation and formalisation.

The working hypothesis is that hyperlinks (either created by humans or by knowledge extraction tools) provide a pragmatic trace of semantic relations between two entities, and that such semantic relations, their subjects and objects, can be revealed by processing their linguistic traces: the sentences that embed the hyperlinks. Evaluation experiments conducted with the help of a crowdsourcing platform confirm this hypothesis and show very high performance: the method is able to predict the actual presence of a relation with high precision (average F-measure 0.92), and to generate accurate RDF properties between the hyperlinked entities in single-relation corpora (average F-measure 0.84), in the Wikipedia pagelink corpus (average F-measure 0.84), as well as in the challenging open-domain corpus (average F-measure 0.78). The accuracy remains constant across the crowdsourced evaluation and the comparison to a (crowdsourced) gold standard for the open-domain corpus. We also provide alignments to Semantic Web vocabularies with a precision of 0.84.

A demo of the Legalo Web service is available online13, as well as the prototype dedicated to Wikipedia pagelinks27, and the binary properties produced in this study can be accessed by means of a SPARQL endpoint36.

Ongoing work. Current work concentrates on designing and testing new heuristics, as required by evidence emerging from experiments and tests (cf. e.g. Section 6), on identifying new ways of aligning the relations generated by Legalo to existing ontologies, and on discovering regularities in the relation taxonomies that are increasingly discovered. Additionally, new experiments are under development for assessing Legalo's scalability on the diversity and size of the Web.

Future work. The main research line for the future is to apply Legalo to application tasks. An obvious one is a real abstractive summarisation task, both at single-text and multiple-text level, evaluating the results against state-of-the-art tools. The challenges there include at least: (i) managing multiple (and possibly dynamically evolving) Open Knowledge Extraction graphs, (ii) assessing the relevance of discovered relations, and their dependencies within the same text or across multiple texts, and (iii) generating factoid sequences that make sense to a final user of abstractive summaries. Other applications of Legalo are also envisioned, including question answering and textual entailment.

References

[1] I. Augenstein, D. Maynard, and F. Ciravegna. Relation extraction from the web using distant supervision. In K. Janowicz, S. Schlobach, P. Lambrix, and E. Hyvönen, editors, Proceedings of Knowledge Engineering and Knowledge Management - 19th International Conference, EKAW 2014, volume 8876 of Lecture Notes in Computer Science, pages 26–41, Linköping, Sweden, 2014. Springer. doi:10.1007/978-3-319-13704-9_3.

[2] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009. doi:10.4018/jswis.2009081901.

[3] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - A crystallization point for the Web of Data. International Journal of Web Semantics, 7(3):154–165, 2009. doi:10.1016/j.websem.2009.07.002.

[4] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In A. Hameurlain, A. M. Tjoa, and R. Wagner, editors, Proceedings of the 23rd International Workshop on Database and Expert Systems Applications (DEXA), pages 261–266, Vienna, Austria, 2012. IEEE Computer Society. doi:10.1109/DEXA.2012.38.

[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In M. Fox and D. Poole, editors, Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI), pages 1306–1313, Georgia, USA, 2010. AAAI Press.

[6] M. Ciaramita, A. Gangemi, E. Ratsch, J. Šaric, and I. Rojas. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In L. P. Kaelbling and A. Saffiotti, editors, Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pages 659–664, Edinburgh, Scotland, 2005. Professional Book Center.

[7] B. Comrie. Language universals and linguistic typology: Syntax and morphology. University of Chicago Press, Chicago, USA, 1989.

[8] W. Croft. Syntactic categories and grammatical relations: The cognitive organisation of information. University of Chicago Press, Chicago, USA, 1991.

[9] M. d'Aquin, E. Motta, M. Sabou, S. Angeletou, L. Gridinoc, V. Lopez, and D. Guidi. Towards a new generation of semantic web applications. IEEE Intelligent Systems, 23(3):80–83, 2008. doi:10.1109/MIS.2008.54.

[10] B. Van Durme and L. K. Schubert. Open knowledge extraction using compositional language processing. In R. Basili, J. Bos, and A. Copestake, editors, Proceedings of the 2008 Conference on Semantics in Text (STEP), pages 239–254, Venice, Italy, 2008. The Association for Computational Linguistics.

[11] O. Etzioni, M. Banko, and M. J. Cafarella. Machine reading. In Y. Gil and R. J. Mooney, editors, Proceedings of the Twenty-First Conference on Artificial Intelligence (AAAI), pages 1517–1519, Boston, Massachusetts, 2006. AAAI Press.

[12] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the web. Communications of the ACM, 51(12):68–74, 2008. doi:10.1145/1409360.1409378.

[13] J. Euzenat and P. Shvaiko. Ontology Matching, Second Edition. Springer, Berlin, Germany, 2013. doi:10.1007/978-3-642-38721-0.

[14] P. Ferragina and U. Scaiella. TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities). In J. Huang, N. Koudas, G. J. F. Jones, X. Wu, K. Collins-Thompson, and A. An, editors, Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM), pages 1625–1628, Toronto, Canada, 2010. ACM. doi:10.1145/1871437.1871689.

[15] C. J. Fillmore. Frame semantics. In The Linguistic Society of Korea, editor, Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Co., 1982. doi:10.1016/B0-08-044854-2/00424-7.

[16] T. Flati, D. Vannella, T. Pasini, and R. Navigli. Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project. In Toutanova and Wu [44], pages 945–955.

[17] M. Fossati, C. Giuliano, and S. Tonelli. Outsourcing FrameNet to the crowd. In P. Fung and M. Poesio, editors, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), Volume 2: Short Papers, pages 742–747, Sofia, Bulgaria, 2013. The Association for Computational Linguistics.

[18] K. Ganesan, C. Zhai, and J. Han. Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In C. Huang and D. Jurafsky, editors, Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 340–348, Beijing, China, 2010. Tsinghua University Press.

[19] A. Gangemi. A Comparison of Knowledge Extraction Tools for the Semantic Web. In P. Cimiano, O. Corcho, V. Presutti, L. Hollink, and S. Rudolph, editors, Proceedings of the 10th Extended Semantic Web Conference (ESWC), The Semantic Web: Semantics and Big Data, volume 7882 of Lecture Notes in Computer Science, pages 351–366, Montpellier, France, 2013. Springer. doi:10.1007/978-3-642-38288-8_24.

[20] A. Gangemi and V. Presutti. Ontology Design Patterns. In S. Staab and R. Studer, editors, Handbook on Ontologies, 2nd Edition, pages 221–243. Springer Verlag, 2009. doi:10.1007/978-3-540-92673-3_10.

[21] C. Havasi, R. Speer, and J. Alonso. ConceptNet: A lexical resource for common sense knowledge. In N. Nicolov, G. Angelova, and R. Mitkov, editors, Recent advances in natural language processing V: selected papers from RANLP 2007, volume 309 of Current Issues in Linguistic Theory, pages 269–280. John Benjamins Publishing Company, 2007. doi:10.1075/cilt.309.22hav.

[22] R. Isele and C. Bizer. Active learning of expressive linkage rules using genetic programming. International Journal of Web Semantics, 23:2–15, 2013. doi:10.1016/j.websem.2013.06.001.

[23] H. Ji, B. Favre, W.-P. Lin, D. Gillick, D. Hakkani-Tur, and R. Grishman. Open-domain multi-document summarization via information extraction: Challenges and prospects. In T. Poibeau, H. Saggion, J. Piskorski, and R. Yangarber, editors, Multi-source, Multilingual Information Extraction and Summarization, Theory and Applications of Natural Language Processing, pages 177–201. Springer, Berlin Heidelberg, 2013.

[24] A. Khalili, S. Auer, and A.-C. Ngonga Ngomo. conTEXT – Lightweight Text Analytics using Linked Data. In V. Presutti, C. d'Amato, F. Gandon, M. d'Aquin, S. Staab, and A. Tordai, editors, The Semantic Web: Trends and Challenges - 11th International Conference (ESWC), Proceedings, volume 8465 of Lecture Notes in Computer Science, pages 628–643, Crete, Greece, 2014. Springer. doi:10.1007/978-3-319-07443-6_42.

[25] V. Lopez, A. Nikolov, M. Sabou, V. S. Uren, E. Motta, and M. d'Aquin. Scaling up question-answering to linked data. In P. Cimiano and H. S. Pinto, editors, Proceedings of the 17th International Conference on Knowledge Engineering and Management by the Masses (EKAW), pages 193–210, Lisbon, Portugal, 2010. Springer. doi:10.1007/978-3-642-16438-5_14.

[26] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning for information extraction. In Tsujii et al. [45], pages 523–534.

[27] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In K.-Y. Su, editor, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011, Suntec, Singapore, 2009. Association for Computational Linguistics. doi:10.3115/1690219.1690287.

[28] A. Moro and R. Navigli. Integrating syntactic and semantic analysis into the open information extraction paradigm. In F. Rossi, editor, Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), pages 2148–2154, Beijing, China, 2013. AAAI Press/IJCAI.

[29] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.

[30] N. Nakashole, G. Weikum, and F. Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In Tsujii et al. [45], pages 1135–1145.

[31] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. doi:10.1145/375360.375365.

[32] A. G. Nuzzolese, A. Gangemi, and V. Presutti. Gathering Lexical Linked Data and Knowledge Patterns from FrameNet. In M. A. Musen and Ó. Corcho, editors, Proceedings of the Sixth International Conference on Knowledge Capture (K-CAP), pages 41–48, Banff, AB, Canada, 2011. ACM. doi:10.1145/1999676.1999685.

[33] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, C. Welty, H. Alani, J. Taylor, A. Bernstein, L. Kagal, N. F. Noy, and E. Blomqvist, editors, Proceedings of the 10th International Semantic Web Conference (ISWC), Part I, volume 7031 of Lecture Notes in Computer Science, pages 520–536, Bonn, Germany, 2011. Springer. doi:10.1007/978-3-642-25073-6_33.

[34] J. Oosterman, A. Nottamkandath, C. Dijkshoorn, A. Bozzon, G. Houben, and L. Aroyo. Crowdsourcing knowledge-intensive tasks in cultural heritage. In F. Menczer, J. Hendler, W. H. Dutton, M. Strohmaier, C. Cattuto, and E. T. Meyer, editors, ACM Web Science Conference (WebSci), pages 267–268, IN, USA, 2014. ACM. doi:10.1145/2615569.2615644.

[35] S. Peroni, A. Gangemi, and F. Vitali. Dealing with markup semantics. In C. Ghidini, A. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems (I-SEMANTICS), ACM International Conference Proceeding Series, pages 111–118, Graz, Austria, 2011. ACM. doi:10.1145/2063518.2063533.

[36] G. Pirrò and J. Euzenat. A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness. In P. F. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Z. Pan, I. Horrocks, and B. Glimm, editors, Proceedings of the 9th International Semantic Web Conference (ISWC), Part I, pages 615–630, 2010. doi:10.1007/978-3-642-17746-0_39.

[37] V. Presutti, L. Aroyo, A. Adamou, A. Gangemi, and G. Schreiber. Extracting core knowledge from linked data. In O. Hartig, A. Harth, and J. Sequeda, editors, Proceedings of the Second International Workshop on Consuming Linked Data (COLD), volume 782 of CEUR Workshop Proceedings, Bonn, Germany, 2011. CEUR-WS.org.

[38] V. Presutti, S. Consoli, A. G. Nuzzolese, D. R. Recupero, A. Gangemi, I. Bannour, and H. Zargayouna. Uncovering the semantics of Wikipedia pagelinks. In K. Janowicz, S. Schlobach, P. Lambrix, and E. Hyvönen, editors, Knowledge Engineering and Knowledge Management - Proceedings of the 19th International Conference (EKAW), volume 8876 of Lecture Notes in Computer Science, pages 413–428, Linköping, Sweden, 2014. Springer. doi:10.1007/978-3-319-13704-9_32.

[39] V. Presutti, F. Draicchio, and A. Gangemi. Knowledge extraction based on Discourse Representation Theory and Linguistic Frames. In A. ten Teije, J. Völker, S. Handschuh, H. Stuckenschmidt, M. d'Aquin, A. Nikolov, N. Aussenac-Gilles, and N. Hernandez, editors, Knowledge Engineering and Knowledge Management - Proceedings of the 18th International Conference (EKAW), volume 7603 of Lecture Notes in Computer Science, pages 114–129, Galway City, Ireland, 2012. Springer. doi:10.1007/978-3-642-33876-2_12.

[40] D. R. Radev, E. Hovy, and K. McKeown. Introduction to the special issue on summarization. Computational Linguistics, 28(4):399–408, 2002. doi:10.1162/089120102762671927.

[41] G. Rizzo, R. Troncy, S. Hellmann, and M. Bruemmer. NERD meets NIF: Lifting NLP extraction results to the linked data cloud. In C. Bizer, T. Heath, T. Berners-Lee, and M. Hausenblas, editors, Proceedings of the 5th Workshop on Linked Data on the Web (LDOW) co-located with the International World Wide Web Conference (WWW), volume 937 of CEUR Workshop Proceedings, Lyon, France, 2012. CEUR-WS.org.

[42] K. K. Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania, 2006.

[43] P. Singh. The public acquisition of commonsense knowledge. In J. Karlgren, editor, Proceedings of the AAAI Spring Symposium: Acquiring (and Using) Linguistic (and World) Knowledge for Information Access, Palo Alto, CA, USA, 2002. AAAI Press.

[44] K. Toutanova and H. Wu, editors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, Baltimore, Maryland, 2014. The Association for Computational Linguistics.

[45] J. Tsujii, J. Henderson, and M. Pasca, editors. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Jeju Island, Korea, 2012. The Association for Computational Linguistics.

[46] H. Uszkoreit and F. Xu. From strings to things - sar-graphs: A new type of resource for connecting knowledge and language. In S. Hellmann, A. Filipowska, C. Barrière, P. N. Mendes, and D. Kontokostas, editors, Proceedings of the NLP & DBpedia workshop co-located with the 12th International Semantic Web Conference (ISWC 2013), volume 1064 of CEUR Workshop Proceedings, Sydney, Australia, 2013. CEUR-WS.org.

[47] D. Vannella, D. Jurgens, D. Scarfini, D. Toscani, and R. Navigli. Validating and Extending Semantic Knowledge Bases using Video Games with a Purpose. In Toutanova and Wu [44], pages 1294–1304. doi:10.3115/v1/P14-1122.

[48] X. Zhang, G. Cheng, and Y. Qu. Ontology summarization based on RDF sentence graph. In C. Williamson and M. E. Zurko, editors, Proceedings of the 16th International Conference on World Wide Web, pages 707–716, Banff, Alberta, 2007. ACM. doi:10.1145/1242572.1242668.

[49] L. Zhou, C.-Y. Lin, D. S. Munteanu, and E. Hovy. ParaEval: Using Paraphrases to Evaluate Summaries Automatically. In R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson, editors, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 447–454, New York, New York, 2006. The Association for Computational Linguistics.