![Page 1: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/1.jpg)
The KNOWLEDGESTORE: an Integrated Framework for Ontology
Population
Bernardo Magnini
Fondazione Bruno Kessler Trento, Italy
Joint work with: Roldano Cattoni, Francesco Corcoglioniti,
Christian Girardi, Marco Rospocher, Luciano Serafini
Darmstadt, October 17, 2014
![Page 2: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/2.jpg)
NewsReader recording history
by processing massive streams of daily news
ICT 316404 FP7-ICT-2011-8
www.newsreader-project.eu
![Page 3: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/3.jpg)
HOW DID THE WORLD CHANGE YESTERDAY?
![Page 4: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/4.jpg)
Can we handle the news?
• Information broker LexisNexis archives:
  – 1.5 million news articles on a single working day
  – 30,000 different sources
![Page 5: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/5.jpg)
How did the Car industry change
during the financial crisis?
• 6 million English articles on the car industry in the LexisNexis archive for the last 10 years
• 2 million Google hits for “Volkswagen takeover” not sorted by publication date
![Page 6: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/6.jpg)
A short history of VW and Porsche
![Page 7: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/7.jpg)
How to measure the volume of change?
[Figure: timeline from 1995 to 2015; vertical axis: volume of entity; annotations: 200k mentions per year, 10k entities per year; region labels: Speculation, Past, New]
6 MILLION ARTICLES: HOW MANY CHANGES?
Example statements placed on the timeline:
• On 16 September 2008, Porsche increased its shares by another 4.89%, in effect taking control of the company, with more than 35% of the voting rights.
• 6 Jan 2009 – Porsche has been on a quest to takeover VW for more than two years.
![Page 8: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/8.jpg)
DAILY NEWS TSUNAMI
• Volume is very big: 1.5 million items each working day
• Repeated and duplicated: we cannot distinguish new from old
• Incomplete and piecemeal: we need to read all to get a complete picture
• Actual and speculated events: we cannot distinguish the realis from irrealis (speculations, fears and hopes)
• Inconsistent and contradictory: we cannot tell true from false (who to believe)
• Opinionated and selective: we do not realize the bias of our sources
![Page 9: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/9.jpg)
Requirements for large-scale ontology population
• Cope with noise in information extraction
  – E.g. event extraction is still not a consolidated technology
• Fill the gap between structured and unstructured data
  – Knowledge in the information extraction loop
  – Take advantage of “background knowledge” (e.g. Wikipedia)
• Larger variety of facts (e.g. relations, events)
• Exploit cross-document co-reference
• Ontology Population as an incremental process
![Page 10: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/10.jpg)
NewsReader Architecture
![Page 11: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/11.jpg)
KnowledgeStore: Objectives
• Three-layer architecture:
  – resource, mention, entity
• Stores massive amounts of data about:
  – entities extracted from the data
  – links to the data sources
  – background domain knowledge
• Provides access via text-based search and semantic query languages
• Provides reasoning services over the knowledge it contains
![Page 12: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/12.jpg)
The Knowledge Store: Content
• Resource Layer
  – Textual documents (e.g. a newspaper article)
  – But also images and videos
  – Linguistic annotations of the textual resources (e.g. a whole document processed by an NLP pipeline); NAF format in NewsReader
  – Relevant metadata (e.g. date, source, author, language)
Resource Layer
![Page 13: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/13.jpg)
• Mention Layer
  – Relevant portions of a resource document with their linguistic annotations (e.g. names of people and places, an event mention, the area representing a person in an image)
  – All orthographical variants are stored (e.g. “B. Magnini” and “Bernardo Magnini”)
  – Coreference among mentions (intra- and cross-document)
  – Links to the data sources from which mentions have been extracted
  – Relevant metadata (e.g. frequency of occurrence, confidence of mention recognition)
Resource Layer
Mention Layer
The Knowledge Store: Content
![Page 14: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/14.jpg)
• The Entity Layer
  – Unique instances of mentions that have been extracted from the data, after mention coreference
  – Represented as RDF triples
  – Orthographical variants are not recorded
  – Representation based on ontological schemas (e.g. SEM for events)
  – Links to background domain knowledge (e.g. Wikipedia)
  – Relevant metadata (e.g. factuality, provenance, confidence)
Resource Layer
Entity Layer
Mention Layer
The Knowledge Store: Content
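The three layers described on these slides can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions: the class names, the `ex:` URIs and the sample text are hypothetical and are not the KnowledgeStore API. It shows a mention as a character span of a resource, and an entity as the single instance shared by all coreferring mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Resource layer: a stored document plus its metadata."""
    uri: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Mention:
    """Mention layer: a span of a resource, identified by character offsets."""
    uri: str
    resource: Resource
    begin: int
    end: int

    @property
    def anchor(self) -> str:
        # The surface form is read back from the resource, so every
        # orthographical variant is kept exactly as written.
        return self.resource.text[self.begin:self.end]


@dataclass
class Entity:
    """Entity layer: one instance shared by all coreferring mentions."""
    uri: str
    mentions: list = field(default_factory=list)

    def surface_forms(self) -> set:
        return {m.anchor for m in self.mentions}


# Hypothetical sample data (not taken from the slides).
doc = Resource("ex:doc1", "B. Magnini spoke. Bernardo Magnini then left.",
               {"language": "en"})
m1 = Mention("ex:m1", doc, 0, 10)
m2 = Mention("ex:m2", doc, 18, 34)
person = Entity("ex:Bernardo_Magnini", [m1, m2])
print(sorted(person.surface_forms()))
```

The variants live only at the mention level; the entity itself carries a single URI, which mirrors the "orthographical variants are not recorded" point of the entity layer.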
![Page 15: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/15.jpg)
[Figure: UML class diagram of the KnowledgeStore data model. Classes and attributes recoverable from the diagram are listed below, each attribute under the class name it follows in the extracted text.]
• ks:Resource: uri: URI; ks:storedAs: ks:Representation; rdfs:comment: string
• News: dct:title: string; dct:publisher: dct:Agent; dct:creator: dct:Agent; dct:created: date; dct:spatial: dct:Location; dct:temporal: time:Interval; dct:subject: URI; dct:rights: dct:RightsStatement; dct:rightsHolder: dct:Agent; dct:language: dct:LinguisticSystem; originalFileName: string; originalFileFormat: string; originalPages: int
• NAFDocument: version: string; dct:identifier: string; layer: NAFLayer[1..*]; dct:creator: NAFProcessor[1..*]; dct:language: dct:LinguisticSystem
• ks:Mention: uri: URI; nif:beginIndex: int; nif:endIndex: int; nif:anchorOf: string; rdfs:comment: string
• EntityMention: localCorefID: string
• ObjectMention: syntacticHead: string; syntacticType: SyntacticType; entityType: EntityType; entityClass: EntityClass
• TimeOrEventMention
• TimeMention: value: string; timeType: TIMEX3Type; functionInDocument: FunctionInDocument; quant: string; freq: string; mod: TIMEX3Modifier; temporalFunction: bool
• EventMention: eventClass: EventClass; pred: string; certainty: Certainty; factuality: Factuality; factualityConfidence: float; pos: PartOfSpeech; tense: Tense; aspect: Aspect; polarity: Polarity; modality: string; framenetRef: URI; propbankRef: URI; verbnetRef: URI; nombankRef: URI
• ValueMention: valueType: ValueType
• SignalMention
• CSignalMention
• RelationMention
• Participation: thematicRole: string; framenetRef: URI; propbankRef: URI; verbnetRef: URI; nombankRef: URI
• TLink: relType: TLinkType
• CLink
• SLink
• GLink
• ks:Entity: uri: URI
• ks:Axiom: uri: URI; ks:encodedBy: rdf:Statement[1..*]; crystallized: bool; dc:source: URI; confidence: float; rdfs:comment: string
• ks:Context: uri: URI; sem:hasPointOfView: sem:PointOfView; sem:hasTimeValidity: time:Interval
Association labels in the diagram: ks:containedIn (1), annotationOf, ks:describes (1..*), ks:describedBy, ks:expressedBy (0..*), ks:expresses, ks:refersTo, ks:referredBy, ks:holdsIn (1), source (1) and target (1) on the link classes, signal (0..1), csignal (0..1), anchorTime, beginPoint, endPoint, valueFromFunction.
![Page 16: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/16.jpg)
Resource Layer
Mention Layer
Entity Layer
Resource: R3
  title: Rossi torna single (Rossi is again single)
  creator: l’Adige
  format: Text
  date: 2012/02/27
  content: GOSSIP - Il campione della Ducati Valentino Rossi e la sua fidanzata Marwa Klebi si sono lasciati. … (Ducati champion Valentino Rossi and his girlfriend Marwa Klebi break up. …)

Mention: M2 (occurs in R3)
  extent: Valentino Rossi
  type: PER
  firstname: Valentino
  lastname: Rossi
  start: 33

Entity: E1 (referenced by M2)
Facts:

| predicate | object | source |
|---|---|---|
| type | Pilot | motogp.com |
| firstName | Valentino | motogp.com, R3 |
| lastName | Rossi | motogp.com, R3 |
| gender | male | motogp.com, R3 |
| birthDate | 1979-02-16 | wikipedia.it |
| birthPlace | Urbino | wikipedia.it |
| height | 182 | motogp.com |
| weight | 67 | motogp.com |
| team | Ducati | R3 |
The Knowledge Store: Content
![Page 17: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/17.jpg)
High Level Data Model
[Figure: class diagram. Classes: Resource, Mention (with EntityMention and RelationMention), Entity, Statement (attribute crystallized: boolean). Association labels: part of, arguments, refers to, described by.]
Resource Layer
dbpedia:United_Nations rdf:type yago:PoliticalSystems
dbpedia:United_Nations rdfs:label "United Nations"@en
dbpedia:United_Nations foaf:homepage <http://www.un.org/>
dbpedia:United_Nations
Entity Layer Mention Layer
Indonesia Hit By Earthquake
A United Nations assessment team was dispatched to the province after two quakes, measuring 7.6 and 7.4, struck west of Manokwari Jan. 4. At least five people were killed, 250 others injured and more than 800 homes destroyed by those temblors, according to the UN.
![Page 18: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/18.jpg)
KnowledgeStore: Context
[Figure: the Knowledge Store (Resource Layer, Mention Layer, Entity Layer) surrounded by the components that read from and write to it.]
• Resource populators and other knowledge populators (Source 1, Source 2, . . .): store news, store background knowledge
• Resource processors (read resources, write annotations):
  – tokenization & lemmatization
  – part-of-speech tagging
  – word sense disambiguation
  – parsing (dependency/constituency)
  – keyphrase extraction
• Mention processors (read annotations & mentions, write mentions):
  – named entity recognition
  – event recognition
  – semantic role labelling
  – TLink / CLink / SLink tagging
  – entity linking (wikification…)
• Entity processors (read mentions & background knowledge, write entities & statements):
  – entity & event coreference
  – event chaining
  – event significance & rel.
  – narrative graph extraction
  – crystallization
• Decision Support System: (mixed) queries
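The read/write pattern of this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the dictionary-based `store`, the function names and the tiny `GAZETTEER` stand in for the real KnowledgeStore and for real NLP processors. The point is only the flow: resource processors read resources and write annotations, mention processors read annotations and write mentions, and entity processors read mentions and write entities.

```python
import re

# In-memory stand-in for the three layers (hypothetical layout).
store = {"resources": {}, "annotations": {}, "mentions": [], "entities": {}}

# Hypothetical gazetteer standing in for a real named entity recognizer.
GAZETTEER = {"UN": "ORG", "Manokwari": "LOC"}


def populate(uri, text):
    """Resource populator: store a news text."""
    store["resources"][uri] = text


def resource_processor():
    """Read resources, write annotations (tokenization only)."""
    for uri, text in store["resources"].items():
        store["annotations"][uri] = [
            (m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)
        ]


def mention_processor():
    """Read annotations, write mentions (gazetteer lookup)."""
    for uri, tokens in store["annotations"].items():
        for token, begin, end in tokens:
            if token in GAZETTEER:
                store["mentions"].append(
                    {"resource": uri, "begin": begin, "end": end,
                     "anchor": token, "type": GAZETTEER[token]}
                )


def entity_processor():
    """Read mentions, write entities (naive cross-document grouping)."""
    for mention in store["mentions"]:
        key = (mention["type"], mention["anchor"].lower())
        store["entities"].setdefault(key, []).append(mention)


populate("ex:news_1", "The UN sent a team to Manokwari.")
populate("ex:news_2", "Homes were destroyed, according to the UN.")
for step in (resource_processor, mention_processor, entity_processor):
    step()

for key, mentions in store["entities"].items():
    print(key, [(m["resource"], m["begin"], m["end"]) for m in mentions])
```

Because each processor only talks to the store, stages can be rerun or replaced independently, which fits the incremental view of ontology population stated in the requirements.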
![Page 19: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/19.jpg)
Resource annotation
News nwr:news_00001 (file news_00001.txt)

| property | value |
|---|---|
| dc:title | Indonesia’s West Papua Province Hit by Earthquake (Update2) |
| dc:creator | bloomberg.com |
| dc:language | EN |
| dc:issued | 2009-01-07T01:55:00-05:00 |
| nfo:wordCount | 287 |
| nfo:characterCount | 1778 |
| file | news_00001.txt |
| … | … |
![Page 20: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/20.jpg)
Mentions: NAF example
Toyota brought Lexus to Japan in 2005.
![Page 21: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/21.jpg)
Mentions: NAF example
Toyota brought Lexus to Japan in 2005.
![Page 22: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/22.jpg)
Mentions (organizations)

ORG Mention nwr:orgmen_00001

| property | value |
|---|---|
| nif:beginIndex | 146 |
| nif:endIndex | 160 |
| nif:anchorOf | United Nations |
| foaf:name | United Nations |
| … | … |

ORG Mention nwr:orgmen_00002

| property | value |
|---|---|
| nif:anchorOf | the UN |
| head | UN |
| foaf:name | UN |
| … | … |

Legend: Non-event entities, Events, Time expressions, TLink signals, CLink signals
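A mention record of this kind can be derived from the text with plain string offsets. The sketch below is an assumption-laden illustration, not the NewsReader code: `make_mention` and `is_consistent` are hypothetical helpers, and the sample text is a shortened version of the news sentence, so its offsets differ from the 146 and 160 shown on the slide. Note that the slide's values are consistent with an end-exclusive convention, since 160 - 146 equals the length of "United Nations".

```python
def make_mention(uri, text, surface):
    """Build a NIF-style mention record for the first occurrence of surface."""
    begin = text.index(surface)
    return {
        "uri": uri,
        "nif:beginIndex": begin,
        "nif:endIndex": begin + len(surface),  # end-exclusive (assumption)
        "nif:anchorOf": surface,
    }


def is_consistent(mention, text):
    """Check that the offsets really select the anchor string."""
    span = text[mention["nif:beginIndex"]:mention["nif:endIndex"]]
    return span == mention["nif:anchorOf"]


# Shortened sample text; offsets therefore differ from the slide.
text = ("A United Nations assessment team was dispatched to the province "
        "after two quakes. Homes were destroyed, according to the UN.")

m1 = make_mention("nwr:orgmen_00001", text, "United Nations")
m2 = make_mention("nwr:orgmen_00002", text, "the UN")
print(m1["nif:beginIndex"], m1["nif:endIndex"], is_consistent(m1, text))
print(m2["nif:anchorOf"], is_consistent(m2, text))
```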
![Page 23: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/23.jpg)
Mentions: linking to resource
News nwr:news_00001

| property | value |
|---|---|
| dc:title | Indonesia’s West Papua Province Hit by Earthquake (Update2) |
| dc:creator | bloomberg.com |
| dc:language | EN |
| dc:issued | 2009-01-07T01:55:00-05:00 |
| nfo:wordCount | 287 |
| nfo:characterCount | 1778 |
| file | news_00001.txt |
| … | … |

Mentions linked to the news (anchor texts paired with mentions in the order shown on the slide):

| mention type | mention | dc:isPartOf | anchor text |
|---|---|---|---|
| ORG Mention | nwr:orgmen_00001 | nwr:news_00001 | “United Nations” |
| ORG Mention | nwr:orgmen_00002 | nwr:news_00001 | “the UN” |
| Event Mention | nwr:evmen_00001 | nwr:news_00001 | “quakes” |
| Event Mention | nwr:evmen_00002 | nwr:news_00001 | “temblors” |
| CLINK Mention | nwr:clmen_00001 | nwr:news_00001 | “destroyed by those temblors” |
| Part. Mention | nwr:pmen_00002 | nwr:news_00001 | “according to the UN” |
![Page 24: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/24.jpg)
Entities: SEM format
ENTITY INSTANCE
<http://dbpedia.org/resource/Toyota>
    a nwr:organization ;
    rdfs:label "Toyota" , "Toyota motor" ;
    gaf:denotedBy
        <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4RM.xml#char=98,104&word=w18&term=t18> ,
        <nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#char=44934,44940&word=w8114&term=t8114> .
![Page 25: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/25.jpg)
EVENT INSTANCE
<nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#sellEvent>
    a sem:Event , fn:Commerce_sell ;
    rdfs:label "sell" ;
    gaf:denotedBy
        <nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#char=1352,1356&word=w251&term=t251> ,
        <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#char=1536,1540&word=w275&term=t275> .
Entities: SEM format
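The gaf:denotedBy values above point back from an instance to its mentions through the URI fragment. A small Python sketch can unpack such a fragment; it assumes the layout `char=begin,end&word=…&term=…` exactly as it appears on the slides, and the reading of `char` as character offsets is an interpretation, not something the slides state. `parse_mention_uri` is a hypothetical helper name.

```python
from urllib.parse import parse_qs


def parse_mention_uri(uri):
    """Split a mention URI into its document part and fragment fields."""
    document, _, fragment = uri.partition("#")
    fields = parse_qs(fragment)
    begin, end = (int(value) for value in fields["char"][0].split(","))
    return {
        "document": document,
        "begin": begin,
        "end": end,
        "word": fields["word"][0],
        "term": fields["term"][0],
    }


uri = ("nwr:data/cars/2013/1/1/5760-PM51-JD34-P4RM.xml"
       "#char=98,104&word=w18&term=t18")
info = parse_mention_uri(uri)
print(info)
```

For this URI the span length is 104 - 98 = 6, which matches the length of the label "Toyota".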
![Page 26: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/26.jpg)
Semantic relations as named graphs
<nwr:/data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#pr25,rl55> {
    <nwr:data/cars/2013/1/1/5722-S821-F0J6-D48N.xml#sellEvent>
        sem:hasActor ; fn:Commerce_sell#Seller
        <http://dbpedia.org/resource/Magyar_Suzuki> .
}
<nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#pr46,rl114> {
    <nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#sellEvent>
        sem:hasPlace <http://dbpedia.org/resource/South_Africa> .
}
<nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#docTime_26> {
    <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#sellEvent>
        sem:hasTime <nwr:time/2013-01-01> .
}
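Putting each relation in its own named graph gives the relation an identifier that other statements can talk about. The following Python sketch models this with plain (graph, subject, predicate, object) tuples. It is a toy under stated assumptions: the shortened `ex:` names and the `ex:some_source` value are hypothetical, and no RDF library is involved.

```python
# Each relation lives in its own named graph: (graph, subject, predicate, object).
quads = [
    ("ex:graph1", "ex:sellEvent", "sem:hasActor", "dbpedia:Magyar_Suzuki"),
    ("ex:graph2", "ex:sellEvent", "sem:hasPlace", "dbpedia:South_Africa"),
    ("ex:graph3", "ex:sellEvent", "sem:hasTime", "nwr:time/2013-01-01"),
    # A statement ABOUT a relation uses the graph name as its subject.
    ("ex:meta", "ex:graph1", "prov:wasAttributedTo", "ex:some_source"),
]


def relations_of(event):
    """Return {graph name: (predicate, object)} for one event."""
    return {g: (p, o) for g, s, p, o in quads if s == event}


def statements_about(graph):
    """Return the (predicate, object) pairs whose subject is a graph name."""
    return [(p, o) for _, s, p, o in quads if s == graph]


relations = relations_of("ex:sellEvent")
for graph, (predicate, obj) in relations.items():
    print(graph, predicate, obj, statements_about(graph))
```

Only `ex:graph1` carries extra metadata in this toy data, so the other two relations print an empty list.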
![Page 27: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/27.jpg)
Properties of relations
PROVENANCE
<nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml#pr25,rl55>
    gaf:denotedBy <nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml#rl55> ;
    prov-o:wasAttributedTo <nwr:sourceowner/Peru_Autos_Report> .

FACTUALITY
<nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#facValue_1125> {
    <nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#sellEvent>
        nwr:hasFactBankValue "CT+" .
}
![Page 28: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/28.jpg)
High-level KS architecture
[Figure: layered architecture diagram with a client side and a server side; further labels: Data Sources, Rule-based inference, (partial) Replication.]
Clients:
• Applications: direct access to KS API; (some) linguistic processors
• Populators: loading resources, annotations, background knowledge in specific formats (e.g., RDF, TAF)
• Mgmt. Scripts: start / stop, backup / restore, configuration, statistics gathering
Server:
• HTTP REST API: CRUD & bulk services, (mixed) queries, specialized access patterns, map/reduce hook
• KS Frontend: API implementation on top of lower layers; optional transactions & data validation
• HBase + Hadoop: distributed & replicated for scalability and fault-tolerance
• Triple Store: possibly distributed
• Stored data: Mention, Resource, Entity, Statement, RDF Triples
Configuration:
• Inference Rules
• Data Model Properties
• Deployment Properties
![Page 29: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/29.jpg)
Use Case 1: Trentino Media
Pipeline stages: Content acquisition → Resource preprocessing → Mention extraction → Coreference resolution → Linking & entity creation
• Goal: acquire and load news and background knowledge
• Multimedia news
  – ~56 GB of textual news, images and videos in the Italian language
  – acquired from 4 news providers local to the Italian Trentino region
  – daily updated, since 1999

| Provider | News | Images | Videos |
|---|---|---|---|
| l’Adige | 733,738 | 21,525 | - |
| VitaTrentina | 33,403 | 14,198 | - |
| RTTR | 2,455 | - | 120 h |
| Fed. Cooperative | 1,402 | - | - |
| All | 770,998 | 35,723 | 120 h |
Coop. Trentina
![Page 30: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/30.jpg)
Use Case 1: Trentino Media
• Background knowledge
  – acquired from Italian Wikipedia, sport-related community sites, local and national level public administrations and government bodies
  – manual acquisition using ad-hoc Web site wrappers

| Topic | Contexts | Persons | Organizations | Avg. properties | Facts (triples) |
|---|---|---|---|---|---|
| sport | 136 | 8,570 | 191 | 3.81 | 192,115 |
| culture | 20 | 9,785 | 1 | 2.00 | 33,236 |
| justice | 7 | 354 | 10 | 2.16 | 1,575 |
| economy | 7 | 49 | 1,203 | 4.47 | 11,147 |
| education | 6 | 850 | 82 | 2.35 | 3,573 |
| politics | 535 | 8,402 | 319 | 4.64 | 98,780 |
| religion | 3 | 1,391 | 0 | 1.67 | 12,855 |
| total | 714 | 28,687 | 1,806 | 3.64 | 352,244 |
![Page 31: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/31.jpg)
Trentino Media: Resource Preprocessing
• Goal: normalize and annotate news so as to ease further processing
• Several operations:
  – Conversion of multimedia resources to common formats
  – Segmentation of complex news
    • separation of individual stories in a news broadcast
    • separation of figures and captions in complex XML news articles
  – Automatic Speech Recognition
  – Annotation of news with linguistic taggers
    • part-of-speech tagging, based on the TextPro tool [Pianta08]
    • temporal expression tagging, based on TextPro
    • key concept extraction, based on the KX tool [Pianta10]
![Page 32: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/32.jpg)
Trentino Media: Mention Extraction
• Mention extraction is based on the TextPro Mention Detection module
  – system based on the supervised training of a statistical model
  – training for the Italian language based on the Italian Content Annotation Bank (I-CAB) dataset [Magnini06]; measured F1 value: 82%
• Mention extraction statistics

| News provider | PER mentions | ORG mentions | GPE/LOC mentions | Total mentions |
|---|---|---|---|---|
| l’Adige | 5,387,994 | 3,100,994 | 3,052,011 | 11,540,999 |
| VitaTrentina | 144,486 | 100,789 | 136,611 | 381,886 |
| RTTR | 19,290 | 15,493 | 27,404 | 62,187 |
| Fed. Coop. | 14,404 | 12,731 | 8,513 | 35,648 |
| All | 5,566,174 | 3,230,007 | 3,224,539 | 12,020,720 |
![Page 33: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/33.jpg)
Trentino Media: Coreference Resolution
• Goal: put mentions referring to the same entity in a mention cluster
  – cross-document coreference is more complex, as mentions may belong to different news items

| Entity type | PER | ORG | GPE/LOC | Total |
|---|---|---|---|---|
| Mention clusters | 340,147 | 16,649 | 52,478 | 409,274 |

Example: mentions in two news texts clustered into Lorenzo Dellai (PERSON entity)
• “… il presidente della Provincia di Trento Lorenzo Dellai ha lodato l'iniziativa per molteplici aspetti ...” (… the president of the Province of Trento, Lorenzo Dellai, praised the initiative in many respects ...)
• “... questa mattina il presidente Dellai ha incontrato le parti sociali sulla manovra che la Giunta sta mettendo a punto … … e intanto l'Alto Adige è tra le regioni europee con il più basso tasso di disoccupazione” (... this morning president Dellai met the social partners about the budget package that the provincial government is finalizing … … and meanwhile Alto Adige is among the European regions with the lowest unemployment rate)
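To make the idea of a mention cluster concrete, here is a deliberately naive Python sketch that groups PER mentions across documents by their last name token. This heuristic is an assumption chosen for brevity and is not the coreference method used in the Trentino Media pipeline; the function name and the sample mentions are hypothetical.

```python
from collections import defaultdict


def cluster_by_last_token(mentions):
    """Group (document id, surface form) pairs by lower-cased last token."""
    clusters = defaultdict(list)
    for doc_id, surface in mentions:
        key = surface.split()[-1].lower()
        clusters[key].append((doc_id, surface))
    return dict(clusters)


# Hypothetical PER mentions coming from three different news items.
mentions = [
    ("news_1", "Lorenzo Dellai"),
    ("news_2", "Dellai"),
    ("news_3", "Valentino Rossi"),
]

clusters = cluster_by_last_token(mentions)
for key, members in clusters.items():
    print(key, members)
```

A heuristic like this merges namesakes (two different people called Dellai would end up in one cluster), which is one reason cross-document coreference is the harder case.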
![Page 34: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/34.jpg)
Trentino Media: Linking & Entity Creation
• Goal: link mention clusters to entities in the background knowledge and to external resources; create new entities from unlinked clusters
• Different types of linking are performed
  – linking to GeoNames toponyms of GPE/LOC clusters, using GeoCoder
  – linking to background knowledge entities of PER and ORG clusters
  – linking to Wikipedia pages of all the entities, using the WikiMachine tool
• New entities are created from unlinked clusters
• Linking and entity statistics:

| Entity type | PER | ORG | GPE/LOC | Total |
|---|---|---|---|---|
| Linked clusters | 5.03% | 7.96% | 48.64% | 10.74% |
| Linked mentions | 22.36% | 12.02% | 65.04% | 31.03% |
| Resulting entities | 321,713 | 17,129 | 52,478 | 421,320 |
![Page 35: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/35.jpg)
Trentino Media: Linking & Entity Creation
• Linking to background knowledge is context-driven [Tamilin10]
  – exploits the contextual organization of knowledge in the KNOWLEDGESTORE
  – 84.5% accuracy (i.e., correctly linked clusters) on a gold standard of 298 clusters
• The three steps:
  ➊ Map the textual context of each cluster mention to appropriate values of the contextual dimensions
  ➋ Select a ranked list of KNOWLEDGESTORE contexts that better match the chosen dimensional values
  ➌ Link the mention only to entities appearing in the selected formal contexts
[Figure: the mention “Dellai” (PER) with textual context topic: research, location: Trentino, time: 2009, matched against KNOWLEDGESTORE contexts given as (topic, location, time): (-, World, -); (Research, World, -); (Research, Italy, 2009); (Politics, Trento, 2009); (Politics, Civezzano, 2009); (Politics, Trentino, 2009); (Politics, Trentino, 2004). Entities shown in the contexts: EIT, L. Dellai (pres.), S. Dellai, FBK, L. Dellai (major).]
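The three steps can be sketched as a small ranking routine in Python. This is a simplified illustration and not the method of [Tamilin10]: contexts are scored by counting exactly matching dimensions, whereas a real system would presumably use the hierarchy between values (for instance Trentino inside Italy). The assignment of entities to contexts in `CONTEXTS` and the function names are hypothetical, since the slide residue does not show which entity sits in which context.

```python
# Hypothetical assignment of entities to (topic, location, time) contexts.
CONTEXTS = {
    ("politics", "Trentino", 2009): ["L. Dellai (pres.)"],
    ("research", "Italy", 2009): ["FBK"],
    ("politics", "Civezzano", 2009): ["S. Dellai"],
}


def rank_contexts(mention_ctx, contexts):
    """Step 2: order contexts by the number of matching dimension values."""
    def score(ctx):
        return sum(1 for a, b in zip(ctx, mention_ctx) if a == b)
    return sorted(contexts, key=score, reverse=True)


def link(surface, mention_ctx, contexts):
    """Step 3: link to the first matching entity in the best-ranked context."""
    for ctx in rank_contexts(mention_ctx, contexts):
        for entity in contexts[ctx]:
            if surface in entity:
                return entity, ctx
    return None


# Step 1 is assumed done: the mention context is already mapped to
# dimension values (these two value sets are hypothetical inputs).
print(link("Dellai", ("politics", "Trentino", 2009), CONTEXTS))
print(link("Dellai", ("politics", "Civezzano", 2009), CONTEXTS))
```

The same surface form "Dellai" resolves to different entities depending on the context of the news item, which is the behaviour the slide illustrates.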
![Page 36: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/36.jpg)
Other use cases
• USE CASES:
  – Car Industry news (2003-2013): 63K articles, 1.7M event instances, 445K actors, 63K places, 41K DBpedia entities and 46M triples.
  – TechCrunch (2005-2013): 43K articles, 1.6M event instances, 300K actors, 28K DBpedia entities and 24M triples.
  – FIFA World Cup: 200K documents, 9M events, 35K actors (30% pers, 70% org), 15K places, 6K dates and 136M triples.
  – Dutch House of Representatives Bank Inquiry: 1M documents (900K XML and 100K PDF), pending
• BENCHMARK DATA:
  – WikiNews: 19K English, 8K Italian, 7K Spanish and 1K Dutch. 69 Apple news documents for annotation.
  – ECB+: 43 topics and 482 articles from GoogleNews, extended with 502 GoogleNews articles for 43+ topics (similar but different event).
![Page 37: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/37.jpg)
![Page 38: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/38.jpg)
![Page 39: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/39.jpg)
TOP events per year
![Page 40: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/40.jpg)
TOP actors per year
![Page 41: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/41.jpg)
TOP places per year
![Page 42: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/42.jpg)
CARS: WHERE & WHEN
![Page 43: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/43.jpg)
Use Case: Worldcup
• We processed 212,511 news articles (from BBC, Guardian, LexisNexis) in about 3 weeks
• We stored the resources and the results in the KnowledgeStore:
  – 45 GB of storage for news and processed mentions
  – 22 GB of storage for triples expressing statements on instances in RDF, e.g. who was involved, when it happened, where…
  – 136,075,006 triples from news (event statements)
  – 104,595,567 triples from background data (DBpedia statements)
![Page 44: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/44.jpg)
Use Case: Worldcup

| | Events | Actors | Places | Time |
|---|---|---|---|---|
| Mention | 25,470,763 | 3,590,012 (58% org., 42% pers.) | 2,279,939 | 1,828,134 |
| Instance | 9,387,356 | 51,283 dbp (30% org., 70% pers.) | 15,219 dbp | 5,961 |

33,123 events involving Blatter
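One way to read the mention and instance counts above is as an average number of mentions per instance, which gives a rough sense of how much coreference collapses mentions into single instances. The sketch below computes that ratio per category. It treats the "dbp" figures for actors and places as the full instance counts, which is an assumption: if they cover only DBpedia-linked instances, the true ratios are lower.

```python
# Counts taken from the slide: textual mentions vs. distinct instances.
mentions = {
    "events": 25_470_763,
    "actors": 3_590_012,
    "places": 2_279_939,
    "time": 1_828_134,
}
instances = {
    "events": 9_387_356,
    "actors": 51_283,   # "dbp" figure, assumed here to be the full instance count
    "places": 15_219,   # "dbp" figure, same assumption
    "time": 5_961,
}

# Average number of mentions that refer to one instance, per category.
ratios = {kind: mentions[kind] / instances[kind] for kind in mentions}

for kind, ratio in ratios.items():
    print(f"{kind:>6}: {ratio:7.1f} mentions per instance")
```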
![Page 45: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/45.jpg)
What Blatter does…
(count, verb; sorted by frequency)
1143 say, 523 have, 440 make, 408 add, 370 tell, 272 give, 260 take, 254 meet, 232 insist, 209 want, 208 comment, 196 suggest, 191 stand,
188 go, 180 visit, 180 call, 172 get, 169 come, 162 do, 162 challenge, 160 vote, 145 write, 140 play, 139 ask, 138 run, 135 claim,
133 use, 132 announce, 131 look, 129 win, 129 try, 127 support, 125 confirm, 122 attend, 121 believe, 119 describe, 117 re-elect, 116 speak, 116 seek
![Page 46: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/46.jpg)
KS Web site: https://knowledgestore.fbk.eu
§ Download source code, selected data
§ Documentation, demo, video
![Page 47: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/47.jpg)
SemEval 2015 Task T4 - TimeLine: Cross-Document Event Ordering
- Task description: build an ordered list of events involving a specific pre-selected entity
- Trial and test data: Wikinews
- Evaluation: (i) time anchors + ordering, (ii) only ordering
- Subtasks: (i) timeline on raw texts, (ii) timeline on texts annotated with events
Register to the TimeLine task Google Group: semeval-task4-timeline
Organizers: A-L. Minard, E. Agirre, I. Aldabe, M. van Erp, B. Magnini, G. Rigau, M. Speranza, R. Urizar
http://www.newsreader-project.eu
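The core of the TimeLine task, an ordered list of events involving a pre-selected entity, can be illustrated with a minimal sketch. This is not the official task format or any participant's system: the `Event` class, the `build_timeline` function, and the placeholder entities and dates are all hypothetical, and the sketch assumes events have already been extracted and, where possible, given a time anchor.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: a label, its participants, an optional time anchor."""
    label: str
    entities: Tuple[str, ...]
    anchor: Optional[date]  # None when no time anchor could be assigned


def build_timeline(events: List[Event], target: str) -> List[Event]:
    """Return the events involving `target`, ordered by time anchor.

    Events without an anchor are appended at the end in their input order.
    """
    involved = [e for e in events if target in e.entities]
    anchored = sorted(
        (e for e in involved if e.anchor is not None),
        key=lambda e: e.anchor,
    )
    unanchored = [e for e in involved if e.anchor is None]
    return anchored + unanchored


# Placeholder data, invented purely for illustration.
events = [
    Event("announce", ("Entity A", "Entity B"), date(2014, 3, 2)),
    Event("meet", ("Entity A",), date(2014, 1, 15)),
    Event("visit", ("Entity B",), date(2014, 2, 1)),
    Event("comment", ("Entity A",), None),
]

timeline = build_timeline(events, "Entity A")
for position, event in enumerate(timeline, start=1):
    print(position, event.anchor, event.label)
```

Evaluating "only ordering" would compare just the sequence of events, while "time anchors + ordering" would also compare the `anchor` values.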
![Page 48: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/48.jpg)
Conclusion
• Our contributions
– integration of techniques and technologies from different fields (NLP, Semantic Web, Machine Learning, …)
– large scale application (some use cases)
– tight interlinking of knowledge and multimedia
• Further research directions
– Improve event extraction
– Exploit already stored knowledge for event cross-document co-reference
– Incremental population of the KnowledgeStore
– Mixed retrieval, both from entities (SPARQL) and mentions (textual indexing)
![Page 49: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/49.jpg)
References
2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: Anchoring Background Knowledge to Rich Multimedia Contexts in the KnowledgeStore. In: New Trends of Research in Ontologies and Lexical Resources, edited by Federica Corradi Dell’Acqua, Qin Lu, Piek Vossen, Alessandro Oltramari and Eduard Hovy, Springer.
2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: TrentinoMedia: Exploiting NLP and Background Knowledge to Browse a Large Multimedia News Store. PAI 2012, Popularizing Artificial Intelligence, Rome, June 15, 2012.
2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: The KnowledgeStore: an Entity-Based Storage System. LREC 2012.
2013: F. Corcoglioniti, M. Rospocher, R. Cattoni, B. Magnini, L. Serafini: Interlinking Unstructured and Structured Knowledge in an Integrated Framework. Proceedings of the Seventh IEEE International Conference on Semantic Computing, ICSC 2013, September 16-18, 2013, Irvine, California, USA.
![Page 50: The KNOWLEDGESTORE: an Integrated Framework for …€¦ · The KNOWLEDGESTORE: an Integrated Framework for Ontology Population ... NAF format in NewsReader ... nif:beginIndex: int](https://reader036.fdocuments.net/reader036/viewer/2022062908/5ac702347f8b9a2b5c8e92eb/html5/thumbnails/50.jpg)
References
[Buscaldi10] Buscaldi, D., Magnini, B.: Grounding toponyms in an Italian local news corpus. In: Proc. of 6th Workshop on Geographic Information Retrieval. pp. 15:1–15:5. GIR ’10 (2010)
[Homola10] Homola, M., Tamilin, A., Serafini, L.: Modeling contextualized knowledge. In: Proc. of 2nd Int. Workshop on Context, Information And Ontologies. CIAO ’10, vol. 626 (2010)
[Magnini06] Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R.: I-CAB: the Italian Content Annotation Bank. In: Proc. of 5th Int. Conf. on Language Resources and Evaluation, LREC ’06 (2006)
[Pianta08] Pianta, E., Girardi, C., Zanoli, R.: The TextPro tool suite. In: Proc. of 6th Int. Conf. on Language Resources and Evaluation. LREC ’08 (2008)
[Pianta10] Pianta, E., Tonelli, S.: KX: A flexible system for keyphrase extraction. In: Proc. of 5th Int. Workshop on Semantic Evaluation, SemEval ’10, pp. 170–173 (2010)
[Zanoli12] Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering person names. In: Proc. of Evalita 2011 – Evaluation of NLP and Speech Tools for Italian (2012), to appear
[Tamilin10] Tamilin, A., Magnini, B., Serafini, L.: Leveraging entity linking by contextualized background knowledge: A case study for news domain in Italian. In: Proc. of 6th Workshop on Semantic Web Applications and Perspectives. SWAP ’10 (2010)