Topic/S – A Topic and Trend Recognition Approach in News-Media, I-Semantics13
Sächsische AufbauBank, Research and Development project funding, project number 99457/2677
Michael Aleythe, Martin Voigt, Peter Wehner
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Conclusion
Friday, 06.09.2013 Topic/S Slide 1
Motivation
Newsroom
Source: ringier.com
Problem
Overwhelming amount of data
e.g., WAZ: 5,000 articles/day from agencies and in-house production
Sources
– News agencies: DPA, Reuters, KNA, …
– Web, social media: Blogs, …
– In-house production (archive, online)
Vision
Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor
[Figure: pre-processing links media assets (MA1, MA2) to named entities (E1 to E7); post-processing groups the named entities into topics (T1 to T3).]
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Information Extraction
– Storage
– Topic Detection
Demo
Conclusion
Workflow
Workflow: Preprocessor
Language Recognition (Ger/Eng): rule based
Named Entity Extraction: word list + statistics
Keyword Extraction: lemmatization, word list
Categorisation: source based
Source: onelanguageoneposter.com
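The slide only says that language recognition between German and English is rule based; it does not give the rules. One possible rule, sketched here purely as an assumption, is to count language-specific stop words. The word lists and the function name `detect_language` are hypothetical and far smaller than a production rule set would be.

```python
import re

# Hypothetical, deliberately tiny stop-word lists (assumption, not from the slides).
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "von"}
ENGLISH_HINTS = {"the", "and", "is", "not", "a", "an", "with", "of", "to", "in"}


def detect_language(text: str) -> str:
    """Return "de", "en", or "unknown" based on stop-word counts."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    german = sum(token in GERMAN_HINTS for token in tokens)
    english = sum(token in ENGLISH_HINTS for token in tokens)
    if german == english:
        return "unknown"  # no evidence, or a tie
    return "de" if german > english else "en"
```

A tie returns "unknown" rather than guessing, so that downstream steps can skip or flag the article.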
Semantic Model
Semantic Facts
Named Entities required but no lists available
Stored preferred and alternative names
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller
Triples without SemItems: 27.6 million
SemItem | Number (with alt. names)
Person | 1,504,341 (2,499,962)
Organization | 63,332 (98,127)
Place | 89,702 (95,178)
Keyword | 1,351
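The name variants listed for Rene Muller differ only in diacritics. The slides store each alternative name explicitly. As a swapped-in illustration (not the project's method), the sketch below folds diacritics with Unicode normalization so that all four spellings resolve to the same ID. The helper names `fold` and `lookup` are hypothetical.

```python
import unicodedata

# ID taken from the slide; everything else in this block is an illustrative assumption.
PERSON_ID = "http://www.topic-s.de/topics-facts/id/person/Rene_Muller"
NAMES = ["Rene Muller", "Rene Müller", "René Muller", "René Müller"]


def fold(name: str) -> str:
    """Lower-case a name and strip combining marks (é -> e, ü -> u)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Index from folded name to SemItem ID; the four variants collapse into one key.
NAME_INDEX = {fold(name): PERSON_ID for name in NAMES}


def lookup(surface_form: str):
    """Return the SemItem ID for a name as it appears in text, or None."""
    return NAME_INDEX.get(fold(surface_form))
```

Folding does not cover transliterations such as "Mueller" or the letter "ß", which is one reason an explicit list of alternative names can still be needed.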
Storage of Semantic Data
Using Oracle 11gR2
Pros
Already available, existing knowledge
Integrated querying of relational and semantic data
Cons
Inference
Incomplete SPARQL 1.1 support
Limited custom rule support
Benchmark of triple stores [Voigt2012]
Workflow: Topic Detection
Clustering
[Figure: example SemItems being clustered: Merkel, Politics, Highway, Traffic, Audi, Obama]
Workflow: Topic Detection
Clustering (Top Cluster 25.08.2013)
Articles | Name | HotTopic
43 | Bundesliga, Fußball, Spieltag, 1. FC Union Berlin, SC Paderborn 07 eV, FC Augsburg, FSV Frankfurt | Yes
25 | Euro, SPD, Berlin, Griechenland, FDP, CDU, Deutschland | Yes
19 | Bericht, Diplomat, Google Inc, Anbieter, Berlin, Deutschland, Auto | Yes
18 | Veranstaltung, Bernd Lucke, Angreifer, Berlin, Polizei, Angriff, Deutschland | Yes
15 | Gericht, Prozess, Bo Xilai, Christian Wulff, Anklage, Verfahren, Mord | Yes
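The slides do not name the clustering algorithm. As a sketch under that caveat, the code below groups articles by the overlap of their SemItem sets (Jaccard similarity, greedy single pass) and names each cluster after its most frequent SemItems, which is one way a name like those in the table could arise. The threshold, the function names, and the sample articles are all assumptions. How the HotTopic flag is set is not covered.

```python
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Share of SemItems two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.3) -> list:
    """Greedy single pass: join the first similar cluster, else open a new one."""
    clusters = []  # each cluster: {"ids": [...], "items": Counter of SemItems}
    for article_id, sem_items in articles.items():
        for cluster in clusters:
            if jaccard(sem_items, set(cluster["items"])) >= threshold:
                cluster["ids"].append(article_id)
                cluster["items"].update(sem_items)
                break
        else:
            clusters.append({"ids": [article_id], "items": Counter(sem_items)})
    clusters.sort(key=lambda c: len(c["ids"]), reverse=True)  # largest first
    return clusters


def cluster_name(cluster: dict, size: int = 7) -> str:
    """Name a cluster after its most frequent SemItems."""
    return ", ".join(item for item, _ in cluster["items"].most_common(size))


# Hypothetical sample: article ID -> SemItems found in pre-processing.
SAMPLE = {
    "a1": {"Bundesliga", "Fußball", "FC Augsburg"},
    "a2": {"Bundesliga", "Fußball", "FSV Frankfurt"},
    "a3": {"SPD", "CDU", "Euro"},
    "a4": {"SPD", "FDP", "Euro", "Griechenland"},
    "a5": {"Bundesliga", "Spieltag", "Fußball"},
}
TOP_CLUSTERS = cluster_articles(SAMPLE)
```

A greedy pass depends on article order. An agglomerative or graph-based method would avoid that at higher cost, which may matter at several thousand articles per day.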
Live Demo
Sum it up!
Result
Identifying topics and pushing them to the editor
Lessons learned
NER: weak for non-English text; a combination of approaches is required
model needs to be optimized for queries
dedicated user interface required
Outlook
prediction of topics with causal/temporal relations
Source: ooltapulta.com
Source: business-strategy-innovation.com
Thanks! Questions?
Workflow: Preprocessor
Named Entity Recognition
Word list
– Tool: LingPipe + Extension
– Sources: LOD (DBpedia, GeoNames, YAGO2, GND)
– Advantages: controlled vocabulary, guaranteed recognition of entities
Statistics
– Tool: Stanford NLP
– Source: pre-trained model
– Advantage: recognition of unknown entities
Source: churchthought.com
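The two approaches above complement each other: the word list guarantees hits for known entities, and the statistical model finds unknown ones. The sketch below shows one way to combine them, keeping every word-list hit and adding statistical hits only where they do not overlap. It does not call LingPipe or Stanford NLP. The gazetteer, the `Entity` type, and the merge rule are hypothetical, and the statistical spans are assumed to arrive as plain token ranges.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int   # token index, inclusive
    end: int     # token index, exclusive
    label: str
    source: str  # "wordlist" or "statistical"


# Hypothetical gazetteer: token tuple -> entity type.
GAZETTEER = {
    ("Angela", "Merkel"): "Person",
    ("Berlin",): "Place",
    ("FC", "Augsburg"): "Organization",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def wordlist_entities(tokens: list) -> list:
    """Longest-match lookup of token n-grams in the gazetteer."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(tokens[i:i + n]))
            if label:
                found.append(Entity(i, i + n, label, "wordlist"))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


def combine(wordlist: list, statistical: list) -> list:
    """Keep all word-list hits; add statistical hits that do not overlap them."""
    merged = list(wordlist)
    for cand in statistical:
        if all(cand.end <= e.start or cand.start >= e.end for e in wordlist):
            merged.append(cand)
    return sorted(merged)


TOKENS = "Angela Merkel met Bernd Lucke in Berlin".split()
# Stand-in for the output of a statistical tagger (assumption).
STATISTICAL = [Entity(3, 5, "Person", "statistical"), Entity(6, 7, "Place", "statistical")]
COMBINED = combine(wordlist_entities(TOKENS), STATISTICAL)
```

Giving the word list priority on overlap keeps the controlled vocabulary authoritative, while names missing from the list, such as "Bernd Lucke" in this sample, still come through from the statistical side.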
Workflow: Preprocessor
Categorization
[Figure: an article is categorised by a source-specific categoriser (Categoriser OTS, Categoriser DPA, Categoriser Reuters); the example shows a DPA article mapped to the IPTC Media Topic "Politics".]
Workflow: Preprocessor
Categorization - Quality
News agency | Accuracy
KNA | 80.3 %
DPA | 94.4 %
EPD | 80.3 %
Reuters | 90.8 %
OTS | 93.5 %
AFP | 86 %
Method | Accuracy
One categoriser for all agencies | 85 %
One categoriser per agency | 87.5 %
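As a quick arithmetic check, the unweighted mean of the six per-agency accuracies can be computed directly. It comes out near the figure reported for one categoriser per agency. Whether the slide's value was derived this way is an assumption. A mean weighted by article volume per agency would generally differ.

```python
from statistics import mean

# Per-agency accuracies in percent, as listed on the slide.
PER_AGENCY_ACCURACY = {
    "KNA": 80.3,
    "DPA": 94.4,
    "EPD": 80.3,
    "Reuters": 90.8,
    "OTS": 93.5,
    "AFP": 86.0,
}

# Unweighted mean: 525.3 / 6, i.e. about 87.55 percent.
AVERAGE_ACCURACY = mean(list(PER_AGENCY_ACCURACY.values()))
```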
Workflow: Preprocessor
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article
Source: hugdaily.org
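The keyword steps above can be sketched as follows: tokens are lemmatized, matched against the word list, and counted to surface the frequent terms of an article. The lemma table, word list, and stop-word set are tiny hypothetical stand-ins. A real system would use a proper lemmatizer, and the slides do not say which one was used.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a lemmatizer, the curated word list, and stop words.
LEMMAS = {"angriffe": "angriff", "gerichte": "gericht", "prozesse": "prozess"}
KEYWORDS = {"angriff", "gericht", "prozess", "polizei"}
STOPWORDS = {"der", "die", "das", "und", "ein", "eine"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its lemma; unknown tokens pass through."""
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3):
    """Return (word-list keywords found, most frequent terms) for one article."""
    tokens = [lemmatize(t) for t in re.findall(r"\w+", text.lower())]
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    keywords = sorted(term for term in counts if term in KEYWORDS)
    frequent = [term for term, _ in counts.most_common(top_n)]
    return keywords, frequent
```

Lemmatizing before the lookup means the word list only needs base forms, which matters for a highly inflected language such as German.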
Disambiguation
Source: fansshare.com
Source: lounge.espdisk.com
Source: de.wikipedia.org
Disambiguation
Problem: not all SemItems available in the LOD
[Figure: a mention of "Michael Jackson" in the context of Beer is compared with two candidate entities, one associated with Beer and Whiskey, the other with Music and King of Pop. The associations come from internal facts and external facts (DBpedia, etc.).]
Identification of Entity Cluster
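The Michael Jackson example suggests choosing the candidate whose known facts best match the SemItems around the mention. The sketch below scores candidates by simple set overlap. This is an assumption about how an entity cluster might be identified, not the project's documented method. The candidate labels "(writer)" and "(singer)" and the function name are hypothetical.

```python
# Hypothetical candidates: label -> facts from internal and external sources.
CANDIDATES = {
    "Michael Jackson (writer)": {"Beer", "Whiskey"},
    "Michael Jackson (singer)": {"Music", "King of Pop"},
}


def disambiguate(context_items: set, candidates: dict = CANDIDATES):
    """Return the candidate with the largest fact overlap, or None if unclear."""
    scores = {name: len(facts & context_items) for name, facts in candidates.items()}
    best = max(scores.values(), default=0)
    winners = [name for name, score in scores.items() if score == best]
    # No overlap at all, or a tie between candidates: leave the mention unresolved.
    return winners[0] if best > 0 and len(winners) == 1 else None
```

Returning `None` for ties or empty overlap leaves room for the case the slide raises, where a SemItem is not available in the LOD and only internal facts can help.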