Topic/S – A Topic and Trend Recognition Approach in News-Media, I-Semantics13

Sächsische AufbauBank Forschung und Entwicklung - Projektförderung Projektnummer - 99457/2677 Michael Aleythe, Martin Voigt, Peter Wehner


information extraction, modelling and storage of semantic data to recognize trending topics for journalism and newspaper offices


Structure

Motivation, Problems, and Goals

Topic/S Workflow

Demo

Conclusion

Friday, 06.09.2013 Topic/S Slide 1


Motivation

Newsroom

Source: ringier.com

Problem

Overwhelming amount of data

e.g., WAZ 5000 articles/day from agencies and in-house production

News agencies: DPA, Reuters, KNA

Web, social media: Twitter, Facebook, Blogs

In-house production: Archive, Online

Vision

Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)

Investigation of trending topics

Push them to the editor

Figure: pre-processing links Media Assets (MA1, MA2) to Named Entities (E1 to E7); post-processing groups the Named Entities into Topics (T1 to T3).

Structure

Motivation, Problems, and Goals

Topic/S Workflow

– Overview

– Information Extraction

– Storage

– Topic Detection

Demo

Conclusion


Workflow


Workflow: Preprocessor

Language Recognition (Ger/Eng): rule based

Named Entity Extraction: word list + statistics

Keyword Extraction: lemmatization, word list

Categorisation: source based

Source: onelanguageoneposter.com


Semantic Model


Semantic Facts

Named Entities required but no lists available

Stored preferred and alternative names

ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller

Names: Rene Muller, Rene Müller, René Muller, René Müller

Triples without SemItems: 27.6 million

SemItem | Number | With alt. names
Person | 1,504,341 | 2,499,962
Organization | 63,332 | 98,127
Place | 89,702 | 95,178
Keyword | 1,351 |
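The preferred and alternative names on the Semantic Facts slide can be read as a lookup from every spelling to one ID. The following is a minimal Python sketch of that idea, assuming the alternative spellings differ only in diacritics. The helpers `strip_diacritics` and `name_variants` and the in-memory `index` are hypothetical and not part of Topic/S.

```python
import itertools
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks, e.g. 'René' -> 'Rene', 'Müller' -> 'Muller'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(preferred: str) -> list:
    """All spellings obtained by keeping or stripping diacritics per token."""
    options = [sorted({tok, strip_diacritics(tok)}) for tok in preferred.split()]
    return [" ".join(combo) for combo in itertools.product(*options)]


# Hypothetical in-memory index: every spelling points to the same SemItem ID.
SEMITEM_ID = "http://www.topic-s.de/topics-facts/id/person/Rene_Muller"
index = {name: SEMITEM_ID for name in name_variants("René Müller")}
```

With this sketch, looking up `index["Rene Muller"]` and `index["René Müller"]` yields the same ID, which is the property the stored alternative names provide.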

Storage of Semantic Data

Using Oracle 11gR2

Pros

Already available, existing knowledge

Integrated querying of relational and semantic data

Cons

Inference

Incomplete SPARQL 1.1 support

Limited custom rule support

Benchmark of triple stores [Voigt2012]


Workflow: Topic Detection

Clustering

Figure: example SemItems grouped by clustering: Merkel, Politics, Highway, Traffic, Audi, Obama

Clustering (top clusters, 25.08.2013)

Articles | Name | HotTopic
43 | Bundesliga, Fußball, Spieltag, 1. FC Union Berlin, SC Paderborn 07 eV, FC Augsburg, FSV Frankfurt | Yes
25 | Euro, SPD, Berlin, Griechenland, FDP, CDU, Deutschland | Yes
19 | Bericht, Diplomat, Google Inc, Anbieter, Berlin, Deutschland, Auto | Yes
18 | Veranstaltung, Bernd Lucke, Angreifer, Berlin, Polizei, Angriff, Deutschland | Yes
15 | Gericht, Prozess, Bo Xilai, Christian Wulff, Anklage, Verfahren, Mord | Yes
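The slides do not name the clustering algorithm, so the following Python sketch only illustrates the general idea of grouping articles by shared SemItems. It assumes Jaccard similarity with single-link merging as a stand-in technique. The article IDs, the threshold and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of SemItems two articles have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.3) -> list:
    """Single-link clustering: link articles whose SemItem sets overlap enough."""
    parent = {aid: aid for aid in articles}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(articles, 2):
        if jaccard(articles[a], articles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for aid in articles:
        clusters.setdefault(find(aid), set()).add(aid)
    # Largest cluster first, like the "top cluster" listing.
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical articles annotated with SemItems from the example slide.
articles = {
    "a1": {"Merkel", "Politics", "Obama"},
    "a2": {"Merkel", "Politics"},
    "a3": {"Highway", "Traffic", "Audi"},
    "a4": {"Highway", "Traffic"},
    "a5": {"Obama", "Politics"},
}
clusters = cluster_articles(articles)
```

In this toy input the politics articles end up in one cluster and the traffic articles in another. How a cluster is flagged as a HotTopic is not described in the slides and is left out here.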


Live Demo


Sum it up!

Result

Identifying topics and pushing them to the editor

Lessons learned

NER: weak for non-English text, combination of word list and statistics required

model needs to be optimized for queries

dedicated user interface required

Outlook

prediction of topics with causal/temporal relations

Source: ooltapulta.com

Source: business-strategy-innovation.com

Thanks! Questions?


Workflow: Preprocessor


Named Entity Recognition

word list

Tool: LingPipe + Extension

Sources: LOD (DBpedia, GeoNames, YAGO2, GND)

Advantages: controlled vocabulary, guaranteed recognition of listed entities

statistics

Tool: Stanford NLP

Source: pre-trained model

Advantage: Recognition of unknown entities
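The combination of word-list and statistical recognition can be sketched as follows. This Python example does not use LingPipe or Stanford NLP. The gazetteer entries are hypothetical, and `capitalised_ner` is only a crude stand-in for a statistical tagger, included to show how results from both sides might be merged.

```python
import re

# Hypothetical word list: surface form -> SemItem type.
GAZETTEER = {"Angela Merkel": "Person", "Berlin": "Place", "SPD": "Organization"}


def gazetteer_ner(text: str) -> dict:
    """Word-list recogniser: finds only entities that are in the list."""
    found = {}
    for name, kind in GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", text):
            found[name] = kind
    return found


def capitalised_ner(text: str) -> dict:
    """Stand-in for a statistical tagger: runs of two or more capitalised words."""
    pattern = r"\b[A-ZÄÖÜ][\w-]+(?: [A-ZÄÖÜ][\w-]+)+"
    return {m.group(0): "Unknown" for m in re.finditer(pattern, text)}


def combined_ner(text: str, statistical=capitalised_ner) -> dict:
    """Merge both recognisers; the controlled vocabulary wins on conflicts."""
    entities = statistical(text)
    entities.update(gazetteer_ner(text))
    return entities


result = combined_ner("Angela Merkel met Bernd Lucke in Berlin.")
```

Here the word list contributes the typed entries for Angela Merkel and Berlin, while the stand-in tagger adds Bernd Lucke, which is not in the list. This mirrors the two advantages named on the slide.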

Source: churchthought.com

Categorization

Figure: an article from DPA is passed to the categoriser for its agency (Categoriser OTS, Categoriser DPA, Categoriser Reuters) and assigned an IPTC Media Topic, here Politics.
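Routing an article to one categoriser per agency could look like the sketch below. The slides do not say how the categorisers work internally, so `keyword_categoriser` is a hypothetical stand-in for them, and all keyword rules are invented for illustration.

```python
from typing import Callable, Dict, Optional

Categoriser = Callable[[str], Optional[str]]


def keyword_categoriser(rules: Dict[str, str]) -> Categoriser:
    """Build a stand-in categoriser: the first matching keyword decides the topic."""
    def categorise(text: str) -> Optional[str]:
        lowered = text.lower()
        for keyword, topic in rules.items():
            if keyword in lowered:
                return topic
        return None
    return categorise


# One categoriser per agency; the rules are made up for this example.
CATEGORISERS: Dict[str, Categoriser] = {
    "DPA": keyword_categoriser({"bundestag": "Politics", "bundesliga": "Sport"}),
    "Reuters": keyword_categoriser({"parliament": "Politics"}),
    "OTS": keyword_categoriser({"partei": "Politics"}),
}


def categorise_article(agency: str, text: str) -> Optional[str]:
    """Route the article to the categoriser of its source agency."""
    categoriser = CATEGORISERS.get(agency)
    return categoriser(text) if categoriser else None


topic = categorise_article("DPA", "Der Bundestag debattiert über den Euro.")
```

The dispatch table is the point of the sketch: each agency has its own conventions, so each gets its own categoriser, and an unknown agency returns no topic instead of a guess.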

Categorization - Quality

News agency | Accuracy
KNA | 80.3 %
DPA | 94.4 %
EPD | 80.3 %
Reuters | 90.8 %
OTS | 93.5 %
AFP | 86 %

Method | Accuracy
One categoriser for all agencies | 85 %
One categoriser per agency | 87.5 %
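As a quick plausibility check of the quality figures, the short Python calculation below averages the six per-agency accuracies. It assumes, without confirmation from the slides, that the per-agency method figure is an unweighted mean over agencies. The variable names are illustrative.

```python
# Per-agency accuracies in percent, as listed on the slide.
agency_accuracy = {
    "KNA": 80.3, "DPA": 94.4, "EPD": 80.3,
    "Reuters": 90.8, "OTS": 93.5, "AFP": 86.0,
}

# Unweighted mean over the six agencies (assumption, see lead-in).
mean_accuracy = sum(agency_accuracy.values()) / len(agency_accuracy)

# Reported gain of one categoriser per agency over a single shared one.
gain = 87.5 - 85.0
```

The mean comes out at about 87.55 %, close to the 87.5 % reported for one categoriser per agency, and the reported gain over a single shared categoriser is 2.5 percentage points.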

Keywords

Lemmatization

Developing a word list

Extraction using the word list

Bonus: frequent terms of an article
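The keyword steps above can be sketched in a few lines of Python. The lemma table, the word list and the stop words are tiny hypothetical examples. A real setup would rely on a full lemmatizer and the project's own word list.

```python
import re
from collections import Counter

# Hypothetical lemma table, keyword word list and stop words.
LEMMAS = {"prozesse": "prozess", "gerichte": "gericht"}
WORD_LIST = {"prozess", "gericht", "anklage"}
STOP_WORDS = {"die", "der", "mit", "zwei"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its lemma; unknown tokens stay unchanged."""
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3):
    """Return (keywords from the word list, most frequent terms of the article)."""
    tokens = [lemmatize(t) for t in re.findall(r"\w+", text.lower())]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    keywords = sorted({t for t in tokens if t in WORD_LIST})
    frequent = [t for t, _ in Counter(tokens).most_common(top_n)]
    return keywords, frequent


keywords, frequent = extract_keywords(
    "Die Gerichte eröffnen zwei Prozesse. Der Prozess beginnt mit der Anklage."
)
```

Lemmatization maps "Prozesse" and "Prozess" to the same entry, so the term is counted twice and matched against the word list once.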

Source: hugdaily.org

Disambiguation

Source: fansshare.com

Source: lounge.espdisk.com

Source: de.wikipedia.org

Problem: not all SemItems available in the LOD

Figure: the mention Michael Jackson with the context term Beer is compared against two entity clusters built from internal facts and external facts (DBpedia, etc.): Michael Jackson with Beer and Whiskey, and Michael Jackson with Music and King of Pop.

Identification of Entity Cluster
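One simple way to identify the matching entity cluster is to count how many context terms each candidate shares with the article. The Python sketch below assumes exactly that overlap score. The candidate labels, the context term "Brewery" and the function name are hypothetical, and the slides do not state which scoring Topic/S uses.

```python
from typing import Dict, Optional, Set

# Hypothetical candidate clusters for the ambiguous name "Michael Jackson".
CANDIDATES: Dict[str, Set[str]] = {
    "Michael Jackson (beer)": {"Beer", "Whiskey"},
    "Michael Jackson (music)": {"Music", "King of Pop"},
}


def disambiguate(context: Set[str], candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the candidate whose related SemItems overlap most with the context."""
    scores = {name: len(context & related) for name, related in candidates.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means the mention cannot be resolved from this context.
    return best if scores[best] > 0 else None


choice = disambiguate({"Beer", "Brewery"}, CANDIDATES)
```

With the context containing Beer, the beer-related cluster wins. A context that overlaps with neither cluster returns `None`, leaving the mention unresolved.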