High quality, low maintenance content tagging @ ZEIT Online Breno Faria, Christoph Goller

About Us

(c) 2014 I IntraFind Software AG 2

IntraFind Software AG Elasticsearch Partner (we also do consulting)

Specialist for Information Retrieval and Text Analytics

Founded 2000, 30 employees

More than 850 customers mainly in Germany, Austria, and Switzerland

Lucene Committers: B. Messer, C. Goller

Independent Software Vendor, entirely self-financed

Products are a combination of Open Source Components and in-house Development

High quality Linguistic Analyzers for most European Languages (also available as Solr and Elasticsearch plugins)

Named Entity Recognition

Text Classification

Tagging Service – extraction of semantic metadata

Outline

1. The ZEIT Online Project 2010: tagging the archive and making it searchable

2. Editorial Workflow @ ZEIT Online

3. Feedback from the Editors

4. Meeting the Expectations

The ZEIT Online Project

Die ZEIT is a weekly newspaper founded in 1946 and one of the most renowned in Germany

ZEIT Online, the web edition, has existed since 1996

2010: organize the entire archive based on semantic metadata and make it searchable

Persons, locations and organizations mentioned

Statistically significant keywords

Classification into the corresponding department

Amazingly, there is an API for accessing this tagged content! See developer.zeit.de

Editorial Workflow @ ZEIT Online

Second step in the project was to integrate the content tagging system into the editorial workflow @ ZEIT Online

It's not as simple as that

Keywords will be visible to humans! You cannot rely on a robot's good judgement and publish everything that comes out…

Ever heard of "inter-indexer consistency"? Letting every editor choose keywords freely probably wouldn't work

Solution:

a curated list of allowed keywords

AND the editor picks a subset of the allowed keywords for the article

Curating the keyword list is expensive

… and so is going through large lists of keyword candidates. We want to solve this problem

Feedback from the editorial staff

Tradeoff: relevance vs. completeness

generic better than specific (Stuxnet vs. Stuxnet-Virus)

expand to similar keywords (Prism, NSA)

no 'stop-keywords' (e.g. Angela Merkel)

no out-of-context keywords

consider trends!

all possible keywords, don't miss anything!

Oh, and please don't make us work more with your changes.

Meeting the Expectations

Provide a perfect ranking of keywords

This allows us to present only the relevant keywords to the editor

… and we still have all possible keywords for the archive

Meeting the Expectations: Baseline Scoring

First problem: how do we compare apples and bananas? (different sorts of entities and keywords)

We compute the document hit count in the archive by searching for each tag found

We can rely on our linguistic analyzers to account for different forms of the same tag, e.g. „Bundeswirtschaftsminister“ == „Bundesminister für Wirtschaft“ (both mean "Federal Minister for Economic Affairs")

Use a Lucene Similarity to compute the TF-IDF of each tag

Caveat: this might hurt context
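As a rough illustration of the baseline score, the sketch below turns a tag's frequency in the article and its archive hit count into a TF-IDF value. The function name `tfidf_score`, the example counts, and the exact formula (square-root term frequency, logarithmic inverse document frequency) are assumptions for illustration; the slides do not say which Lucene Similarity or parameters were used.

```python
import math


def tfidf_score(term_freq: int, doc_hits: int, archive_size: int) -> float:
    """TF-IDF of one tag: frequent in the article and rare in the archive scores high."""
    if term_freq <= 0:
        return 0.0
    tf = math.sqrt(term_freq)
    # +1 in the denominator avoids division by zero for tags with no archive hits.
    idf = 1.0 + math.log(archive_size / (doc_hits + 1))
    return tf * idf


# Hypothetical counts: (occurrences in the article, archive documents hit by the tag).
candidates = {
    "Angela Merkel": (3, 50_000),
    "Stuxnet": (3, 120),
}
archive_size = 400_000  # hypothetical number of archive documents

scores = {
    tag: tfidf_score(freq, hits, archive_size)
    for tag, (freq, hits) in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because every tag type (person, location, organization, keyword) is scored from the same two counts, the values are comparable across types, which is the point of the "apples and bananas" question.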

Meeting the Expectations: Context Scoring

Idea: compare the document with other documents containing a particular tag

Compute the typical contexts of the tag

These contexts are a kind of prototypical document for all documents containing the keyword

We compare the current context with this prototypical context, i.e. we compute a similarity

We can use the same method to expand our tags with related keywords!
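One way to picture the prototype comparison is the sketch below. It assumes a plain bag-of-words representation, an averaged term vector as the "prototypical document", and cosine similarity as the comparison; none of these choices is confirmed by the slides, and the helper names and toy texts are invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Naive whitespace tokenization; a real system would use linguistic analysis.
    return Counter(text.lower().split())


def prototype(docs: list[str]) -> dict[str, float]:
    """Average term vector of all documents that contain a given tag."""
    if not docs:
        return {}
    total = Counter()
    for doc in docs:
        total.update(bag_of_words(doc))
    return {term: count / len(docs) for term, count in total.items()}


def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy archive documents containing the tag "Schweinsteiger" (invented text).
archive_docs = [
    "schweinsteiger scores goal in the match",
    "schweinsteiger injured before the match",
]
proto = prototype(archive_docs)

football_article = bag_of_words("the match ended after a late goal")
politics_article = bag_of_words("parliament votes on the budget")

in_context = cosine(football_article, proto)
out_of_context = cosine(politics_article, proto)
print(in_context > out_of_context)
```

Under the same assumptions, the highest-weighted terms of a prototype could serve as candidates for related keywords.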

Meeting the Expectations: Trend Scoring

But what if the mention of "Schweinsteiger" is not incidental? Maybe it's World Cup time?

In our case, a trend is a measure of how hit counts vary across timespans

We can compute trends from our archive by counting hits in different timespans
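A minimal sketch of such a trend computation follows. It assumes one archive query per span, each returning the hit count from now back a given number of days, and it defines a trend as the difference in hits per day between a span and the older stretch up to the next longer span. That definition, the function name `trend_values`, and the counts are assumptions; the slides only say that trends come from hit counts in different timespans.

```python
def trend_values(span_hits: list[tuple[int, int]]) -> list[float]:
    """span_hits: (days back from now, hits within that span), shortest span first.

    Returns one value per pair of neighbouring spans: hits per day in the
    shorter (recent) span minus hits per day in the older remainder of the
    longer span. N spans therefore yield N-1 trend values.
    """
    trends = []
    for (near_days, near_hits), (far_days, far_hits) in zip(span_hits, span_hits[1:]):
        recent_rate = near_hits / near_days
        older_rate = (far_hits - near_hits) / (far_days - near_days)
        trends.append(recent_rate - older_rate)
    return trends


# Hypothetical hit counts for "Schweinsteiger" in the last 4, 8 and 16 days.
print(trend_values([(4, 40), (8, 48), (16, 64)]))  # [8.0, 4.0]
```

Positive values mean the tag is mentioned more often recently than before, which would argue for ranking it higher even when its baseline score is low.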

Meeting the Expectations: Consolidating Scores

We combine the scores by

1. Individually scaling them onto the same interval

2. Multiplying each one by a weight

3. Summing up and again scaling the result

There's a lot to configure, and there is no such thing as the perfect configuration

ZEIT Online has the freedom to fine-tune the ranking
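The three combination steps can be sketched as follows. Min-max scaling onto [0, 1], the score names, the weights, and all numbers are assumptions chosen for illustration; the slides do not state which scaling or weights were used.

```python
def scale(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale onto [0, 1]; a constant score carries no ranking signal and maps to 0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {tag: 0.0 for tag in values}
    return {tag: (v - lo) / (hi - lo) for tag, v in values.items()}


def combine(scores: dict[str, dict[str, float]], weights: dict[str, float]) -> dict[str, float]:
    # Step 1: scale each score type individually onto the same interval.
    scaled = {name: scale(per_tag) for name, per_tag in scores.items()}
    tags = next(iter(scores.values())).keys()
    # Steps 2 and 3: multiply by the weight and sum up per tag.
    summed = {
        tag: sum(weights[name] * scaled[name][tag] for name in scaled)
        for tag in tags
    }
    # Step 3 (continued): scale the result again.
    return scale(summed)


# Hypothetical raw scores for three candidate tags.
scores = {
    "baseline": {"Stuxnet": 15.8, "Angela Merkel": 5.3, "NSA": 9.0},
    "context": {"Stuxnet": 0.9, "Angela Merkel": 0.2, "NSA": 0.7},
    "trend": {"Stuxnet": 1.0, "Angela Merkel": 0.0, "NSA": 4.0},
}
weights = {"baseline": 1.0, "context": 2.0, "trend": 0.5}  # the tunable part

final = combine(scores, weights)
ranking = sorted(final, key=final.get, reverse=True)
print(ranking)
```

In a setup like this, the weights are the knobs an editorial team could adjust to fine-tune the ranking.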

Summary

An editorial office's requirements for a tagging system are complex

Tradeoff between relevance and completeness of tags

You need both. We can solve this problem the same way information retrieval systems do: with ranking

There is a lot one can do to enrich tags just by looking at a representative archive

Thanks for Listening

Thanks to Ron Drongowski and the ZEIT Online team!

Breno Faria (@brealbfar) & Christoph Goller (@ChGoller)

Phone: +49 89 3090446-0

Fax: +49 89 3090446-29

Email: {christoph.goller,breno.faria}@intrafind.de

Web: www.intrafind.de

IntraFind Software AG

Landsberger Straße 368

80687 München

Germany

The persons graph and most screen-shots are copyright material of ZEIT Online.

[Figure (Trend Scoring): timeline running back from NOW to -4d, -8d, -16d, -32d and -64d, labelled with hit counts n16, n32, n64 and the differences n32 - n16 and n64 - n32. N spans, N queries, N-1 trends.]