DBpedia Linker

Post on 28-Nov-2014


Interlinking BBC CIS concepts with DBpedia. Learn more about our work at http://mes-semantics.com


Christian Becker: DBpedia Linker. London, September 4, 2008

Christian Becker, Chris Bizer, Georgi Kobilarov

Freie Universität Berlin

DBpedia Linker

Interlinking CIS concepts with DBpedia


Hello

Name: Christian Becker

Job: Partner, MES (consulting on media-centric solutions); PhD student at Freie Universität Berlin

Semantic Web projects:
- DBpedia’s Geo and Homepage Extractors
- DBpedia Mobile and Marbles Browser
- flickr™ wrappr


DBpedia/Wikipedia as a Common Vocabulary

Better link between BBC properties

Better link externally

Better find and integrate BBC content elsewhere; leverage BBC metadata


Better link between BBC properties


Better link externally

BBC properties can be enriched with information from Wikipedia articles as well as content connected to them


Better find and integrate BBC content elsewhere


[Diagram: DBpedia shown together with BBC content areas: Programmes, Music, Topics, Users, Events, News, Food, Gardening]


[Diagram: link status between BBC datasets and external sources. Labels shown: BBC Programmes, BBC Topics, BBC Music (✔), DBpedia (TODAY!), MusicBrainz, BBC News etc. (FUTURE)]


BBC Topics: CIS Taxonomy

Core datasets:
- 6,630 brands
- 55,943 locations
- 73,442 names
- 11,231 subjects

Preferred and alternative labels

Tree hierarchy expressed in SKOS

Implicit hierarchy in parenthetical qualifiers, e.g. Jane Seymour (actor)
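The parenthetical qualifier can be separated from the base label with a small amount of string handling. The following is a minimal sketch, not the tool's actual code; the function name `split_label` and the regular expression are assumptions made for illustration.

```python
import re

# Assumed pattern: a single trailing parenthetical, e.g. "Jane Seymour (actor)".
_QUALIFIER = re.compile(r"^(?P<label>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_label(label):
    """Return (base_label, qualifier); qualifier is None when there is none."""
    match = _QUALIFIER.match(label)
    if match is None:
        return label.strip(), None
    return match.group("label").strip(), match.group("qualifier").strip()


if __name__ == "__main__":
    print(split_label("Jane Seymour (actor)"))
    print(split_label("Hinchingbrooke Country Park"))
```

The qualifier ("actor") can then be treated as an extra class of the concept, alongside the explicit SKOS hierarchy.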


Results

Dataset    Total    Linked         Precision*  Recall*
Brand      6,630    1,267 (19%)    86%         41%
Location   55,943   11,316 (20%)   99%         77%
Name       73,442   22,341 (30%)   92%         67%
Subject    11,231   6,822 (61%)    92%         75%

* Against test set of 600 resources. Updated to reflect only cases where links are possible.
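To make the footnote concrete, here is one way precision and recall could be computed when only linkable concepts are counted. This is a sketch under assumptions: the slides do not show the evaluation code, and the data layout (dictionaries mapping a concept to a DBpedia URI or `None`) and the function name `precision_recall` are invented for illustration.

```python
def precision_recall(gold, predicted):
    """gold / predicted: dict mapping concept -> DBpedia URI, or None for "no link".

    Concepts whose gold value is None (no link possible) are ignored entirely.
    """
    linkable = {c: uri for c, uri in gold.items() if uri is not None}
    proposed = {c: predicted[c] for c in linkable if predicted.get(c) is not None}
    correct = sum(1 for c, uri in proposed.items() if uri == linkable[c])
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(linkable) if linkable else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = {"a": "dbpedia:A", "b": "dbpedia:B", "c": "dbpedia:C", "d": None}
    predicted = {"a": "dbpedia:A", "b": "dbpedia:X", "d": "dbpedia:D"}
    print(precision_recall(gold, predicted))
```

In this toy input, concept "d" has no possible link, so the link proposed for it affects neither figure.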


Why so few links...?

Many concepts simply don’t have their own Wikipedia articles:

Brands
- “Mind the baby, Mr Bean” is in Wikipedia’s “List of Mr. Bean episodes”
- “Face to face (BBC Radio Gloucestershire community programme)” (not the BBC TV series!)

Locations
- “West Woods (Wiltshire)”
- “Hobhole Drain” (notable mention in “List of rivers of England”)
- “Hinchingbrooke Country Park”

Names
- “The Jolly Anker (pub, Northampton)”
- “Moulton Players (drama group)”
- “Halliwell, Jo (BBC Leeds volunteer for Fat Nation)”

Subjects
- “Agricultural Statistics”

We think that important concepts are largely linked!


Linking Approach

Automated linking: Tradeoff between quality and quantity

We wanted high-quality links

Limited input - only labels and hierarchy

Problems:
- No correspondences
- Differing labels
  - Word stemming
  - Determining term nearness using Lucene’s scorer
  - Integrating Wikipedia redirects to find alias labels
- Ambiguities
  - Sorting by number of inter-wiki references
  - DBpedia class restrictions
  - Class equivalence
  - Require exact matches


Poor man’s PageRank

[Diagram: the article “Bill Clinton” (30,000 references) is referenced from articles such as “Democratic Party”, “Hillary Clinton”, “United States”, “List of United States Presidents”, ...]

Lucene boost factor = number of articles from which an article is referenced
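The boost factor described here amounts to counting, for each article, how many distinct articles link to it. A minimal sketch follows, assuming the link data is available as (source, target) pairs; the function names `inbound_counts` and `rank_candidates` are hypothetical, and how the count was actually fed into Lucene is not shown on the slides.

```python
from collections import Counter


def inbound_counts(links):
    """links: iterable of (source_article, target_article) pairs."""
    # Count each referencing article once per target, even if it links repeatedly.
    unique_links = set(links)
    return Counter(target for _source, target in unique_links)


def rank_candidates(candidates, counts):
    """Order ambiguous candidates by inbound reference count, highest first."""
    return sorted(candidates, key=lambda article: counts[article], reverse=True)


if __name__ == "__main__":
    links = [
        ("Hillary Clinton", "Bill Clinton"),
        ("United States", "Bill Clinton"),
        ("United States", "Bill Clinton"),  # duplicate link, counted once
        ("Bill Clinton", "United States"),
    ]
    counts = inbound_counts(links)
    print(rank_candidates(["United States", "Bill Clinton"], counts))
```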


Integrating Redirects

[Diagram: “Bill Clinton” (30,000 references) shown with “William Blythe III” (200), “Buddy (Clinton's dog)” (5) and “Putting People First” (100)]

Redirects serve as alias labels. Their references count towards the redirection target.
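Folding redirect references into their target can be sketched as below. This assumes the three smaller labels on the slide are redirects to “Bill Clinton” and that redirects are single-hop; the function name `fold_redirects` and the dictionary layout are illustrative, not taken from the tool.

```python
def fold_redirects(counts, redirects):
    """counts: article -> inbound references; redirects: alias -> target article.

    Returns (totals, aliases): reference totals per target article, and the
    alias labels collected for each target. Redirect chains are not followed.
    """
    totals = {}
    for article, count in counts.items():
        target = redirects.get(article, article)
        totals[target] = totals.get(target, 0) + count
    aliases = {}
    for alias, target in redirects.items():
        aliases.setdefault(target, []).append(alias)
    return totals, aliases


if __name__ == "__main__":
    counts = {
        "Bill Clinton": 30000,
        "William Blythe III": 200,
        "Buddy (Clinton's dog)": 5,
        "Putting People First": 100,
    }
    redirects = {
        "William Blythe III": "Bill Clinton",
        "Buddy (Clinton's dog)": "Bill Clinton",
        "Putting People First": "Bill Clinton",
    }
    totals, aliases = fold_redirects(counts, redirects)
    print(totals["Bill Clinton"], aliases["Bill Clinton"])
```

With the numbers from the slide, the target's total becomes 30,000 + 200 + 5 + 100 = 30,305, and the three redirect titles become additional labels under which “Bill Clinton” can be found.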

Class Restrictions

“Mary (1985 sitcom)” = ?

[Diagram: candidate articles with their reference counts: “Mary (Holy Mother)” (50,000), “Something about Mary” (5,000), “The Mary Tyler Moore Show” (1,000), “Mary (1985 series)” (500). Class labels shown: imdb_title, Infobox album, Infobox television, Black and white films, ... Caption: “Brand” category set]


Class Equivalence

[Diagram: BBC CIS versus DBpedia. Labels shown: “Mary (1985 sitcom)”, “1985”, “tv brand”, “sitcom”, “Infobox television”, “1980s American television series”, (15 more). Candidates shown: “Something about Mary”, “The Mary Tyler Moore Show”, “Mary (1985 series)”]

Lucene query: ((+mary 1985 sitcom)) AND ((categories:Category\:1985_television_series_debuts))
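A query string of this shape can be assembled from the label terms and a set of allowed categories. The sketch below only builds the string; it is an illustration rather than the tool's code. The helper names `escape` and `build_query` are hypothetical, and the choice to mark only the first term as required (`+`) is an assumption read off the single example on the slide.

```python
import re

# Characters with special meaning in Lucene's classic query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape(term):
    """Backslash-escape Lucene query syntax characters in a term."""
    return _SPECIAL.sub(r"\\\1", term)


def build_query(label_terms, categories, field="categories"):
    """Combine label terms with a restriction to the given category values."""
    if not label_terms:
        raise ValueError("at least one label term is required")
    first, *rest = [escape(term.lower()) for term in label_terms]
    label_part = " ".join(["+" + first] + rest)
    category_part = " OR ".join(f"{field}:{escape(c)}" for c in categories)
    return f"(({label_part})) AND (({category_part}))"


if __name__ == "__main__":
    print(build_query(["Mary", "1985", "sitcom"],
                      ["Category:1985_television_series_debuts"]))
```

Escaping matters here because the category value itself contains a colon, which would otherwise be read as a field separator.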


Class Equivalence

About 5% boost in precision and recall (after class restrictions and exact matching)

Algorithm:
- Enrich class hierarchy using parenthetical qualifiers
- Perform label-based lookup on all items in the dataset and memorize result candidates
- Rank CIS classes against DBpedia classes
- Perform label lookup restricting results to the top 5, 10, 15% class equivalences, excluding the overall top 20% classes
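The ranking step can be sketched as a co-occurrence count between CIS classes and the DBpedia classes of the memorized candidates. This is one possible reading of the slide, not the authors' implementation: the input layout, the function name `class_equivalences`, and the interpretation of "top N%" and "overall top 20%" as fractions of ranked class lists are all assumptions.

```python
import math
from collections import Counter, defaultdict


def class_equivalences(observations, keep_fraction=0.10, exclude_fraction=0.20):
    """observations: iterable of (cis_class, dbpedia_classes_of_one_candidate).

    Returns, per CIS class, the best co-occurring DBpedia classes, after
    dropping the DBpedia classes that are most frequent overall (assumed to
    be too generic to discriminate).
    """
    overall = Counter()
    per_cis = defaultdict(Counter)
    for cis_class, dbpedia_classes in observations:
        # dict.fromkeys removes duplicates while keeping a stable order.
        for dbp_class in dict.fromkeys(dbpedia_classes):
            overall[dbp_class] += 1
            per_cis[cis_class][dbp_class] += 1

    n_excluded = math.floor(len(overall) * exclude_fraction)
    too_generic = {c for c, _ in overall.most_common(n_excluded)}

    result = {}
    for cis_class, counts in per_cis.items():
        ranked = [c for c, _ in counts.most_common() if c not in too_generic]
        n_keep = max(1, math.ceil(len(ranked) * keep_fraction)) if ranked else 0
        result[cis_class] = ranked[:n_keep]
    return result


if __name__ == "__main__":
    observations = [
        ("sitcom", ["Infobox television", "1985 television series debuts", "Living people"]),
        ("sitcom", ["Infobox television", "Living people"]),
        ("actor", ["Living people", "Infobox actor"]),
        ("actor", ["Living people", "Infobox actor"]),
        ("actor", ["Living people"]),
    ]
    print(class_equivalences(observations, keep_fraction=0.5, exclude_fraction=0.25))
```

The resulting class lists could then serve as the category restriction in a second, narrower label lookup.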


The Linkage tool

Written in Java, uses Lucene indexes prepared in C#

Command line interface with link and benchmark modes

Components:
- Apache Lucene search
- OpenRDF Sesame (native storage)

Dataset-specific algorithm choice and parameters

Next step: General Linking Interface


Future Directions

Improve quality / quantity
- Text-level comparison of content relating to the CIS concepts with Wikipedia articles
- Manual review based on confidence score

General Interlinking Framework
- Describe input data
- Select algorithms
- Link!

Add non-existent resources to DBpedia
- Wikipedia requires quality content according to Wikipedia guidelines
- Idea: a “Minipedia” that serves as an additional source to DBpedia


Thanks!

Questions?