NoTube: Models & Semantics

Monday, March 26, 2012

WP1 Overview

• “Backend” shared datasets and services
• Mappings, integration and common vocabulary
• Extra datasets to support use-case scenarios



WP1: Year 3 Direction & Achievements

• Moving from single ‘warehouse’ to distributed set of databases, datasets and services

• Planning for sustainable life-after-project
• Integrating feedback from end-to-end demos



Why WP1? Two roles

• NoTube internal: a hub for data sharing
• NoTube external: show how shared datasets and vocabularies help with user-facing “Web and TV” problems

• “Show”, critically, includes “thinking out loud” as we explore, via blog, email, Twitter etc.
– scholarly articles rarely reach our target audiences



Outreach message

• Let metadata flow widely - advertising content, rather than be a hidden asset

• Identify and link content with useful URLs(*)
• Open APIs to control TV and link devices [WP7c]


...from W3C TV & Web position paper (with Project Baird), Berlin 9 Feb 2011

WP1 concerned primarily with the first two: getting metadata into the Web from source, rather than scraping, guessing, approximating.


Aside: RDFa went mainstream

• Try ‘View source’ on IMDB, Rotten Tomatoes, BBC, tv.com sites to find RDF descriptions of TV content.

• NoTube’s approach was to lead by example, to engage with industry and to plan from the beginning for the ‘afterlife’.

• This strategy worked.


Facebook OGP

tv.com 'The Wire' page

...simple, extensible standards are being adopted

OGP since 2010; schema.org since 2011...
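
To make the ‘View source’ point concrete, here is a minimal sketch of reading Open Graph properties from a page head. The sample HTML and the example.org URL are invented for illustration and are not copied from tv.com or any real site; the parser class and function names are likewise hypothetical. Only the Python standard library is assumed.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


# Hypothetical page head (assumption: not taken from a real site).
SAMPLE_HTML = """
<html prefix="og: http://ogp.me/ns#">
<head>
  <title>The Wire</title>
  <meta property="og:title" content="The Wire" />
  <meta property="og:type" content="video.tv_show" />
  <meta property="og:url" content="http://example.org/shows/the-wire" />
</head>
<body></body>
</html>
"""


def extract_open_graph(html):
    """Return a dict of og:* properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


og = extract_open_graph(SAMPLE_HTML)
print(og)
```

The same approach would work for any page that publishes OGP metadata; a fuller RDFa reader would need a proper RDFa parser rather than this attribute scan.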


TV Data Warehouse

• We still host several crawls of TV EPG data
• Trend is for data to be more cleanly available from source, without scraping

• Crawling, aggregation and integration still useful, but less scraping required

• Crawled 'data warehouse' also used as a research testbed collection



WP1: Example Datasets

• WP7c/WP3 use DBpedia/Wikipedia URLs for topics; covers all mainstream areas.

• BBC also using Lonclass/UDC topic codes (we’re helping prepare this for sharing)

• For Music, we adopt MusicBrainz IDs
• Mapping diverse representations of ‘genre’
• “Organic” item/topic similarity measures derived from user data from WP3
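
As a rough illustration of what mapping diverse ‘genre’ representations can look like, the sketch below normalises free-text genre labels and looks them up in a small table of shared topic URLs. The labels, the choice of target URLs and the function names are all assumptions made for this example; this is not NoTube's actual mapping table.

```python
# Illustrative mapping only: both the source labels and the target
# URLs are assumptions, not the project's real genre vocabulary.
GENRE_MAP = {
    "comedy": "http://dbpedia.org/resource/Comedy",
    "sitcom": "http://dbpedia.org/resource/Comedy",
    "documentary": "http://dbpedia.org/resource/Documentary_film",
    "factual": "http://dbpedia.org/resource/Documentary_film",
}


def normalise_label(label):
    """Lower-case a label and collapse surrounding/internal whitespace."""
    return " ".join(label.strip().lower().split())


def map_genre(label):
    """Return the shared topic URL for a genre label, or None if unmapped."""
    return GENRE_MAP.get(normalise_label(label))


for raw in ["  Sitcom ", "COMEDY", "Factual", "Western"]:
    print(repr(raw), "->", map_genre(raw))
```

Collapsing ‘sitcom’ into a broader comedy topic loses detail; a real mapping would probably keep the original label alongside the shared URL.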



WP1: Data Services

• Data Services exposed as static files:
– Show how to embed RDFa in HTML
– Publish as RDF/XML Linked Data

• Interactive Data Services:
– Using W3C SPARQL, SQL or SOLR/Lucene, over HTTP and/or XMPP.
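
For the SPARQL-over-HTTP case, the sketch below builds a query request URL in the style of the W3C SPARQL Protocol, where the query text is sent as a `query` parameter. The endpoint shown is the public DBpedia one and the resource URI is an assumed example; neither is a NoTube service. No network call is made, so the code only shows how the request would be formed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the public DBpedia endpoint; any SPARQL endpoint could be used.
ENDPOINT = "https://dbpedia.org/sparql"

# Assumed example query: fetch the English label of one DBpedia resource.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/The_Wire> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 1
"""


def build_sparql_get_url(endpoint, query):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def extract_query(url):
    """Recover the SPARQL text from a URL built by build_sparql_get_url."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_sparql_get_url(ENDPOINT, QUERY)
print(url)
```

A client would then issue an HTTP GET on this URL, typically with an `Accept: application/sparql-results+json` header to ask for JSON results.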



WP1: Exploitation and Sustainability

• WP1’s approach designed to outlive NoTube
• Use, augment and contribute to external data

– e.g. DBpedia, Archive.org, W3C & wider Web of data trend (e.g. RDFa adoption)

– also we demonstrate, e.g. on the blog, how we did it, so others can replicate it

– WP4 enrichments can be fed back to externals, e.g. similarity metrics & clusters



WP1: Sustainability 2

• NoTube’s 2010 W3C “Web & TV” position paper lobbied for unique IDs & public metadata for video content; this is now going mainstream.

• VUA will continue hosting some data, using PURL.org so that it can be passed on later, e.g. to W3C.

• Collaboration with Facebook OGP (helped with their RDFa adoption) and now with the search engines' schema.org (RDFa and extending TV vocabulary).


schema.org


Workpackage Links

• Background data for all Workpackages
• Collaborated with WP2 on BMF RDF models
• Closer ties throughout WP3/7 developments
• WP4 entity and topic URIs point to WP1
• Outreach work around RDFa, Position Paper



2nd review comments

• Not clear though how this work has built upon the results of year 1, and how the current progress is in line with the case studies.

– Worked more closely and pragmatically with case studies in WP7, especially 7c and related WP3 work. Moved towards more decentralised model, instead of 'warehouse'.

– 7c collaboration with KMI's 'Watch and Buy' scenario, and with WP4 timed ad insertion work, used EU p2pnext 'limo' work; also egtaMETA from EBU from 7c

– WP1 work became more "hands-on"; we helped WP7 extract datasets such as TED.com and Archive.org which we expect will shortly be replaceable by cleaner information from 'official' sources.



2nd review comments

• No relevant state of the art is documented and no details or citations on automated algorithms are given. Evaluation is restricted to examples and no quantitative data are given.

– We accept weakness in report (lack of scholarly/scientific detail); chose to focus on more informal communication with outside world in final phase. A 2nd version of the doc was produced, but main changes were around 'life after project' themes rather than adding more scientific and scholarly detail.



2nd review comments

• A close collaboration with WP7 is recommended in order to ensure that work meets the requirements of the use cases.

– this very well describes our emphasis in final phase



Lessons Learned

• It's hard to simulate an evolving global data ecosystem; but we've played a small part in some huge changes.

• Publishers will adopt simple Semantic Web standards when they are given an incentive.

• It's hard for a 4-year-old plan to stay relevant in such an environment; ability to be agile was critically important.



WP1 Summary

• Used open standards (RDF) and largely open data (e.g. Wikipedia/DBpedia)

• Integrated, mapped and data-mined
• Contributing our additions back to the community / commons (highlight: BBC sims)
• Documenting what we learned for external developers and subsequent projects


Questions?




WP1: End-to-End issues

• In final year, our End-to-End scenarios have more mature implementations

• Feedback from WP3/7c: key issue is sparsity of large vocabularies when used for record matching. No single solution here.

• Integrating techniques from WP4 (e.g. clustering, data-mining) critical for applying large and chaotic vocabularies for practical recommendations.
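
The sparsity issue can be seen in a toy example. The sketch below compares two programmes by the overlap of their topic sets (Jaccard similarity), first on fine-grained topics and then after replacing each topic with a coarser cluster label. The topic identifiers, the cluster table and the function names are invented for illustration; this is one simple way to show the effect, not the clustering technique WP4 actually used.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical fine-grained topic annotations for two programmes.
programme_a = {"dbpedia:Baltimore", "dbpedia:Police_procedural"}
programme_b = {"dbpedia:Maryland", "dbpedia:Crime_fiction"}

# Hypothetical assignment of fine-grained topics to coarser clusters.
CLUSTER = {
    "dbpedia:Baltimore": "cluster:us-places",
    "dbpedia:Maryland": "cluster:us-places",
    "dbpedia:Police_procedural": "cluster:crime-drama",
    "dbpedia:Crime_fiction": "cluster:crime-drama",
}


def coarsen(topics):
    """Replace each topic by its cluster label; keep unclustered topics."""
    return {CLUSTER.get(t, t) for t in topics}


raw = jaccard(programme_a, programme_b)
coarse = jaccard(coarsen(programme_a), coarsen(programme_b))
print("fine-grained overlap:", raw)
print("clustered overlap:", coarse)
```

With a large vocabulary, two related items rarely share exactly the same topic, so the fine-grained score is zero here while the clustered score is not.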


