Iswc 2014-hammond-pasin-presentation-final

LINKED DATA EXPERIENCE AT MACMILLAN Building discovery services for scientific and scholarly content on top of a semantic data model 22 October 2014 Tony Hammond Michele Pasin

description

Talk for ISWC 2014 (Industry Track) by Tony Hammond and Michele Pasin on October 22, 2014 at Riva del Garda, Italy: 'Linked data experience at Macmillan: Building discovery services for scientific and scholarly content on top of a semantic data model'

Transcript of Iswc 2014-hammond-pasin-presentation-final

Page 1: Iswc 2014-hammond-pasin-presentation-final

LINKED DATA EXPERIENCE AT MACMILLAN: Building discovery services for scientific and scholarly content on top of a semantic data model

22 October 2014

Tony Hammond

Michele Pasin

Page 2: Iswc 2014-hammond-pasin-presentation-final

Linked Data at Macmillan | 22 October 2014

1

Background

About Macmillan and what we are doing

Page 3: Iswc 2014-hammond-pasin-presentation-final

Macmillan Science and Education

Group brands and businesses

Page 4: Iswc 2014-hammond-pasin-presentation-final

MS&E Current trends

Change Drivers

●Digital first workflow

– print becomes secondary

– support for multiple workflows

●User-centric design

– things, not data

– focus on user experience

●Deeply integrated datasets

– standard naming convention

– common metadata model

– flexible schema management

– rich dataset descriptions

Developing a richer graph of objects

Page 5: Iswc 2014-hammond-pasin-presentation-final

NPG Linked Data Platform (2012)

Deliverables (2012–2014)

●Prototype for external use

●Two RDF dataset releases in 2012

– April 2012 (22m triples)

– July 2012 (270m triples)

●Live updates to query endpoint

●SPARQL query service (decommissioned)

Current Work (2014–)

●Focus on internal use-cases

●Publish ontology pages

●Periodic data snapshots

data.nature.com

Page 6: Iswc 2014-hammond-pasin-presentation-final

NPG Core Ontology (2014)

Features

●Classes: ~65

●Properties: ~200

●Named graphs (per class)

Namespaces

●npg: => http://ns.nature.com/terms/

●npgg: => http://ns.nature.com/graphs/
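
As an illustration of how the two prefixes above resolve, here is a minimal sketch that expands prefixed names into full URIs. The prefix table comes from the slide; the term name `Article` and the graph name `articles` are hypothetical, since the slide does not list individual terms.

```python
# Prefix table taken from the slide.
PREFIXES = {
    "npg": "http://ns.nature.com/terms/",
    "npgg": "http://ns.nature.com/graphs/",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'npg:Article' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return PREFIXES[prefix] + local


# 'Article' and 'articles' are hypothetical names, not taken from the ontology.
print(expand("npg:Article"))
print(expand("npgg:articles"))
```

Keeping terms and graphs in separate namespaces means a URI alone tells you whether it names a vocabulary term or a named graph.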

Approach

●Incremental formalization (RDF, RDFS, OWL-DL)

●Shared metamodel vs. automatic inference

●Minimal commitment to external vocabs

Things: assets, documents, events, types

Page 7: Iswc 2014-hammond-pasin-presentation-final

NPG Subject Pages (2014)

Features

●Based on SKOS taxonomy

– >2500 scientific terms

– content inherited via SKOS tree

●Dynamically generated

– one webpage per subject term

– secondary pages for article types

●Various formats, e.g. e-alerts, feeds

– allows people to ‘follow’ a subject

●Customized related content

– ads, jobs, events, etc.
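
The idea of content being inherited via the SKOS tree can be sketched as follows: an article tagged with a narrow term also appears on the pages of every broader term. This is only an illustration under stated assumptions, not the production logic; the subject terms, the article identifiers and the plain-dictionary representation of skos:broader are all hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy: child term -> broader (parent) term, as in skos:broader.
BROADER = {
    "genetics": "biological-sciences",
    "epigenetics": "genetics",
    "ecology": "biological-sciences",
}

# Hypothetical article-to-subject tagging.
TAGGED = {
    "article-1": ["epigenetics"],
    "article-2": ["genetics"],
    "article-3": ["ecology"],
}


def ancestors(term):
    """Yield the term itself, then every broader term up to the root."""
    seen = set()
    while term is not None and term not in seen:
        seen.add(term)
        yield term
        term = BROADER.get(term)


def build_subject_pages(tagged):
    """Map each subject term to the articles it shows, including inherited ones."""
    pages = defaultdict(set)
    for article, terms in tagged.items():
        for term in terms:
            for subject in ancestors(term):
                pages[subject].add(article)
    return pages


pages = build_subject_pages(TAGGED)
print(sorted(pages["genetics"]))
print(sorted(pages["biological-sciences"]))
```

In this sketch the page for a broad term is the union of everything tagged beneath it, which is one plausible reading of "one webpage per subject term" with inherited content.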

Topical access to content

Page 8: Iswc 2014-hammond-pasin-presentation-final

2

Data Storage and Query

Achieving speed by means of a hybrid architecture

Page 9: Iswc 2014-hammond-pasin-presentation-final

Content Hub

Capabilities

●Discovery – Graph

●Storage – Content Repos

Features

●Hybrid RDF + XML architecture

– MarkLogic for XML, RDF/XML

– Triplestore (TDB) for RDF validation

●Repos for binary assets

Datasets

●Documents (large; >1m)

●Ontologies (small; <10k)

Managed content warehouse for data discovery

Page 10: Iswc 2014-hammond-pasin-presentation-final

System Architecture

Hub content

Page 11: Iswc 2014-hammond-pasin-presentation-final

Content Discovery – Principles

Generations

●1st – Generic linked data API (RDF/*)

●2nd – Specific page model API (JSON)

Concerns

●Speed (20ms single object; 200ms filtered object)

●Simplicity (data construction)

●Stability (backup, clustering, security, transactions)

Principles

●Chunky not chatty, all data in a single response

●Data as consumed, rather than as stored

●Support common use cases in simple, obvious ways

●Ensure a guaranteed, consistent speed of response for more complex queries

●Build on foundation of standard, pragmatic REST (collections, items)
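
The "chunky not chatty" and "data as consumed" principles above might look something like the sketch below: a single call returns one JSON page model with related labels already embedded, so the client needs no follow-up requests. The record layout, field names and identifiers are hypothetical and stand in for whatever the real page model API returns.

```python
import json

# Hypothetical in-memory stand-ins for stored objects.
ARTICLES = {
    "a1": {"title": "Example article", "subjects": ["s1"], "published": "2014-10-01"},
    "a2": {"title": "Another article", "subjects": ["s1", "s2"], "published": "2014-09-15"},
}
SUBJECTS = {"s1": {"label": "Genetics"}, "s2": {"label": "Ecology"}}


def subject_page(subject_id: str) -> str:
    """Build one self-contained JSON page model for a subject."""
    articles = [
        {
            "id": article_id,
            "title": article["title"],
            "published": article["published"],
            # Subject labels are embedded, so the client needs no follow-up call.
            "subjects": [SUBJECTS[s]["label"] for s in article["subjects"]],
        }
        for article_id, article in ARTICLES.items()
        if subject_id in article["subjects"]
    ]
    # Newest first; ISO dates sort correctly as plain strings.
    articles.sort(key=lambda a: a["published"], reverse=True)
    page = {
        "id": subject_id,
        "label": SUBJECTS[subject_id]["label"],
        "count": len(articles),
        "articles": articles,
    }
    return json.dumps(page, indent=2)


print(subject_page("s1"))
```

The response is shaped for the page that consumes it (an item with its embedded collection) and not for the way the underlying objects happen to be stored.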

Readying the API for applications

Page 12: Iswc 2014-hammond-pasin-presentation-final

Content Discovery – Optimization

Approaches

●TDB + Fuseki – SPARQL

●MarkLogic Semantics – SPARQL

●MarkLogic – XQuery

●MarkLogic (Optimized) – XQuery

Techniques

●Partitioning – RDF/XML objects

●Streaming – serialization

●Hashing – dictionary lookup

●Caching – Varnish

Tuning the API for performance

Page 13: Iswc 2014-hammond-pasin-presentation-final

Content Storage – Layout and Indexing

Challenges

●Sort orders

●RDF Lists

●Faceting, counting

Layout

●Semantic RDF/XML includes in XML

●RDF objects serialized in list order

●Application XML for subject hierarchy

Indexes

●Indexes over all elements

●Range indexes for datatypes (e.g. datetimes)
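
RDF lists are listed as a challenge because triples carry no inherent order: the sequence of a list is encoded only through chained rdf:first and rdf:rest links ending in rdf:nil. The sketch below shows the traversal needed to recover that order from an unordered set of triples, which is presumably the cost that serializing objects in list order avoids. The node names and author identifiers are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"

# Hypothetical triples, deliberately out of order: a two-element RDF list.
TRIPLES = [
    ("_:n2", FIRST, "author-b"),
    ("_:n1", REST, "_:n2"),
    ("_:n1", FIRST, "author-a"),
    ("_:n2", REST, NIL),
]


def read_list(head, triples):
    """Follow rdf:first / rdf:rest links from the head node to rdf:nil."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    items, node, seen = [], head, set()
    while node != NIL:
        # Guard against cycles and nodes missing either link.
        if node in seen or node not in first or node not in rest:
            raise ValueError(f"malformed RDF list at {node!r}")
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items


print(read_list("_:n1", TRIPLES))
```

Each element costs one extra hop through the graph, so storing the members already in list order lets a page be rendered without this walk.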

Readying the data for page delivery

Page 14: Iswc 2014-hammond-pasin-presentation-final

In Conclusion

Summary

●An RDF metamodel allows for scalable enterprise-level data organization

●It is crucial to adequately distinguish between external and internal use cases

●A hybrid architecture proved to be an efficient internal solution for content delivery

Future Work

●Grow the ontology so that it matches product requirements more closely

●Support automated reasoning and richer query options – both RDF and XML based

●Maintain and expand the vision of a shared semantic model as a core enterprise asset

A few lessons learned

Page 15: Iswc 2014-hammond-pasin-presentation-final

For more information please contact

TONY HAMMOND, Data Architect, Content Data [email protected]

MICHELE PASIN, Information Architect, Product [email protected]

Thank you