Producing, publishing and consuming linked data - CSHALS 2013

Producing, Publishing and Consuming Linked Data Three Lessons from the Bio2RDF Project François Belleau Centre de recherche du CHUQ, Laval University Québec, Canada @bio2rdf

Bio2RDF presentation at CSHALS 2013 in Boston.

Transcript of Producing, publishing and consuming linked data - CSHALS 2013

Page 1: Producing, publishing and consuming linked data - CSHALS 2013

Producing, Publishing and Consuming

Linked Data Three Lessons from the Bio2RDF Project

François Belleau

Centre de recherche du CHUQ, Laval University

Québec, Canada

@bio2rdf

Page 2: Producing, publishing and consuming linked data - CSHALS 2013

• Looking backward to 2004

• Lessons :

1) How to produce RDF

2) How to publish Linked Data

3) How to consume SPARQL endpoints

• Looking forward to the next decade

Page 3: Producing, publishing and consuming linked data - CSHALS 2013

The story of two images

or Bio2RDF fairy tale

2004 vision 2011 reality

Page 4: Producing, publishing and consuming linked data - CSHALS 2013

Rdfizer inspiration

Page 5: Producing, publishing and consuming linked data - CSHALS 2013

Data Integration problem in

bioinformatics

Page 6: Producing, publishing and consuming linked data - CSHALS 2013

Where Bio2RDF got its name

Page 7: Producing, publishing and consuming linked data - CSHALS 2013

Mashup !

FungalWeb

from Christopher Baker

YeastHub

from Kei-Hoi Cheung

Page 8: Producing, publishing and consuming linked data - CSHALS 2013

ISMB 2005 Birds of a Feather

Page 9: Producing, publishing and consuming linked data - CSHALS 2013

W3C conference in 2007: 46 million documents in SESAME

Page 10: Producing, publishing and consuming linked data - CSHALS 2013

DILS conference in 2008: 63 million triples in Virtuoso

Page 11: Producing, publishing and consuming linked data - CSHALS 2013

ISMB conference in 2008: 65 million triples in Virtuoso

Page 12: Producing, publishing and consuming linked data - CSHALS 2013

March 2009: Linked Data cloud is published

Bio2RDF's 2.3 billion triples represent 54% of the global graph

Page 13: Producing, publishing and consuming linked data - CSHALS 2013

W3C-HCLS F2F Meeting in 2009 41 in Virtuoso endpoints

Page 14: Producing, publishing and consuming linked data - CSHALS 2013

CSHALS conference in 2013: 1 billion triples in 19 Virtuoso endpoints with Bio2RDF release 2, and still adding…

Page 15: Producing, publishing and consuming linked data - CSHALS 2013

Bio2RDF is not alone anymore !

Page 16: Producing, publishing and consuming linked data - CSHALS 2013

How to produce RDF

• The Bio2RDF project transforms existing public databases into RDF;

• Transforming a data format into RDF triples is simple to do;

• Transformations need to be done from many kinds of formats (CSV, XML, JSON, HTML, relational databases) to RDF.
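
As a minimal sketch of what "transformation to RDF triples" can look like for the CSV case, the Python below turns each non-empty cell of a CSV document into one N-Triples statement. It is not the Bio2RDF rdfizer; the function names, the `http://example.org/` URIs and the sample rows are all made up for illustration.

```python
import csv
import io


def escape_literal(value):
    """Escape a string so it can sit inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(text, base, vocab, id_column):
    """Yield one N-Triples line per non-empty cell of a CSV document.

    base, vocab and id_column are hypothetical parameters: the subject URI is
    base + identifier, the predicate URI is vocab + column name.
    """
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{base}{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue
            yield f'{subject} <{vocab}{column}> "{escape_literal(value)}" .'


# Made-up sample data, not taken from any real database.
SAMPLE = "id,symbol,name\n1,ABC,example gene\n"

triples = list(rows_to_ntriples(
    SAMPLE,
    base="http://example.org/record:",
    vocab="http://example.org/vocabulary#",
    id_column="id",
))
for triple in triples:
    print(triple)
```

Every value is emitted as a plain literal here; a real rdfizer would also decide which columns are cross-references and emit those as URIs.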

Page 17: Producing, publishing and consuming linked data - CSHALS 2013

Methods

• 2006 – Converting XML and HTML documents from the web using the JSP JSTL library

• 2007-2010 – Perl scripts, JSP web pages

• 2012 – Release 2.0 rdfizers are written in PHP

• 2013 – Use Talend ETL jobs

Page 18: Producing, publishing and consuming linked data - CSHALS 2013

ETL definition from Wikipedia

In computing, Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that involves:

• Extracting data from outside sources

• Transforming it to fit operational needs (which can include quality levels)

• Loading it into the end target (database, more specifically, operational data store, data mart or data warehouse)

http://en.wikipedia.org/wiki/Extract,_transform,_load

Page 19: Producing, publishing and consuming linked data - CSHALS 2013

Why not use ETL software

to rdfize existing data ?

Page 20: Producing, publishing and consuming linked data - CSHALS 2013

Talend Open Studio for Data Integration: free, open source ETL software built with Eclipse

http://www.talend.com/

Page 21: Producing, publishing and consuming linked data - CSHALS 2013

HGNC 2 Bio2RDF example

EXTRACT from the web

TRANSFORM to RDF

LOAD into triplestore
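
The three stages can be sketched in plain Python as chained generators. This is only an illustration of the EXTRACT / TRANSFORM / LOAD shape, not the Talend job shown on the slides: the input is assumed to be tab-delimited text with an identifier and a symbol in the first two columns, the sample rows are invented placeholders, and an in-memory buffer stands in for the triplestore.

```python
import io

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def extract(source):
    """EXTRACT: read tab-delimited lines and yield (identifier, symbol) pairs."""
    for line in source:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        identifier, symbol = line.split("\t")[:2]
        yield identifier, symbol


def transform(records):
    """TRANSFORM: turn each record into one N-Triples statement.

    Assumes symbols contain no quotes or backslashes, so no escaping is done.
    """
    for identifier, symbol in records:
        yield f'<http://bio2rdf.org/hgnc:{identifier}> <{RDFS_LABEL}> "{symbol}" .'


def load(triples, sink):
    """LOAD: write the statements to a sink and return how many were written."""
    count = 0
    for triple in triples:
        sink.write(triple + "\n")
        count += 1
    return count


# Invented placeholder rows; a real run would read the downloaded HGNC file.
source = io.StringIO("# id\tsymbol\n0001\tFAKE1\n0002\tFAKE2\n")
sink = io.StringIO()  # stands in for a triplestore bulk-load file

loaded = load(transform(extract(source)), sink)
print(loaded, "triples written")
```

Because each stage is a generator, records stream through one at a time, which is roughly what an ETL tool does when it connects components in a job.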

Page 22: Producing, publishing and consuming linked data - CSHALS 2013

HGNC 2 Bio2RDF : EXTRACT

Page 23: Producing, publishing and consuming linked data - CSHALS 2013

HGNC 2 Bio2RDF : TRANSFORM

Page 24: Producing, publishing and consuming linked data - CSHALS 2013

HGNC 2 Bio2RDF : LOAD

Page 25: Producing, publishing and consuming linked data - CSHALS 2013

This rdfizer is available on myExperiment

http://www.myexperiment.org/workflows/3420.html

Page 26: Producing, publishing and consuming linked data - CSHALS 2013

Lesson #1

• Use an existing ETL tool, like Talend, to do fast and efficient transformation to the RDF N-Triples format.

• Talend could be extended with new Semantic web components to ease RDF transformation and simplify SPARQL query submission.

Page 27: Producing, publishing and consuming linked data - CSHALS 2013

How to publish Linked Data

• Design your URI pattern;

• Publish a SPARQL endpoint on the Internet;

• Offer a search engine and a browser;

• Register it with an official registry like CKAN;

• Advertise it in a SPARQL endpoint list;

• Describe your triples with an ontology or the way Bio2RDF does;

• Publish SPARQL query examples;

• Index your data in a semantic search service like Sindice.

Page 28: Producing, publishing and consuming linked data - CSHALS 2013

Design your URI pattern

• Bio2RDF uses Banff Manifesto URIs

• http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Banff_Manifesto

• Example : http://bio2rdf.org/geneid:15275

• Apply the four linked data rules

• http://www.w3.org/DesignIssues/LinkedData.html

• Be polite with other URIs

• http://hackathon3.dbcls.jp/wiki/URI

• Example : http://purl.uniprot.org/uniprot/P05067
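
The Bio2RDF example above follows a `http://bio2rdf.org/namespace:identifier` shape. A small helper for building and splitting such URIs might look like the sketch below. The function names are hypothetical, and lowercasing the namespace and percent-encoding the identifier are assumptions made for this example rather than rules quoted from the Banff Manifesto.

```python
from urllib.parse import quote, unquote

BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace, identifier):
    """Build a namespace:identifier URI under the bio2rdf.org base."""
    ns = namespace.strip().lower()  # assumption: namespaces are lowercase
    if not ns or ":" in ns or "/" in ns:
        raise ValueError(f"invalid namespace: {namespace!r}")
    # assumption: reserved characters in the identifier are percent-encoded
    return f"{BASE}{ns}:{quote(identifier.strip(), safe='')}"


def split_bio2rdf_uri(uri):
    """Return (namespace, identifier) for a URI built by bio2rdf_uri."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a bio2rdf.org URI: {uri!r}")
    namespace, separator, identifier = uri[len(BASE):].partition(":")
    if not separator:
        raise ValueError(f"missing ':' separator: {uri!r}")
    return namespace, unquote(identifier)


uri = bio2rdf_uri("geneid", "15275")
print(uri)
print(split_bio2rdf_uri(uri))
```

Keeping URI construction in one function makes it harder for different rdfizers to drift into slightly different patterns for the same namespace.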

Page 29: Producing, publishing and consuming linked data - CSHALS 2013

Publish SPARQL endpoint on the

Internet

• Choose a triplestore technology

• http://en.wikipedia.org/wiki/Triplestore

Page 30: Producing, publishing and consuming linked data - CSHALS 2013

Offer a search engine and a browser

Page 31: Producing, publishing and consuming linked data - CSHALS 2013

Register it with an official registry like CKAN

Page 32: Producing, publishing and consuming linked data - CSHALS 2013

Advertise it in SPARQL endpoint list

http://www.freebase.com/view/base/politeuri/sparql_endpoint

http://beta.bio2rdf.org/

Page 33: Producing, publishing and consuming linked data - CSHALS 2013

Describe your triples

Page 34: Producing, publishing and consuming linked data - CSHALS 2013

Publish SPARQL query example

http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Essential_SPARQL_queries

Page 35: Producing, publishing and consuming linked data - CSHALS 2013

Index your data in semantic search

service

Page 36: Producing, publishing and consuming linked data - CSHALS 2013

Lesson #2

• To be present in the Linked Data cloud, just publish your data through a SPARQL endpoint.

• Register it with public resources, describe its content and suggest SPARQL queries.

• We have used the OpenLink Virtuoso free edition since 2007. Without this first class triplestore software there would not be a Bio2RDF service.

Page 37: Producing, publishing and consuming linked data - CSHALS 2013

How to consume

SPARQL endpoints

Two principles :

1. To answer a specific question first build a

mashup using public or private SPARQL

endpoints.

2. Then, ask your questions to the mashup.
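
Asking a question of an endpoint comes down to an HTTP request that carries the query text. The sketch below prepares such a request with the Python standard library, following the SPARQL protocol convention of a `query` parameter and an `Accept` header for JSON results. The helper name and the sample query are assumptions for illustration; the request is only built, not sent, since the endpoint may not be reachable.

```python
from urllib.parse import urlencode
from urllib.request import Request

JSON_RESULTS = "application/sparql-results+json"


def build_sparql_request(endpoint, query):
    """Prepare (but do not send) an HTTP GET request for a SPARQL SELECT query."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": JSON_RESULTS})


# Endpoint taken from the slides; the query is a generic placeholder.
ENDPOINT = "http://cshals.mashup.bio2rdf.org/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen(request)` would send it and return the response body.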

Page 38: Producing, publishing and consuming linked data - CSHALS 2013

How to build a semantic mashup

• 2005 - Import RDF files into Protégé.

• 2006 - Use the ELMO RDF crawler to import RDF data into the SESAME triplestore.

• 2007 - We implemented an import function in SESAME based on dereferenceable URIs.

• 2008 - Use the Virtuoso sponge option and Perl scripts.

• 2009 - Use the Taverna workflow engine to fetch triples from SPARQL endpoints.

• 2012 - Use a Talend workflow consuming SPARQL endpoints.

Page 39: Producing, publishing and consuming linked data - CSHALS 2013

Who is influential at CSHALS ?

http://cshals.mashup.bio2rdf.org/relfinder/

http://cshals.mashup.bio2rdf.org/sparql

Page 40: Producing, publishing and consuming linked data - CSHALS 2013

Talend workflow to create the

needed semantic mashup

• Do a full text search for each author (~80) who talked at CSHALS since 2007 and get their publications;

• For each publication get its XML description (~1000) and rdfize it;

• For each publication get its citation list;

• For each publication citing a previous one get its description (~10 000).
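
The steps above map onto three NCBI E-utilities calls (esearch, efetch, elink). The sketch below only builds the request URLs. The helper names are hypothetical, the author name and PubMed identifiers are placeholders, and the use of the `pubmed_pubmed_citedin` link name for the citation list is an assumption about how the workflow queried elink.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, retmax=100):
    """Step 1: full text search in PubMed for a term such as an author name."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Step 2: fetch the XML description of one or more publications."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}efetch.fcgi?{urlencode(params)}"


def elink_citedin_url(pmid):
    """Step 3: list the publications that cite the given one."""
    params = {"dbfrom": "pubmed", "linkname": "pubmed_pubmed_citedin", "id": pmid}
    return f"{EUTILS}elink.fcgi?{urlencode(params)}"


# Placeholder inputs for illustration only.
print(esearch_url("Belleau F"))
print(efetch_url(["111", "222"]))
print(elink_citedin_url("111"))
```

Each publication returned in step 3 goes back through step 2, which is how roughly 1000 descriptions grow to roughly 10 000.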

Page 41: Producing, publishing and consuming linked data - CSHALS 2013

Global workflow in 3 steps

Full text search

Describe publication

Describe citing

publication

Page 42: Producing, publishing and consuming linked data - CSHALS 2013

Full text search using ncbi/esearch

Page 43: Producing, publishing and consuming linked data - CSHALS 2013

Describe publication, pubmed rdfizer for

ncbi/efetch and ncbi/elink service

Page 44: Producing, publishing and consuming linked data - CSHALS 2013

Describe citing publication using ncbi/elink

Page 45: Producing, publishing and consuming linked data - CSHALS 2013

Then query the mashup

• What is CSHALS conference about ?

• Who are the most influential researchers in

the community ?

• Which articles in semantics have been most cited ?

Page 46: Producing, publishing and consuming linked data - CSHALS 2013

What is CSHALS conference about ?

select ?label2 as ?mesh count(*) as ?count

where {

?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .

?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .

?xMesh rdfs:label "Semantics" .

?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh2 .

?xMesh2 rdfs:label ?label2 .

}

order by desc(2)
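
An endpoint can return the answer to a query like this one in the SPARQL JSON results format. The sketch below shows one way to read such a document on the client side and rank the rows by count. The function names are hypothetical and the sample document contains invented placeholder terms and counts, not results from the CSHALS mashup.

```python
import json


def rows_from_sparql_json(document):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    data = json.loads(document)
    variables = data["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]


def top_counts(rows, label_var, count_var, limit=10):
    """Return (label, count) pairs sorted by descending count."""
    ranked = sorted(rows, key=lambda row: int(row[count_var]), reverse=True)
    return [(row[label_var], int(row[count_var])) for row in ranked[:limit]]


# Invented placeholder results, shaped like the ?mesh / ?count columns above.
SAMPLE = """
{"head": {"vars": ["mesh", "count"]},
 "results": {"bindings": [
   {"mesh": {"type": "literal", "value": "Term A"},
    "count": {"type": "literal",
              "datatype": "http://www.w3.org/2001/XMLSchema#integer",
              "value": "3"}},
   {"mesh": {"type": "literal", "value": "Term B"},
    "count": {"type": "literal",
              "datatype": "http://www.w3.org/2001/XMLSchema#integer",
              "value": "7"}}
 ]}}
"""

rows = rows_from_sparql_json(SAMPLE)
ranking = top_counts(rows, "mesh", "count")
print(ranking)
```

Counts arrive as strings in this format, so they are converted with `int` before sorting.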

Page 47: Producing, publishing and consuming linked data - CSHALS 2013

Who are the most influential

researchers in the community ?

select ?l3 as ?author count(distinct ?pubmed ) as ?citation

where {

?s a <http://bio2rdf.org/pubmed_vocabulary#searchResults> .

?s rdfs:label ?l .

?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .

?pubmed <http://bio2rdf.org/pubmed_vocabulary:xCitedIn>

?xCitedIn .

?pubmed rdfs:label ?l2 .

?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .

?xMesh rdfs:label "Semantics" .

?pubmed <http://bio2rdf.org/pubmed_vocabulary#xPerson> ?xPerson .

?xPerson rdfs:label ?l3 .

}

order by desc(2)

Page 48: Producing, publishing and consuming linked data - CSHALS 2013

Which articles in semantics have been most cited ?

select ?l2 as ?title count(?xCitedIn) as ?count

where {

?s a <http://bio2rdf.org/pubmed_vocabulary#searchResults> .

?s rdfs:label ?l .

?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .

?pubmed <http://bio2rdf.org/pubmed_vocabulary:xCitedIn> ?xCitedIn .

?pubmed rdfs:label ?l2 .

?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .

?xMesh rdfs:label "Semantics" .

} order by desc(2)

Page 49: Producing, publishing and consuming linked data - CSHALS 2013

What is the relation between François

Belleau and Michel Dumontier ?

Page 50: Producing, publishing and consuming linked data - CSHALS 2013

Using RelFinder

http://www.visualdataweb.org/relfinder.php

http://cshals.mashup.bio2rdf.org/relfinder

Page 52: Producing, publishing and consuming linked data - CSHALS 2013

Gruff for AllegroGraph

http://www.franz.com/agraph/gruff/

Page 53: Producing, publishing and consuming linked data - CSHALS 2013

Lesson #3

• To answer a specific question build a mashup

from SPARQL endpoints and query it.

• To build your semantic mashup, use a

workflow which can be created with an ETL

like Talend.

• Explore the mashup with semantic software

like Virtuoso faceted browser, RelFinder,

Gruff or Sentient.

Page 54: Producing, publishing and consuming linked data - CSHALS 2013

Projects

• Add new data sources to the Bio2RDF collection of SPARQL endpoints;

• Develop a Talend ETL Semantic web extension to ease rdfizing and the SPARQL endpoint consumption needed to build mashups;

• Create a mobile application to browse Bio2RDF or other SPARQL data sources.

Page 55: Producing, publishing and consuming linked data - CSHALS 2013

Looking forward to the next decade

• More data providers will expose their data as SPARQL endpoints, but Bio2RDF is still needed.

• Now that data has been converted to RDF (a dirty job), we need to ask useful questions of the Linked Data cloud (a hard one). SPARQL queries will not be sufficient and reasoners will be essential.

• Semantic software for browsing, visualisation and editing will be created, and SPARQL federated query engines will become available. This will be the next game changer.

• Intuitive mobile applications will give access to Semantic web data in a user-friendly manner.

• The data integration experience will be successful for scientist users if our enthusiastic community gets organized, so governance for Linked Data in Life Science is a major issue.

Page 57: Producing, publishing and consuming linked data - CSHALS 2013

Acknowledgements

• Bio2RDF is a community project available at http://bio2rdf.org

• The community can be joined at

https://groups.google.com/forum/?fromgroups#!forum/bio2rdf

• This work was done under the supervision of Dr Arnaud Droit,

assistant professor and director of the Centre de Biologie

Computationnelle du CRCHUQ at Laval University, where Bio2RDF

is hosted.

• Michel Dumontier, from the Dumontier Lab at Carleton University, is also hosting a Bio2RDF server, and his team created the new release 2.

• Thanks to all members of the Bio2RDF community, and especially Marc-Alexandre Nolin and Peter Ansell, the initial developers.

• This work was supported by the Ministère du Développement économique, de l'Innovation et de l'Exportation (MDEIE).