Transcript
Page 1: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Dandelion: from raw data to dataGEMs for developers

Stefano Parmesan

Tatiana Tarasova

Ugo Scaiella

Michele Barbera

Page 2: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

A bit of context

• SpazioDati s.r.l.
  • Italian startup: Pisa & Trento
  • Members of the DBpedia Association
  • Manage the Italian DBpedia

Page 3: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Goal

• Close the gap between getting the data and using it

• Build a Knowledge Graph as-a-service:
  • Make it queryable
  • Make it stable, make it scale
  • Support different access levels

Page 4: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

How?

• Phase #1: PUT the data in
  • Data normalization
  • Entity deduplication

• Phase #2: GET the data out
  • Slices
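The two Phase #1 steps can be illustrated with a minimal sketch. It does not reproduce the actual Dandelion pipeline: the normalization rules, the exact-match deduplication key, the function names, and the sample records are all assumptions made for illustration (real link-discovery tools typically use configurable similarity rules rather than exact matching).

```python
import unicodedata
from collections import defaultdict


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents and collapse whitespace (illustrative rules only)."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def deduplicate(records):
    """Merge records whose normalized name and code match exactly."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize_name(record["name"]), record["code"])
        groups[key].append(record)
    merged = []
    for (name, code), members in groups.items():
        merged.append({
            "name": name,
            "code": code,
            # Keep track of which sources contributed to the merged entity.
            "sources": sorted({m["source"] for m in members}),
        })
    return merged


# Hypothetical raw records coming from two sources.
records = [
    {"name": "Agliè", "code": "001001", "source": "source-1"},
    {"name": "  AGLIE ", "code": "001001", "source": "source-2"},
    {"name": "Example Town", "code": "999999", "source": "source-1"},
]
entities = deduplicate(records)
```

Here the two spellings of the same municipality collapse into one entity that remembers both sources, while the unrelated record stays separate.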

Page 5: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

How?

[Pipeline diagram]
Stages: Raw Data (Source 1 … Source N), Data Normalisation, Entity Deduplication, Data Storage, Data Access
Tools: Azkaban, Silk Framework, Titan graph, dandelion.eu
Other labels: Sample, Reconciliation Services, Linked Data, Slices, dataGEM

Page 6: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Why…

• … slices?
  • SQL-like APIs
  • Common knowledge, linked data

• … a graph at all?
  • Traversals
  • Data is centralized
  • Different sources, different access levels
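A rough sketch of what an SQL-like slice over graph nodes with access levels could look like. The `Node` class, the `slice_nodes` function, the "public"/"private" levels, and all sample data except the Agliè values are hypothetical and are not taken from the Dandelion API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    node_type: str
    props: dict
    access: str = "public"  # hypothetical access level: "public" or "private"


def slice_nodes(nodes, node_type, where=None, select=None, allowed=("public",)):
    """Return rows, roughly like SELECT <select> FROM <node_type> WHERE <where>."""
    where = where or {}
    rows = []
    for node in nodes:
        # Skip nodes of another type or above the caller's access level.
        if node.node_type != node_type or node.access not in allowed:
            continue
        # Keep only nodes matching every equality condition.
        if any(node.props.get(k) != v for k, v in where.items()):
            continue
        keys = select or sorted(node.props)
        rows.append({k: node.props.get(k) for k in keys})
    return rows


nodes = [
    Node("n1", "AdministrativeDivision",
         {"label": "Agliè", "elevation": 315, "isCoastal": False}),
    Node("n2", "AdministrativeDivision",
         {"label": "Hypothetical Town", "elevation": 10, "isCoastal": True}),
    Node("n3", "Company", {"label": "Hypothetical Ltd"}, access="private"),
]

public_rows = slice_nodes(
    nodes, "AdministrativeDivision",
    where={"isCoastal": False}, select=["label", "elevation"],
)
private_rows = slice_nodes(nodes, "Company", allowed=("public", "private"))
```

The same centralized node set serves both callers; only the `allowed` argument changes what each of them can see.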

Page 7: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Why…

• … Titan/Gremlin?
  • Scalable
  • Richer (multi-property, undefined-depth queries)
  • Open source
  • Powered by Elasticsearch

Page 8: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

And now what?

• Still a prototype:
  • Private beta access to slices (demo)
  • English and Italian DBpedia
  • Corporate private data

Page 9: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Future?

• Phase #1b: PUT the data in
  • Scalable entity deduplication

• Phase #2b: GET the data out
  • API for graph traversal
  • Text analysis tools (dataTXT)
  • Customizations

Page 10: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

RDF mappings

<http://data.spaziodati.eu/resource/ef4a83008d4ffd9e85c1fcf6dfe59c8758226cdb> a code:ISTATAdministrativeDivision ;
    sd:childOf <http://data.spaziodati.eu/resource/7b7d45857f1372e1205bcfc87c19b2b2db2e0f59> ;
    sd:code "001001" ;
    sd:acheneID "ef4a83008d4ffd9e85c1fcf6dfe59c8758226cdb" ;
    code:cadastralCode "A074" ;
    sd:label "Agliè" ;
    code:elevation "315"^^xsd:int ;
    code:isCoastal "false"^^xsd:boolean ;
    code:isMountainous "false"^^xsd:boolean ;
    sd:level "60"^^xsd:int .

_:node194hhq904x1 rdf:subject <http://data.spaziodati.eu/resource/ef4a83008d4ffd9e85c1fcf6dfe59c8758226cdb> ;
    rdf:predicate code:population ;
    rdf:object "2574"^^xsd:int ;
    sd:acheneID "31e4104e62168ffc4c3d6d278ecc775effff6ebc" ;
    metaprop:validSince "2001-10-21"^^xsd:date .

_:node194hhq904x2 rdf:subject <http://data.spaziodati.eu/resource/ef4a83008d4ffd9e85c1fcf6dfe59c8758226cdb> ;
    rdf:predicate code:population ;
    rdf:object "2644"^^xsd:int ;
    sd:acheneID "f38e87252cc5614faeec4abbeedd6315f5d00e9f" ;
    metaprop:validSince "2011-10-09"^^xsd:date .
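The two reified `code:population` statements above carry a `metaprop:validSince` date each. A minimal sketch of how a consumer might pick the value valid on a given day follows. It assumes that a value holds from its `validSince` date until a later statement supersedes it, which the slide does not state; the dictionary layout and the `value_at` helper are hypothetical.

```python
from datetime import date

AGLIE = "http://data.spaziodati.eu/resource/ef4a83008d4ffd9e85c1fcf6dfe59c8758226cdb"

# The two reified statements from the slide, as plain dictionaries.
statements = [
    {"subject": AGLIE, "predicate": "code:population",
     "object": 2574, "validSince": date(2001, 10, 21)},
    {"subject": AGLIE, "predicate": "code:population",
     "object": 2644, "validSince": date(2011, 10, 9)},
]


def value_at(statements, subject, predicate, on):
    """Return the object of the latest statement valid on the given day, or None."""
    candidates = [
        s for s in statements
        if s["subject"] == subject
        and s["predicate"] == predicate
        and s["validSince"] <= on
    ]
    if not candidates:
        return None
    # Assumption: the most recent validSince supersedes earlier values.
    return max(candidates, key=lambda s: s["validSince"])["object"]


population_2005 = value_at(statements, AGLIE, "code:population", date(2005, 1, 1))
population_2014 = value_at(statements, AGLIE, "code:population", date(2014, 10, 19))
```

Attaching the date to the statement rather than to the entity is what lets both population figures coexist on the same resource.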

Page 11: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Graph structure

Provenance nodes

Type nodes

Bristle node

Achene node

Page 12: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Traversing

• v.as('x').out('sd:childOf').loop('x'){ cur -> cur.object.outE('sd:childOf').hasNext() }.path()
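The traversal keeps following `sd:childOf` edges while the current vertex still has an outgoing one, then emits the whole path. The same idea can be sketched in plain Python over a parent map; the `path_to_root` helper and the identifiers in `child_of` are illustrative stand-ins, not real resource IDs, and the cycle check is an addition that the Gremlin one-liner does not have.

```python
def path_to_root(child_of, start):
    """Follow childOf links from start until a node with no parent is reached."""
    path = [start]
    seen = {start}
    current = start
    while current in child_of:
        current = child_of[current]
        if current in seen:
            # Guard against malformed data; not part of the original traversal.
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        path.append(current)
    return path


# Hypothetical hierarchy: municipality -> province -> region -> country.
child_of = {
    "aglie": "provincia-torino",
    "provincia-torino": "piemonte",
    "piemonte": "italia",
}

hierarchy = path_to_root(child_of, "aglie")
```

One difference to keep in mind: for a start node without any parent this sketch returns a one-element path, whereas the Gremlin traversal would emit nothing because the initial `out` step yields no vertices.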

Page 13: ISWC 2014 - Dandelion: from raw data to dataGEMs for developers

Stefano Parmesan [email protected]