Post on 14-Jul-2015
DaPaaS: Enabling Low-cost Open Data Publishing and Reuse
@ Data Summit Brussels
March 5th, 2015
http://dapaas.eu/
Marin Dimitrov, Ontotext, Bulgaria
Amanda Smith, Open Data Institute, UK
Open Data Benefits
• Businesses can develop new ideas, services and applications; improve decision making and save costs
• Governments can increase transparency and accountability, and improve the quality of public services
• Citizens get better and more timely access to public services
Source: McKinsey http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
• By 2016, the use of "open data" will continue to increase, but slowly, and predominantly limited to Type A enterprises.
• By 2017, over 60% of government open data programs that do not effectively use open data internally will be scaled back or discontinued.
• By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad / public access to it.
Source: Gartner http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
Lots of open datasets on the Web…
• A large number of open datasets published in recent years
• Various domains: cultural heritage, science, finance, statistics, transport and smart cities, environment, …
• Various formats: tabular (e.g. CSV, XLS), HTML/XML, JSON, LOD, Web APIs…
…but few actually used
• Few applications utilizing open and distributed datasets at present
• Challenges for data consumers
– Data quality issues
– Difficult or unreliable data access
– Licensing issues
• Challenges for data publishers
– Lack of expertise & resources: not easy to publish & maintain high quality data
– Unclear monetization & sustainability
Open Data Portal    Datasets     Applications
data.gov            ~ 110 000    ~ 350
publicdata.eu       ~ 50 000     ~ 80
data.gov.uk         ~ 20 000     ~ 350
data.norge.no       ~ 300        ~ 40
Open Data is mostly tabular data
– Records organized in silos of collections
– Very few links within and/or across collections
– Difficult to understand the nature of the data
– Difficult to integrate / query
[Figure: tabular datasets on publicdata.eu and data.gov.uk]
Linked Data is great for Open Data
• Linked Data is a good way to represent and integrate disparate, heterogeneous open data sources
• How Linked Data can improve Open Data:
– Easier integration, free data from silos
– Seamless interlinking of data
– Understand the data
– New ways to query and interact with data
• Challenges with using Linked Data
– Lack of tooling & expertise to publish high quality Linked Data
– Lack of resources to host LOD endpoints / unreliable data access
DaPaaS: making Open (Linked) Data easier to use
• A data hosting platform: to make it easy for publishers to put data on the web
• A data portal: to help advertise data availability
• Data transformation tools: to make it easier to publish large amounts of high quality data
• Open source tools with high-quality documentation
Make Linked Data more accessible to everyone!
Key enablers
• Grafter & Grafterizer (graphical tool & DSL)
• RDF database-as-a-service
• Open Data Portal + PLUQI
Grafter
• Grafter is a DSL and a suite of tools for data transformation & cleaning
• Primarily used for handling data conversions from:
– tabular data formats to tabular data formats
– tabular data formats to RDF
• “Lazy” / stream processing: no need to load the whole dataset into memory
• Robust & efficient for large scale processing
• Transformations can be packaged as REST services
• Open Source (EPL)
– http://github.com/swirrl/grafter
– http://grafter.org/
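Grafter itself is a Clojure DSL, so the following is not Grafter's API. It is only a minimal plain-Python sketch of the "lazy" / stream processing idea from the list above: each step is a generator that pulls one row at a time, so the whole dataset never has to sit in memory. The column names, the sample data and the helper names (read_rows, drop_empty, normalise, run_pipeline) are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List

# Illustrative sample data (not from the slides).
SAMPLE = "city,population\n oslo ,634000\n,0\nSOFIA,1200000\n"


def read_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield rows one at a time; nothing is read ahead of need."""
    yield from csv.DictReader(lines)


def drop_empty(rows: Iterable[Dict[str, str]], column: str) -> Iterator[Dict[str, str]]:
    """Skip rows whose value in `column` is blank."""
    for row in rows:
        if row[column].strip():
            yield row


def normalise(rows: Iterable[Dict[str, str]], column: str) -> Iterator[Dict[str, str]]:
    """Trim whitespace and title-case the value in `column`."""
    for row in rows:
        yield {**row, column: row[column].strip().title()}


def run_pipeline(text: str) -> List[Dict[str, str]]:
    """Chain the steps; rows flow through only when the result is consumed."""
    rows = read_rows(io.StringIO(text))
    rows = drop_empty(rows, "city")
    rows = normalise(rows, "city")
    return list(rows)


if __name__ == "__main__":
    for row in run_pipeline(SAMPLE):
        print(row)
```

Because every step is a generator, the same chain could be fed from a file handle instead of an in-memory string, and it would still hold only one row at a time.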
Tabular data (spreadsheet) to RDF Linked Data (graph)
1. Define a pipeline of tabular transformations for data cleaning and transformation.
2. Create the graph fragments resulting in the generation of an RDF graph.
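The two steps above can be sketched in plain Python. This is an illustration of the idea, not Grafter code: step 1 cleans the tabular rows, step 2 turns each cleaned row into a small graph fragment, serialised here as N-Triples lines. The http://example.org/ namespace, the column names and the function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List

BASE = "http://example.org/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def clean(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Step 1: tabular clean-up (trim names, drop blank rows, parse numbers)."""
    for row in rows:
        name = row["city"].strip()
        if name:
            yield {"city": name, "population": int(row["population"])}


def row_to_triples(row: Dict[str, object]) -> Iterator[str]:
    """Step 2: emit the graph fragment for one row as N-Triples lines.

    Assumes names contain no quotes or backslashes; a real serialiser
    would escape literals properly.
    """
    name = str(row["city"])
    slug = name.lower().replace(" ", "-")
    subject = "<" + BASE + "city/" + slug + ">"
    yield subject + " <" + RDF_TYPE + "> <" + BASE + "City> ."
    yield subject + " <" + BASE + 'name> "' + name + '" .'
    yield (subject + " <" + BASE + 'population> "' + str(row["population"])
           + '"^^<' + XSD_INTEGER + "> .")


def to_ntriples(text: str) -> List[str]:
    """Run both steps over CSV text and collect the resulting triples."""
    rows = csv.DictReader(io.StringIO(text))
    return [triple for row in clean(rows) for triple in row_to_triples(row)]


if __name__ == "__main__":
    for line in to_ntriples("city,population\nOslo,634000\n,0\n"):
        print(line)
```

Keeping the tabular clean-up separate from the graph-fragment step means the same cleaned rows could also be written back out as a table, which matches the tabular-to-tabular use mentioned for Grafter.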
Grafterizer
• GUI tool for the Grafter suite
• Open Source (EPL)
– github.com/dapaas/grafterizer
Use Case: Transformation and Mapping to RDF
• Import raw data
• Clean up and transform using Grafter / Grafterizer
• Define ontology mapping using Grafterizer
• Generate the RDF graph
[Diagram: Raw Data → Transform → Prepared Data → Map (ontology mapping to Ontology X) → Generate RDF → RDF Graph]
RDF database-as-a-service
• Enables live data services, instead of static datasets
– A new RDF database can be operational within seconds
• Automated backups, operations, maintenance
• Based on an enterprise-grade RDF database
– Linked Data Fragments servers to be deployed too
• Designed for scalability & availability, in the cloud
• Data import services (Grafter pipelines)
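As an illustration of what a "live data service" could look like from the consumer side, here is a minimal sketch. It assumes the hosted RDF database exposes a standard SPARQL 1.1 Protocol endpoint, which the slides do not state explicitly. The endpoint URL is hypothetical, and the sketch only builds the request and parses a result document; it does not contact any server.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# Hypothetical endpoint URL; a real service would issue its own address.
ENDPOINT = "http://example.org/repositories/demo"

QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(body: str) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON results document into plain dicts."""
    doc = json.loads(body)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# To run against a live endpoint (requires network access):
#   with urllib.request.urlopen(build_request(ENDPOINT, QUERY)) as resp:
#       rows = parse_bindings(resp.read().decode("utf-8"))
```

The point of the sketch is the contrast with static dumps: an application sends a query and gets back only the rows it needs, instead of downloading and parsing a whole file.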
Summary
• Open Data has big potential for governments, enterprises and citizens
• Lots of open datasets available, but very few actually used
• Linked Data is a promising technology for Open Data, but still difficult to use for publishers and application developers
• DaPaaS – enabling low-cost Open (Linked) Data publishing and reuse
– Platform, portal, methodology, APIs
– Repeatable and scalable data transformations
– Scalable Linked Data hosting