Virtuoso, The Prometheus of RDF -- SEMANTiCS 2014 Conference Keynote
By Orri Erling, Virtuoso Program Manager
OpenLink Software
Virtuoso: The Prometheus of
RDF-based Relational Data Management
Linked Data at Dawn
The Promise and the Practice
The Science of Speed
The Structure which Is
Ongoing Research
License CC-BY-SA 4.0 (International).
Linked Data Promises
RDF is a generic, minimalistic model for describing things
RDF has global identifiers and data is self-describing
URIs may be dereferenceable
RDF is flexible to query, does not force a single hierarchical view like XML
Linked Data Scenarios
RDF is used because of
schema flexibility
global identifiers
Inference, if present, is usually trivial
Subclass
Sub-property
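The kind of trivial inference listed above can be sketched in a few lines of Python. The class and property names are made up for illustration, and this is only a minimal sketch of subclass and sub-property expansion, not a description of how Virtuoso implements it.

```python
RDF_TYPE = "rdf:type"

def transitive_supers(direct):
    """Map each term to all of its ancestors, given term -> direct parents."""
    def visit(term, seen):
        for parent in direct.get(term, ()):
            if parent not in seen:  # the seen set also guards against cycles
                seen.add(parent)
                visit(parent, seen)
        return seen
    return {term: visit(term, set()) for term in direct}

def infer(triples, subclass_of, subproperty_of):
    """Return the input triples plus those implied by the two hierarchies."""
    class_supers = transitive_supers(subclass_of)
    prop_supers = transitive_supers(subproperty_of)
    inferred = set(triples)
    for s, p, o in triples:
        if p == RDF_TYPE:
            # An instance of a class is an instance of every superclass.
            for sup in class_supers.get(o, ()):
                inferred.add((s, RDF_TYPE, sup))
        # A triple with a property also holds for every super-property.
        for sup in prop_supers.get(p, ()):
            inferred.add((s, sup, o))
    return inferred

# Hypothetical example data.
subclass_of = {"ex:Keynote": {"ex:Talk"}, "ex:Talk": {"ex:Event"}}
subproperty_of = {"ex:keynoteSpeaker": {"ex:speaker"}}
triples = {
    ("ex:k1", RDF_TYPE, "ex:Keynote"),
    ("ex:k1", "ex:keynoteSpeaker", "ex:orri"),
}
result = infer(triples, subclass_of, subproperty_of)
```

Here `result` holds the two asserted triples plus three inferred ones: `ex:k1` typed as `ex:Talk` and `ex:Event`, and `ex:k1 ex:speaker ex:orri`.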
Where Triples Come From
Relational extracts or web content are converted to and stored as triples
NLP extraction
New applications with RDF as primary data model
Doing SPARQL against data in RDBs is possible but is rare and does not deliver the flexibility
Linked Data Verticals and Patterns
Publishing: tagging & annotations, evolving vocabularies
Archives: self description, long term identifiers, many versions of schema
Semantic search: structured, semi-structured, and full text, all in one
Business intelligence: many sources, ease of adding sources, no 6 month DW schema change cycle
E-science, often in life sciences: common interchange format, nano-publications, NLP extracts, different users cook their data differently, provenance
The Hopes and Perceptions
The age of ad hoc
Find insight in any data, when you need it, from any source, any format
No data warehouse planning cycles; make your own from the pieces you need, when you need it
Still, data integration remains hard work; quality and coverage of sources vary
Flexibility may be there, but are performance and scalability on the level?
Yes, But ...
Web and Big Data: Everybody reinvents the triple. Self-description, long term identifiers, key-value pairs in many non-RDF use cases
SPARQL and RDF would be the natural, standards-compliant choice if they beat SQL, information retrieval, custom big data, key-value, and map-reduce solutions
Is this intrinsic to linked data or is this lack of engineering?
Linked data has unique advantages in breadth of coverage and expressivity but performance must not lag behind.
What is the RDF Tax?
90% of bad performance comes from non-optimal query plans
Some comes from indexing too much (e.g., SQL bulk load with no indices is 50x faster than the equivalent in RDF with all indexed)
Some comes from string ops on URIs, literals
Some comes from having a join for every attribute. Vectoring and right plans help, though
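To make the "join for every attribute" point concrete, here is a small Python sketch that counts lookups when reading three attributes of one entity from a row versus from triples. The order data and the dictionary-based index are hypothetical stand-ins chosen for illustration, not Virtuoso's actual storage layout.

```python
# A relational row: one lookup returns every attribute at once.
table = {
    "order1": {"customer": "acme", "date": "2014-09-04", "total": 120},
    "order2": {"customer": "initech", "date": "2014-09-05", "total": 75},
}

# The same data as triples, indexed by (predicate, subject) -> object,
# loosely resembling a predicate-first triple index (an assumption).
triples = {
    (attr, key): value
    for key, row in table.items()
    for attr, value in row.items()
}

def fetch_from_table(key, attrs):
    lookups = 1  # one row access covers all requested attributes
    row = table[key]
    return {a: row[a] for a in attrs}, lookups

def fetch_from_triples(key, attrs):
    lookups = 0
    result = {}
    for a in attrs:  # one index probe, i.e. one join step, per attribute
        result[a] = triples[(a, key)]
        lookups += 1
    return result, lookups

attrs = ["customer", "date", "total"]
row_result, row_lookups = fetch_from_table("order1", attrs)
triple_result, triple_lookups = fetch_from_triples("order1", attrs)
print(row_lookups, triple_lookups)  # 1 3
```

Both paths return the same values; the triple path simply pays one probe per attribute, which is the overhead that vectoring and good plans are meant to reduce.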
The Bane of the Triple
When data is stored as triples:
There is structure still but it is harder to exploit. Schema re-emerges as correlations
More joins make more possible query plans, bigger errors in plan cost estimation
More joining reduces locality
Lack of schema causes needless indexing; data takes more space
A URI for everything takes space and time
For the same workload, Virtuoso SQL can also be 2–20x faster than Virtuoso SPARQL
The Question is Raised
LOD2 FP7, now ending: RDF Performance parity with relational?
SQL is the senior science. Who ignores history is bound to repeat it
Integral mastery of RDB science is a prerequisite, but do not forget the subtle twists of schema-less-ness
Virtuoso RDF Relational DBMS Leadership
2000–2006, v1.x–4.x: SQL row store with SQL federation and XML
2007–2008, v5.x–6.x: SPARQL, adapted for RDF quads with more compression, bitmap indices, special data types, RDF awareness in query optimization
2009, v6.x: Scale-out cluster-capable
2010–2013, v7.x: Column store, vectored execution, 3x more space efficient, 10+x more speed
2013: Star Schema benchmark with SPARQL, 100x MySQL SQL, 0.8x MonetDB SQL
2014: Top of the line SQL analytics, 500 Gtriples, Structure Awareness
Triples Done Right, so?
Column-store techniques are a good fit; index-based triple storage does not get much better
RAM-only pointer-based techniques can be faster but cost 10–100x more to scale up
To take RDF to SQL parity, Virtuoso must first be on the level with the best in SQL
TPC-H is the checklist for mastery of DW and query optimization; who survives shall not fear
Parity is achieved when running with triples, just like with tables
Structure is Everywhere
CWI in LOD2:
90% of triples in Common Crawl fall into 20 tables
All relational extractions are 100% tables
Even DBpedia is 90% covered by 500 tables, but is unusually heterogeneous, albeit not very large
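One way to arrive at coverage figures of this kind is to group subjects by the exact set of properties they carry, treat each distinct set as a candidate table, and ask how many triples the largest few tables account for. The Python sketch below does that on made-up data; it is an illustrative assumption about the measurement, not necessarily the method CWI used.

```python
from collections import Counter, defaultdict

def coverage_by_top_tables(triples, k):
    """Fraction of triples whose subject belongs to one of the k most
    triple-heavy property sets (each property set = one candidate table)."""
    props_of = defaultdict(set)
    triple_count = Counter()
    for s, p, _ in triples:
        props_of[s].add(p)
        triple_count[s] += 1

    # Sum the triples of all subjects that share the same property set.
    triples_per_table = Counter()
    for s, props in props_of.items():
        triples_per_table[frozenset(props)] += triple_count[s]

    covered = sum(n for _, n in triples_per_table.most_common(k))
    return covered / len(triples)

# Hypothetical data: three "person" subjects, one "city", one oddity.
triples = [
    ("p1", "name", "Ann"), ("p1", "age", 31),
    ("p2", "name", "Bob"), ("p2", "age", 45),
    ("p3", "name", "Cy"), ("p3", "age", 28),
    ("c1", "label", "Oslo"), ("c1", "population", 650000),
    ("x1", "note", "odd one out"),
]
print(coverage_by_top_tables(triples, 1))  # 6 of 9 triples
print(coverage_by_top_tables(triples, 2))  # 8 of 9 triples
```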
The Glorious Dawn: Structure is the Servant, not the Tyrant
A set of subjects with all the same single-valued properties is in fact a table. So, store it as a table
Allow exceptions, e.g., sometimes multiple values, different values in different graphs, extra properties, etc.
If it is big, it has repeating structure
All RDF semantics are preserved; any triple is possible, but the common ones are SQL compact and SQL fast
With tables, query optimization returns to SQL complexity and is much more reliable
So, more tricks from the SQL analytics bag become safe and applicable
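The idea of storing regular subjects as table rows while keeping exceptions as plain triples can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and sample data are hypothetical, graphs are ignored for brevity, and nothing here reflects Virtuoso's actual implementation.

```python
from collections import defaultdict

def split_into_table(triples, columns):
    """Pivot triples into rows for the declared columns; everything that
    does not fit (extra properties, repeated values) stays as a triple."""
    rows = defaultdict(dict)
    exceptions = []
    for s, p, o in triples:
        if p in columns and p not in rows[s]:
            rows[s][p] = o  # first value of a declared property goes in the row
        else:
            exceptions.append((s, p, o))  # multi-value or undeclared property
    return dict(rows), exceptions

def to_triples(rows, exceptions):
    """Reassemble the original triple set: nothing is lost by the pivot."""
    out = {(s, p, o) for s, row in rows.items() for p, o in row.items()}
    return out | set(exceptions)

# Hypothetical data: p2 has a second name, p3 has an undeclared property.
triples = [
    ("p1", "name", "Ann"), ("p1", "age", 31),
    ("p2", "name", "Bob"), ("p2", "age", 45), ("p2", "name", "Robert"),
    ("p3", "name", "Cy"), ("p3", "nickname", "C"),
]
rows, exceptions = split_into_table(triples, columns={"name", "age"})
print(rows["p2"])   # {'name': 'Bob', 'age': 45}
print(exceptions)   # [('p2', 'name', 'Robert'), ('p3', 'nickname', 'C')]
```

Because `to_triples(rows, exceptions)` reproduces the input set, any triple remains expressible; the table is just a more compact home for the common case.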
Gains from Structure Awareness
3+x Load Speed
2x more space efficiency
SPARQL queries against regular data within 10–20% of SQL speeds
Just declare which properties tend to occur together; no strict schema-first like with SQL
Later, self configuration
The Cycle of Adventure
Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web
Pioneers: Life on the frontier is hard, infrastructure missing or bad
Same everyday problems also in Utopia
Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma
Reconciliation: schema-first and schema-last converge in structure awareness
Present FP7 Research
LDBC - Transparency and Relevance for Graph DB, RDF performance
GeoKnow - GeoData is everywhere, how to carry the planet in your pocket
LOD2 - Where no triple has gone before (and come back)
Open PHACTs - A Data Platform for Drug Discovery
LDBC - Linked Data Benchmark Council
Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web
Pioneers: Life on the frontier is hard, infrastructure missing or bad
Same everyday problems also in Utopia
Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma
Reconciliation: Some of the rebel thinking becomes mainstream, e.g., schema-first and schema-last converge in structure awareness
LDBC, Independent Industry Forum for Benchmarking
The TPC for the frontiers of database
Bootstrapped in the LDBC FP7, continues as independent industry association
OpenLink, Ontotext, Neo Technology, Sparsity as founding members
IBM, Oracle Labs, Systap, SPARQL City have already joined
DB superstars Peter Boncz and Thomas Neumann as founders and scientific leads
LDBC Benchmarks
Social Network
Online - Lookups, updates, analysis of social environment
Business Intelligence - Spotting trends, key players, big query
Graph analytics - Community detection, PageRank, graph metrics
Semantic Publishing
Modeled after the BBC linked data portal, online lookups, drill downs and updates
GeoKnow - The Planet in your Pocket
Ms. Globe and Mr. Cube have a thing going on:
Mr. Cube: Desiloization ... integrated metadata ... explicit semantics ...
Ms. Globe: I can feel it ... but are you man enough? ... you need to show me.
Planet Scale Roadmap
Jan 2014:
Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide OpenStreetMap
Virtuoso SQL adds 5x more power
Next: Jan 2015
Parity between SPARQL and SQL via structure awareness
Geospatial data clustering
Graph analytics close to the data — Pregel, Giraph, etc., in the DB itself
Adding fine-grained geo dimension to LDBC social network benchmark
The LOD2 scaling adventures
Experiments at CWI’s SciLens cluster
Jan 2013: 150 Gtriples (8 x 256GB RAM)
Aug 2014: 500 Gtriples (12 x 256GB RAM)
Some trillion-triple claims exist, but do not detail any query workload
BSBM explore and BI workloads
10x speed gains for BI queries between 2013 and 2014
Bulk load at 6M triples/s
All done in triples, structure awareness will go further still
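For a rough sense of scale, the figures above can be combined with some back-of-the-envelope arithmetic in Python. Two assumptions are made that the slides do not state: that the 6M triples/s bulk-load rate would be sustained over the whole 500 Gtriple data set, and that GB means 10^9 bytes. The RAM figure is simply aggregate cluster memory divided by triple count, not a measured storage footprint.

```python
def load_hours(triples, rate_per_s):
    """Hours needed to load `triples` at a constant rate (an idealization)."""
    return triples / rate_per_s / 3600

def ram_bytes_per_triple(nodes, gb_per_node, triples):
    """Aggregate cluster RAM divided by triple count, taking 1 GB = 1e9 bytes."""
    return nodes * gb_per_node * 1e9 / triples

print(round(load_hours(500e9, 6e6), 1))                # 23.1
print(round(ram_bytes_per_triple(8, 256, 150e9), 1))   # 13.7
print(round(ram_bytes_per_triple(12, 256, 500e9), 1))  # 6.1
```

Under these assumptions, a 500 Gtriple load takes roughly a day, and the 2014 configuration has less than half the RAM per triple of the 2013 one.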
Open PHACTs
Partners: [partner logos not reproduced in the transcript]
Virtuoso Now
Snapshot of RDF Linked Data customers in the Enterprise:
Data.Gov (U.S. Govt. Open Linked Data initiative)
Bank of America
Booz Allen Hamilton
Northrop Grumman
Elsevier
French National Library
Samsung
Globo
Daimler Benz
Johnson & Johnson
Bayer
St Jude's Medical
Fujitsu
Syngenta
and many more
Virtuoso Availability
Most capabilities as open source
Commercial adds:
Cluster scale-out
SQL Federation
Replication (SQL & RDF)
Advanced RDF security; ABAC & RBAC (ACLs)
Wide tables
and more
Up to the minute tech previews via v7fasttrack on GitHub, e.g., superfast TPC-H implementation
Virtuoso Future
Preview of structure-aware RDF store in fall 2014 via v7fasttrack
Integrated graph analytics framework
Embed complex graph algorithms, e.g., community detection, shortest path inside SPARQL/SQL
Comparison of SQL and SPARQL for big data analytics
Linked Data Now
Adoption across major industries
Superior flexibility and time to solution
Dramatic performance gains in the last 5 years
Benchmarking will continue to drive progress, to the benefit of users and vendors alike
Run circles around most open source SQL in SPARQL:
Virtuoso SPARQL beats MySQL in SSB by 100x
With structure awareness, SPARQL to match the best in SQL for data warehousing, OLTP
Linked Data no longer a long shot but a technology that makes sense
About OpenLink Software
OpenLink Software is a privately held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry-acclaimed technology innovator in the following areas:
ODBC, JDBC, ADO.NET, and OLE DB compliant Data Access Drivers for Oracle, Microsoft SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology
Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)
Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
Web Application Server Technology
Linked Data Deployment & Management
Identity Management
Office Locations
USA
OpenLink Software, Inc 10 Burlington Mall Road Suite 265 Burlington, MA 01803 Tel.: +1 781 273 0900 Fax: +1 781 229 8030
UK
OpenLink Software Ltd. Airport House Purley Way Croydon, Surrey CR0 0XZ Tel.: +44 (0)20 8681 7701 Fax: +44 (0)20 8681 7702
Additional Information
Web Sites
OpenLink Software
YouID – Digital Identity Card (Certificate) Generator
OpenLink Data Spaces – Semantically enhanced Personal & Enterprise Data Spaces & Collaboration Platform
OpenLink Virtuoso - Hybrid Data Management, Integration, Application, and Identity Server
Universal Data Access Drivers - High-Performance ODBC, JDBC, ADO.NET, and OLE-DB Drivers
LDAP and NetID-TLS – How to use LDAP scheme URIs with NetID-TLS Authentication
Social Media Data Spaces
http://www.openlinksw.com/weblog/oerling/ (Orri Erling weblog)
http://kidehen.blogspot.com (Kingsley Idehen weblog)
http://www.openlinksw.com/blog/~kidehen/ (Kingsley Idehen weblog)
https://twitter.com/OpenLink (Twitter)
Hashtags: #LinkedData #SemanticWeb #BigData #RDF (Anywhere).