Ben Szekely April, 2007
© 2007 IBM Corporation
Boca – features of an enterprise-ready Semantic Web storage system
part of the open-source IBM Semantic Layered Research Platform http://ibm-slrp.sourceforge.net
Ben Szekely, April 2007
IBM Internet Technology
Features of an Enterprise-ready Semantic Web Storage System – SICoP 2007 © 2007 IBM Corporation
RDF: Representing data as a graph
RDF models data’s content and meaning, rather than just its structure or serialization
RDF can more accurately represent the entities being modeled
– Real objects, concepts, and processes often have a ragged shape
– Can represent objects with complex structures directly without exposing implementation techniques
Data schemas do not need to be determined a priori
RDF: Example
RDF describes relationships as a directed graph with labeled nodes and edges.
Each edge is a triple: (Subject, Predicate, Object)
– (DavidJGrossman, name, “David J Grossman”)
– (DavidJGrossman, phone, “693-0120”)
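The graph above can be written down as a set of (subject, predicate, object) tuples. The following is a minimal Python sketch; the tuple representation and the objects_of helper are illustrative assumptions, not part of RDF or of Boca.

```python
# Each RDF statement is one (subject, predicate, object) triple.
# Short names are used here as on the slide; real RDF would name things with URIs.
triples = {
    ("DavidJGrossman", "name", "David J Grossman"),
    ("DavidJGrossman", "phone", "693-0120"),
}


def objects_of(subject, predicate):
    """Return every object reached from `subject` along edges labeled `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


print(objects_of("DavidJGrossman", "phone"))
```

Because the data is just a set of edges, a subject with no phone or with several phones needs no schema change, which is the "ragged shape" point made earlier.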
RDF: Giving resources and relationships unique names
Name everything with URIs (Uniform Resource Identifiers)
Ensures that resources, attributes, relationships, and data types have unique names that can be widely shared
Delegates identifier creation down to smaller expert groups
Can often be dereferenced to find their defined meaning
Are often long enough to be human readable
http://www.ibm.com/people/DavidJGrossman instead of DavidGrossman or Emp12345
Most examples of RDF triple stores focus on specific difficult problems
Focused on inference or standards
Preoccupied with “Billions of Triples”
Little thought given to application programming model
Not multi-user (limited security)
Boca Overview – Multi-user, distributed enterprise RDF repository
Selective RDF replication from server to client machines
Security, including named-graph-based RDF access control
Audit trails of changes to data within named graphs
Near real-time event notifications
Sophisticated programming model
Underlying Technologies
Relational Database (DB2, Oracle, MySQL)
– RDF triples stored in a table (subject, predicate, object, named graph)
– Save space by normalizing URIs and strings to integer ids.
– Extra tables for history, ACLs, replication
J2EE (Jetty, Tomcat, WebSphere)
– Jetty: Standalone server, check out from CVS and run for testing
– WebSphere Application Server (WAS): Enterprise-ready Web-application server for real deployment
JMS Server (ActiveMQ, WebSphere MQ)
– pub-sub messaging used for real-time notifications of triple updates.
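The relational layout described above (one statement table plus integer ids for URIs and strings) might look roughly like the sketch below. This is an assumption-laden illustration using SQLite from the Python standard library, not Boca's actual schema: the table names, column names, the example.org URIs, and the node_id and add_statement helpers are all hypothetical. A real store would also distinguish URIs from literals, which this sketch does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Every distinct URI or string is stored once and referred to by integer id.
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE NOT NULL);
    -- One row per statement: subject, predicate, object, named graph.
    CREATE TABLE statements (
        subject     INTEGER NOT NULL REFERENCES nodes(id),
        predicate   INTEGER NOT NULL REFERENCES nodes(id),
        object      INTEGER NOT NULL REFERENCES nodes(id),
        named_graph INTEGER NOT NULL REFERENCES nodes(id),
        UNIQUE (subject, predicate, object, named_graph)
    );
    """
)


def node_id(value):
    """Return the integer id for a URI or string, creating it on first use."""
    conn.execute("INSERT OR IGNORE INTO nodes (value) VALUES (?)", (value,))
    return conn.execute("SELECT id FROM nodes WHERE value = ?", (value,)).fetchone()[0]


def add_statement(s, p, o, g):
    """Insert one statement, normalizing all four terms to integer ids."""
    ids = tuple(node_id(v) for v in (s, p, o, g))
    conn.execute("INSERT OR IGNORE INTO statements VALUES (?, ?, ?, ?)", ids)


GRAPH = "http://example.org/graphs/people"  # hypothetical named graph URI
PERSON = "http://www.ibm.com/people/DavidJGrossman"
add_statement(PERSON, "http://example.org/name", "David J Grossman", GRAPH)
add_statement(PERSON, "http://example.org/phone", "693-0120", GRAPH)

node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
statement_count = conn.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
```

The long person and graph URIs are each stored once in nodes, however many statements mention them, which is where the space saving comes from.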
Named Graphs
A named graph is the logical unit of RDF storage in Boca.
The named graph is the first-order unit of data access in the Boca programming model.
Each triple exists in exactly one named graph
– The same S,P,O in two different graphs implies two separate statements.
– Adding and removing triples is done in the context of a named graph
Each named graph has a metadata graph, containing information such as ACLs
Named graphs can be exposed via URLs, Web Services, LSIDs
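To illustrate the rule that the same S,P,O in two graphs is two separate statements, here is a small in-memory sketch. The GraphStore class, its method names, and the example.org graph URIs are assumptions made for illustration; they are not the Boca API.

```python
from collections import defaultdict


class GraphStore:
    """Toy store in which triples are always added or removed within a named graph."""

    def __init__(self):
        self._graphs = defaultdict(set)  # graph URI -> set of (s, p, o)

    def add(self, graph, s, p, o):
        self._graphs[graph].add((s, p, o))

    def remove(self, graph, s, p, o):
        self._graphs[graph].discard((s, p, o))

    def graphs_containing(self, s, p, o):
        """List the named graphs that hold a statement with this (s, p, o)."""
        return sorted(g for g, ts in self._graphs.items() if (s, p, o) in ts)


store = GraphStore()
triple = ("http://www.ibm.com/people/DavidJGrossman", "phone", "693-0120")
store.add("http://example.org/graphs/hr", *triple)          # hypothetical graph
store.add("http://example.org/graphs/directory", *triple)   # hypothetical graph
store.remove("http://example.org/graphs/hr", *triple)
remaining = store.graphs_containing(*triple)
```

Removing the statement from one graph leaves the copy in the other graph untouched, so remaining still lists the directory graph.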
Replication
Boca clients have a persistent local RDF store that mirrors a subset of the triples on the Boca server.
Replicated subset specified by:
– Triple patterns; e.g.
(<http://tdwg.org/meetings/GUID-2#>, <http://tdwg.org/preds/hasParticipant>,*)
– Named graph URIs
– Triple patterns within named graphs
When a replication is initiated, the service computes what has changed in the subset based on pattern and graph subscriptions.
Replication can work as a background process on the client, or be explicitly initiated.
Applications can query/write against graphs in the local and server models.
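The three kinds of subscription listed above (triple patterns, named graph URIs, and triple patterns within named graphs) can be sketched as a filter over (subject, predicate, object, graph) tuples. This is only an illustration of subset selection: it does not compute what has changed since the last replication, and the function names, the "*" wildcard encoding, and all example.org URIs are assumptions rather than Boca's API.

```python
WILDCARD = "*"


def matches(pattern, triple):
    """True when every non-wildcard position of `pattern` equals the triple's term."""
    return all(p == WILDCARD or p == t for p, t in zip(pattern, triple))


def replicated_subset(quads, patterns=(), graphs=(), graph_patterns=()):
    """Select the (s, p, o, graph) quads that a client has subscribed to."""
    selected = set()
    for s, p, o, g in quads:
        triple = (s, p, o)
        if (
            any(matches(pat, triple) for pat in patterns)
            or g in graphs
            or any(g == pg and matches(pat, triple) for pg, pat in graph_patterns)
        ):
            selected.add((s, p, o, g))
    return selected


MEETING = "http://tdwg.org/meetings/GUID-2#"
HAS_PARTICIPANT = "http://tdwg.org/preds/hasParticipant"
quads = [  # hypothetical server contents
    (MEETING, HAS_PARTICIPANT, "http://example.org/people/alice",
     "http://example.org/graphs/meetings"),
    (MEETING, "http://example.org/preds/room", "Room 1",
     "http://example.org/graphs/meetings"),
    ("http://example.org/people/alice", "http://example.org/preds/name", "Alice",
     "http://example.org/graphs/people"),
]
subset = replicated_subset(
    quads,
    patterns=[(MEETING, HAS_PARTICIPANT, WILDCARD)],
    graphs={"http://example.org/graphs/people"},
)
```

Here the participant statement is selected by the triple pattern and the name statement by the graph subscription, while the room statement stays on the server.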
Notification – maintaining the replica in real-time
Updates to named graphs on server are published in near real-time to clients.
Local replicas can be kept up-to-date between replications.
Notification is central to distributed RDF applications
– Ex: workflow, collaboration
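The publish/subscribe flow that keeps a replica current between replications can be sketched in-process as below. Boca uses a JMS server for this; the NotificationHub and LocalReplica classes here are hypothetical stand-ins that only show the shape of the interaction, and the graph URIs are made up.

```python
class NotificationHub:
    """Stand-in for a pub-sub server: routes graph updates to subscribers."""

    def __init__(self):
        self._subscribers = {}  # graph URI -> list of callbacks

    def subscribe(self, graph, callback):
        self._subscribers.setdefault(graph, []).append(callback)

    def publish(self, graph, added=(), removed=()):
        for callback in self._subscribers.get(graph, []):
            callback(graph, set(added), set(removed))


class LocalReplica:
    """Client-side copy of the subscribed graphs."""

    def __init__(self):
        self.graphs = {}  # graph URI -> set of (s, p, o)

    def apply_update(self, graph, added, removed):
        triples = self.graphs.setdefault(graph, set())
        triples.difference_update(removed)
        triples.update(added)


G = "http://example.org/graphs/tasks"  # hypothetical graph URI
t1 = ("task1", "status", "open")
t2 = ("task1", "owner", "alice")

hub = NotificationHub()
replica = LocalReplica()
hub.subscribe(G, replica.apply_update)
hub.publish(G, added={t1, t2})
hub.publish(G, removed={t1})
# Updates to graphs the client did not subscribe to never reach the replica.
hub.publish("http://example.org/graphs/other", added={("x", "y", "z")})
```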
Access Controls
Boca users can have the following system-wide permissions:
– canInsertNamedGraphs -- a user must have this permission in order to create a new named graph (i.e. insert statements into a graph that does not yet exist in the system)
Boca users can have the following per-named-graph permissions (these apply also to the system graph):
– canRead -- a user with this permission may view the triples in the named graph and in its metadata graph
– canAdd -- a user with this permission may insert new triples into the named graph
– canRemove -- a user with this permission may remove triples from the named graph
– canChangeNamedGraphACL -- a user with this permission may change the ACL triples in the metadata graph
– canRemoveNamedGraph -- a user with this permission may entirely remove the named graph from the system
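A rough sketch of how the permissions above could gate an insert. The permission names come from the slide; the AccessControl class, its attributes, and the user and graph names are assumptions for illustration. The sketch also does not model what permissions the creator of a new graph receives, since the slide does not say.

```python
class AccessControl:
    """Toy permission check for inserts into named graphs."""

    def __init__(self):
        self.system_permissions = {}  # user -> set of system-wide permissions
        self.graph_permissions = {}   # (user, graph) -> set of per-graph permissions
        self.existing_graphs = set()

    def allows(self, user, graph, permission):
        return permission in self.graph_permissions.get((user, graph), set())

    def may_insert(self, user, graph):
        # Inserting into a graph that does not yet exist creates it, which
        # requires the system-wide canInsertNamedGraphs permission.
        if graph not in self.existing_graphs:
            return "canInsertNamedGraphs" in self.system_permissions.get(user, set())
        return self.allows(user, graph, "canAdd")


G = "http://example.org/graphs/people"  # hypothetical existing graph
NEW = "http://example.org/graphs/new"   # hypothetical graph not yet created

acl = AccessControl()
acl.existing_graphs.add(G)
acl.graph_permissions[("alice", G)] = {"canRead", "canAdd"}
acl.system_permissions["bob"] = {"canInsertNamedGraphs"}
```

With this setup alice may add to the existing graph but may not create a new one, and bob may create a new graph but may not add to the existing one.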
Versioning
SVN-like approach to versioning
When a triple is added to or removed from a named graph, a new revision of that named graph is created.
Simple API for reading old revisions
Provides a straightforward mechanism for concurrent distributed computing.
– When a client submits an update to a named graph, it may specify the version number that it currently has. The update will fail if the graph has been more recently modified.
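The revision check described above is a form of optimistic concurrency. A minimal sketch follows, assuming one new revision per update call and full snapshots per revision for simplicity; the VersionedGraph class, StaleRevisionError, and the expected_revision parameter are hypothetical names, not Boca's API.

```python
class StaleRevisionError(Exception):
    """Raised when an update is based on an out-of-date revision."""


class VersionedGraph:
    def __init__(self):
        self._revisions = [frozenset()]  # revision 0 is the empty graph

    @property
    def revision(self):
        return len(self._revisions) - 1

    def read(self, revision=None):
        """Return the triples at `revision`, or at the latest revision by default."""
        return self._revisions[self.revision if revision is None else revision]

    def update(self, added=(), removed=(), expected_revision=None):
        """Apply a change; fail if the graph moved past `expected_revision`."""
        if expected_revision is not None and expected_revision != self.revision:
            raise StaleRevisionError(
                f"client has revision {expected_revision}, graph is at {self.revision}"
            )
        new = (self._revisions[-1] - set(removed)) | set(added)
        self._revisions.append(frozenset(new))
        return self.revision


t1 = ("task1", "status", "open")
t2 = ("task1", "owner", "alice")

graph = VersionedGraph()
r1 = graph.update(added={t1})
r2 = graph.update(added={t2}, expected_revision=r1)
try:
    # A second client still holding revision r1 tries to write.
    graph.update(removed={t1}, expected_revision=r1)
    conflict = False
except StaleRevisionError:
    conflict = True
```

The stale client gets an error instead of silently overwriting newer data, and it can re-read the graph and retry. Old revisions stay readable through read(r1).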
Querying Boca
Users may query Boca in a variety of ways.
– Query the complete database
– Query a subset of named graphs
– Query a particular named graph
– Query the local store
SPARQL: Querying any data as RDF
SPARQL is a SQL-like language for querying distributed RDF graphs
RDF can be created on-the-fly from any data source
SPARQL is designed to handle:
– Distributed data. Multiple distributed data sources can be queried at once because SPARQL addresses graphs by URI.
– Ragged data. The SPARQL OPTIONAL keyword lets users explore heterogeneous data in a single query.
– Unpredictable data. The ability to query for predicates and information about predicates makes SPARQL ideal for exploring new and unexpected data.
– Open-world assumption
Example:
– Show me all the AP stories where IBM is mentioned along with another Fortune 500 company. If present, also include the names of any analysts quoted in the article.
SPARQL: Example
Data graph for <http://…/DavidJGrossman>:
– name: “David J Grossman”
– phone: “693-0120”
– email: the address matched in the query
SELECT ?name ?phone
WHERE {
?person <email> "[email protected]" .
?person <phone> ?phone .
?person <name> ?name .
}
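To show how such a query binds its variables, here is a tiny pattern matcher in Python. It is a sketch of basic graph pattern matching only, not a SPARQL engine and not how Boca evaluates queries. The select and match_pattern functions are hypothetical, and the email addresses and the second person are placeholder data invented for the example, since the address on the slide is not reproduced in this transcript.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):              # a variable such as ?person
            if result.setdefault(term, value) != value:
                return None                   # conflicts with an earlier binding
        elif term != value:                   # a constant must match exactly
            return None
    return result


def select(variables, patterns, triples):
    """Return one tuple of values per way of matching all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return [tuple(b[v] for v in variables) for b in bindings]


PERSON = "http://www.ibm.com/people/DavidJGrossman"
OTHER = "http://example.org/people/someone-else"  # hypothetical second person
data = [
    (PERSON, "name", "David J Grossman"),
    (PERSON, "phone", "693-0120"),
    (PERSON, "email", "grossman@example.org"),    # placeholder address
    (OTHER, "name", "Someone Else"),
    (OTHER, "phone", "555-0100"),
    (OTHER, "email", "someone@example.org"),      # placeholder address
]

rows = select(
    ["?name", "?phone"],
    [
        ("?person", "email", "grossman@example.org"),
        ("?person", "phone", "?phone"),
        ("?person", "name", "?name"),
    ],
    data,
)
```

The first pattern fixes ?person to the one resource with the given email, and the remaining patterns read that resource's phone and name, so rows holds a single (name, phone) pair.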
Abandoned features – Collections, Statement ACLs & Reification
Collections – a statement can exist in multiple collections
– A more difficult programming model: what happens when I delete in the context of one collection?
– Expensive to maintain
– Not a widely accepted programming model (as named graphs are)
Statement-level ACLs
– Too expensive
– Difficult to program
– Not particularly useful, other than the odd, very important statement
– In that case, such a statement can live in its own named graph
Reification
– Queries were very difficult to formulate
– Most RDF applications do not deal with reification
– Reification semantics often confused with true quoting
– Reification is an arbitrary layer of indirection that can be solved with ontologies
Future Features
Arbitrary query-based replication/notification
Distributed servers
Building (Semantic (Web) Applications) atop Boca
Visualization of Semantic Data
Generation of Semantic Data via forms and drag ‘n’ drop.
SPARQL query interfaces
Semantic Annotation
Merging data from multiple sources
Semantic Web Application Challenges
The Boca API is relatively simple, but probably not simple enough for the breadth of Web developers we want to reach
Lack of good RDF tooling on the browser
Overwhelming choice of transport protocols for AJAX requests
Semantic content management
Binding of RDF data to DHTML widgets
Queso – A Semantic Web Content Management System
Boca Atom Publishing Protocol endpoint
– Data enters and leaves the system through APP REST API
– Post, Put, Get, Delete, Undelete, Purge
– Atom entries stored in Boca, binary content in file system
– Revision histories of entries and binary data provided via feeds
– Elaborate caching mechanisms
– Optimistic concurrency through Boca preconditions
RDF-DHTML Widget data binding system
– Collections of Dojo widgets, grouped into lenses
– A lens renders a named graph whose URI is of a certain rdf:type, or the results of a standing SPARQL query
– A JavaScript/HTTP servlet-based infrastructure manages replication of data between browser models/widgets and the server
– No RDF manipulation on browser!
Conclusion, future work, questions?
Boca will continue to be supported by IBM as an open source project for the foreseeable future
The number of adopters of Boca continues to grow, both within IBM and without