© Copyright Tucana Technologies, Inc. 2003-2004. All rights reserved. T004v03

Massive Scalability for RDF Storage and Analysis

Presented by

David Wood, CTO

Tom Adams, Sales Engineer

Andrew Newman, Software Engineer

Tucana Technologies, Inc.

Reston, Virginia USA

May 2004

Agenda

• The Tucana Knowledge Server and Kowari

• Where we fit

• Performance metrics & scaling

• Real-world deployment examples

• Where are we headed?

Tucana and Kowari

• The Tucana Knowledge Server is a secure, distributed, scalable, transaction-safe, native RDF database.
– Stores, manages and analyzes RDF data
– iTQL/RDQL query language support
– Single instance scales to 1B triples
– Federated query capability available
– JRDF & Jena API support
– Pluggable data models (full text, RDBMSs, etc.)
– Commercial (academic licenses available)
– 100% Java 1.4.2

• http://www.tucanatech.com/

Tucana and Kowari

• Kowari is the Open Source basis of the Tucana Knowledge Server
– MPL v1.1
– No security, limited APIs/documentation, no pluggable data models
– Limited data types (string, URI, date, datetime, number)
– Limited scaling (>10M triples on 32-bit, >50M on 64-bit)
– No graph-based analysis algorithm support (graph segment matching)

• http://www.kowari.org/

Colophon: Kowari is a small Australian marsupial and Tucana is a constellation in the Southern sky.

Tucana Knowledge Server (TKS) in Enterprise Architecture

Tucana Knowledge Server (TKS) Data Flow & Federation

Tucana System Interfaces

Data Sources

• RDF native

• Structured data sources (e.g. RDBMS) via importation

• Metadata from unstructured data sources via entity extractors

• XML or other tagged formats via XSLT

• Rich Site Summary (RSS) feeds

Access

• Web services (SOAP, WSDL)

• COM (ASP, etc.)

• JavaBean

• Java APIs

• JRDF & Jena

• JSP tag library

• XSLT Descriptors

• Query language

• Command line

• Web UI

• RDF/OWL editors/viewers via evolving industry APIs

Tucana Supported Platforms

• Runs 64- or 32-bit (requires Java 1.4.2)

• GNU/Linux on Intel or Opteron

• Sun Solaris on SPARC or Intel (Opteron coming Dec ’04)

• Windows on Intel
– NT4
– 2000
– XP

• Note: AIX, HP/UX, Mac OS X operational
– Future support on roadmap based upon customer demand

Performance Metrics

• Read/Write comparisons to RDBMSs when storing RDF

• Load performance

• Query execution performance

• Go triple crazy!

Read/Write Comparison

Load Performance

Query Execution Performance

Go triple crazy!

• 32-bit: about 100 million statements (using explicit I/O, which is now the default on 32-bit platforms)

• 64-bit: about a billion statements (using mapped I/O)
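
The difference between explicit and mapped I/O can be illustrated with a minimal Python sketch. It does not reproduce Tucana's storage code: the fixed-width record of three 64-bit node IDs, the file layout, and all function names are assumptions made for illustration. One relevant fact is that a mapped file has to fit inside the process's virtual address space, which is at most 4 GiB on a 32-bit platform, so explicit reads are the more natural choice there.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: three little-endian 64-bit node IDs per statement.
RECORD = struct.Struct("<QQQ")


def write_statements(path, statements):
    """Write (subject, predicate, object) ID triples as fixed-width records."""
    with open(path, "wb") as f:
        for s, p, o in statements:
            f.write(RECORD.pack(s, p, o))


def read_explicit(path, index):
    """Explicit I/O: seek to the record and copy its bytes into a buffer."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))


def read_mapped(path, index):
    """Mapped I/O: map the file into the address space and decode in place."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return RECORD.unpack_from(m, index * RECORD.size)


def demo():
    """Write three statements to a temporary file and read two of them back."""
    statements = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_statements(path, statements)
        return read_explicit(path, 1), read_mapped(path, 2)
    finally:
        os.remove(path)
```

Both readers return the same tuples; they differ only in whether the operating system pages the data in on demand (mapped) or the program issues each read itself (explicit).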

Why do we scale?

• Designed from the ground up to be scalable

• Optimized for reads/very fast writing

• Dealing with low level aspects of file system

• Have lots of room for further speedups
– Drop indices, increase triple block size, flatten tree

• Bottlenecks
– Virtual memory limits of OS
– Thread stacks
– Sharing same area of VM
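
The mention of indices and trees above suggests that statements are kept in more than one sorted order. The slides do not describe the actual index layout, so the following is only a sketch under assumptions: three hypothetical orderings (subject-predicate-object and its two rotations), with Python sorted lists standing in for on-disk trees. The class and method names are illustrative.

```python
from bisect import bisect_left

# Hypothetical orderings: subject-predicate-object and its two rotations.
ORDERINGS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


class TripleIndex:
    """Keeps every triple in three sorted orders (sorted lists stand in for trees)."""

    def __init__(self, triples):
        unique = set(triples)
        self.indexes = {
            name: sorted(tuple(t[i] for i in perm) for t in unique)
            for name, perm in ORDERINGS.items()
        }

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None means 'any value'."""
        pattern = (s, p, o)
        # Pick the ordering whose leading columns cover the most bound terms.
        # With these three rotations, every combination of bound terms is a
        # prefix of at least one ordering, so no residual filtering is needed.
        best_name, best_len = "spo", 0
        for name, perm in ORDERINGS.items():
            bound = 0
            for i in perm:
                if pattern[i] is None:
                    break
                bound += 1
            if bound > best_len:
                best_name, best_len = name, bound
        perm = ORDERINGS[best_name]
        prefix = tuple(pattern[i] for i in perm[:best_len])
        rows = self.indexes[best_name]
        results = []
        # Binary search to the first candidate row, then scan while the prefix holds.
        for row in rows[bisect_left(rows, prefix):]:
            if row[:best_len] != prefix:
                break
            triple = [None] * 3
            for position, i in enumerate(perm):
                triple[i] = row[position]
            results.append(tuple(triple))
        return results
```

For example, `TripleIndex(triples).match(p="knows", o="c")` is answered from the `pos` ordering with one binary search and a short scan. Each extra ordering makes some lookups cheaper at the cost of another write per statement, which is one way to read the "drop indices" item as a write-speed option.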

Real-World Deployment Examples

• Business Needs Satisfied

• Enterprise Software Company

• Automobile Manufacturer

• Genomics Research

• Defense Integrator

Business Needs Satisfied

• Get answers to questions
– Inferencing and discovery
– Change impact and dependency analysis
– Variable views of data elements and their relationships

• Unify disparate information sources
– Metadata repositories
– Unstructured information (MSOffice and PDF documents, email, content mgt, web pages, RSS sources/news feeds)
– Other complex data sources

• Share and re-use knowledge
– Within and between enterprises

Enterprise Software Company

• Critical Need: Provide automated document routing based on a business-specific ontology.

• Solution: Classify documents against ontology, store classifications and ontology in the Tucana Knowledge Server and build multiple business applications on top.

• Result: Standards-based metadata management unifies and delivers change impact analysis across a multi-application, distributed, staged software environment.

Auto Manufacturer

• Critical Need: Analyze quality test and measurement over time for trends
– Relying on entrenched vendor - a Tucana OEM
– OEM tried RDBMS – not an option

• Solution: Embed Tucana Knowledge Server into OEM’s existing product for test and measurement.

• Result: Enables high-value trend analyses that were not previously possible for the customer.

Genomics Research

• Critical Need: Collaborative project with big pharma
– Concerned with Oracle flexibility & “schema hell”
– Need scalability & secure collaborative environment

• Solution: Rapidly analyze data in Tucana Knowledge Server using application they co-develop with integration partner.

• Result: Deliver a collaborative research system for use with a strategic customer, accelerating joint discovery and giving both companies a competitive advantage.

Defense Integrator

• Critical Need: Intel agency overwhelmed with data
– Automated analysis to improve decision speed & accuracy
– Prototype software does not scale / agency requires COTS

• Solution: Deploy Tucana Knowledge Server with metadata extraction incumbent (SRA NetOwl) and scale to billions of records

• Result: More automated analysis and faster, more accurate decisions against large data volumes on a scalable COTS platform

Analyze Disparate Data - Now

[Diagram labels: Query Engine; RDF; Full Text (Lucene); RSS Feeds; RDF API; API]

Analyze Disparate Data - Soon

[Diagram labels: Query Engine; XML DB; XPath; SQL; RDBMS]

Analyze Disparate Data - Soon

[Diagram labels: Single Query Representation; Other Data Sources; RDBMS]

Note: Distributed queries already supported.
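
The idea of one query answered by several data sources can be sketched in a few lines of Python. This is not the TKS federation mechanism, which the slides do not detail. The `InMemoryStore` class, the `federated_query` function and the triple-pattern interface are all hypothetical stand-ins, and a real deployment would query remote servers rather than in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryStore:
    """Hypothetical stand-in for one data source holding (s, p, o) triples."""

    def __init__(self, name, triples):
        self.name = name
        self.triples = list(triples)

    def query(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None means 'any value'."""
        pattern = (s, p, o)
        return [
            t for t in self.triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        ]


def federated_query(stores, **pattern):
    """Send the same pattern to every store and merge the answers."""
    if not stores:
        return []
    # Query the stores concurrently; map() keeps results in store order.
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        partials = list(pool.map(lambda store: store.query(**pattern), stores))
    # Merge the partial results, dropping statements seen in an earlier store.
    seen, merged = set(), []
    for part in partials:
        for triple in part:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged
```

The caller writes a single pattern and never needs to know which source held each statement, which is the property the "Single Query Representation" label points at.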

Thank You

David Wood ([email protected])

Tom Adams ([email protected])

Andrew Newman ([email protected])

Tucana Technologies, Inc.

http://www.tucanatech.com/

http://www.kowari.org/