Blackray @ SAPO CodeBits 2009

47
1

description

These are the slides of the BlackRay Talk at SAPO CodeBits in Lisboa, Portugal on December 4th.

Transcript of Blackray @ SAPO CodeBits 2009

Page 1: Blackray @ SAPO CodeBits 2009

1

Page 2: Blackray @ SAPO CodeBits 2009

2

Presentation Agenda

➔ Brief History➔ Technology Overview➔ Positioning towards other Projects➔ Roadmap➔ The Team➔ Wrap-Up

Page 3: Blackray @ SAPO CodeBits 2009

Brief BlackRay History

Page 4: Blackray @ SAPO CodeBits 2009

4

What is BlackRay?

● BlackRay is a relational, in-memory database● SQL with JDBC and ODBC Driver support● Fulltext (Tokenized) Search in Text fields● Object-Oriented API Support● Persistence via Files● Scalable and Fault Tolerant● Open Source, Open Community● Available under the GPLv2

Page 5: Blackray @ SAPO CodeBits 2009

5

BlackRay History

● Concept of BlackRay was developed in 1999 for a Web Phone Directory at Deutsche Telekom

● Development of current BlackRay started in 2005, as a Data Access Library

● First Production Use at Deutsche Telekom in 2006● Evolution to Data Engine in 2007 and 2008● Open Source under GPLv2 since June 2009

Page 6: Blackray @ SAPO CodeBits 2009

6

Why Build Another Database?

● Rather unique set of requirements:● Phone Directory with approx. 80 Million subscribers● All queries in the 10-500 Millisecond range● Approximately 2000 concurrent users● Over 500 Queries per Second (sustained)● Updates once a day may only take several minutes● Index needs to be tokenized (SQL: CONTAINS)● Phonetic search● Extensive Wildcard queries (leading/midspan/trailing)

Page 7: Blackray @ SAPO CodeBits 2009

7

Decision to Implement BlackRay

● Decision was formed in Mid 2005● Designed as a lightweight data access engine● Implementation in C++ for performance and

maintainability● APIs for access across the network from different

languages (Java, C++)● Feature set was designed for our specific business

case in a particular project

Page 8: Blackray @ SAPO CodeBits 2009

8

Current Release

● Current release 0.9.0 released on June 12th, 2009● Production level quality● All relevant index functions for large scale search

applications● APIs in C++ and Java fully functional● SQL still only a small subset of the API functionality

Page 9: Blackray @ SAPO CodeBits 2009

9

Some more details

● Written in C++● Relies heavily on boost● Compiles well with gcc and Sun Studio 12● Behaves well on Linux, Solaris, OpenSolaris and

MacOS● Complete 64 Bit development and platform support● Use of cmake for multi platform support

Page 10: Blackray @ SAPO CodeBits 2009

Technology Overview

Page 11: Blackray @ SAPO CodeBits 2009

11

Why call it Data Engine?

● BlackRay is a hybrid between a relational database and a search engine → thus we call it „data engine“

● Database features:● Relational structure, with Join between tables● Wildcards and index functions● SQL and JDBC/ODBC

● Search Engine Features● Fulltext retrieval (token index)● Phonetic and similar approximation search● Extremely low latency

Page 12: Blackray @ SAPO CodeBits 2009

12

BlackRay Architecture

C++ API

Java API

Management Server

InstanceServer

Data Universe

<

Redo Log

Snapshots

SQLInterface

Postgres*Clients

L5: Multi-Values

L4: Multi-Tokens

L5: Multi-Values

L3: Row Index

L2: Postings

L1: Dictionary

5-Perspective Index

Python API

PHP API

Python API

C# API

Page 13: Blackray @ SAPO CodeBits 2009

13

Hierarchical Model

● Each BlackRay node can hold many Instances● One Management process per node● Each Instance runs in a separate Process● Instances are completely separated from each other● Snapshots (for persistence) are taken on an

Instance level● In an Instance, Schemas and Tables can be created● Within one Instance, Queries can span across

Tables and Schemas

Page 14: Blackray @ SAPO CodeBits 2009

14

Getting Data Into BlackRay

● Once an Instane is created, data can be loaded● Schemas, Tables, and Indexes are created using an

XML description language ● Standard loader utility to load CSV data● Bulk loading is done with logging disabled● Indexing is done with maximum degree of

parallelism depending on CPUs● After all data is indexed, a snapshot can be taken

Page 15: Blackray @ SAPO CodeBits 2009

15

Basic Load Performance Data

● German yellowpage and whitepage data● 60 Million subscribers● 100 Million phone numbers● Raw data approx 16GB

● Indexing performance● Total index size 11GB● Time to index: 40 Minutes, on dual 2GHz Xeon (Linux)● Updates: 300MB, 200K rows in approx 5 minutes

● Time to load snapshot: 3.5 Minutes for 11GB

Page 16: Blackray @ SAPO CodeBits 2009

16

Data Universe

● BlackRay features a 5-Perspective Index ● Layer 1: Dictionary● Layer 2: Postings● Layer 3: Row Index● Layer 4: Multi-Token Layer● Layer 5: Multi-Value Layer

● Layer 1 and 2 comprise a fully inverted Index● Statistics in this Index used for Query Plan Building● All data - index and raw output - are held in memory

Page 17: Blackray @ SAPO CodeBits 2009

17

Snapshots and Data Versioning

● Persistence is done via file based snapshots● Snapshots consist of all schemas in one instance● Snapshots have a version number● To make a backup of the data, simply copy the

snapshot file to a backup media● It is possible to load an older snapshot: Data is

version controlled if older snapshots are stored● Note: Currently Snapshots are not fully portable.

Page 18: Blackray @ SAPO CodeBits 2009

18

Transactions in BlackRay

● BlackRay supports transactions via a Redo Log● All commands that modify data are logged if

requested ● In case of a crash, the latest snapshot will be loaded● Replay of the transaction log will then bring the

database back to a consistent state● Redo Log is eliminated when a snapshot is persisted● For better performance snapshots should be taken

periodically, ideally after each bulk update

Page 19: Blackray @ SAPO CodeBits 2009

19

Native Query APIs

● C++, Java and Python Object Oriented APIs are available

● Built with ICE (ZeroC) as the Network and Object Brokerage Protocol

● Query Objects are constructed using an Object Builder

● Execution via the network, Results as Objects● Load balacing and failover built into the protocol● Native support for Ruby, PHP and C# upcoming

Page 20: Blackray @ SAPO CodeBits 2009

20

Native Query APIs

● Queries can use any combination of OR, AND● Index functions (phonetic, sysnonyms, stopwords)

can be stacked● Token search (fulltext) and wildcard are also

supported● Advantage of the APIs:

● Minimized overhead● Very low latency● High availability

Page 21: Blackray @ SAPO CodeBits 2009

21

Standard Query Interface

● BlackRay implements the PostgreSQL server socket interface

● All PostgreSQL compatible frontends can be utilized against BlackRay

● JDBC, ODBC, native drivers using socket interface are all supported

● Limitations: SQL is only a reasonable subset of SQL92, Metadata is not yet fully implemented...

● Currently not all index functions can be accessed via SQL statements

Page 22: Blackray @ SAPO CodeBits 2009

22

Management Features

● Management Server acts as central broker for all Instances

● Command line tools for administration● SNMP management:

● Health check of Instances● Statistics , including access counters and performance

measurements

Page 23: Blackray @ SAPO CodeBits 2009

23

Clustering and Replication

● BlackRay supports a multi-node setup for fault tolerance

● Cluster have N query nodes and single update node● All query nodes are equal and require the full index

data available● Index creation is done offline on the update node● Query nodes can reload snapshots to receive

updated data, via shared storage or local copy● Load-balancing handled by native APIs

Page 24: Blackray @ SAPO CodeBits 2009

24

Administration Requirements

● In-Memory Databases require very little administrative tasks

● Configuration via one file per Instance● Disk layout etc all are of no importance● Backups are performed by copying snapshots● Recovery is done by restoring a snapshot● No daily administration required● SNMP allows remote supervision with common tools

Page 25: Blackray @ SAPO CodeBits 2009

Positioning BlackRay

Page 26: Blackray @ SAPO CodeBits 2009

26

BlackRay and other Projects

● BlackRay is being positioned as a Query intensive database addition

● Ideally suited where updates are done in Bulk and searches outweigh updates by many orders of magnitude

● Wildcards come with little overhead: No overhead for trailing wildcard, some overhead for leading and midspan wildcard

● Good match when index functions such as phonetic or word rotation/position search combined with relational data are required

Page 27: Blackray @ SAPO CodeBits 2009

27

FOSS Projects with similar Goals

● Relational Databases (the usual suspects): ● MySQL, MariaDB, Drizzle....● PostgreSQL...● Many more alternatives, including embedded etc....

● Fulltext Search: ● Sphinx● Lucene

● In-Memory Databases:● FastDB● HSQLDB/H2 (Java → Garbage collection issues....)

Page 28: Blackray @ SAPO CodeBits 2009

28

Commercial Alternatives

● Commercial In-Memory Databases● ORACLE/TimesTen (Acquired by ORACLE in 2005)● IBM/SolidDB (Acquired by IBM in 2007)● VoltDB (No real data available as of yet)● eXtremeDB (embedded use only)

● Dual-Licensed Alternatives● CSQL (Open Source Version is severely crippled)● MySQL with memcached

Page 29: Blackray @ SAPO CodeBits 2009

29

Is it the right thing for me?

● BlackRay is not designed as a 100% RDBMS replacement

● Questions to ask:● Do I need ad-hoc data updates, or are updates done in

bulk?● How important are fulltext search and extensive

wildcards?● How large is my data? Gigabytes: OK. Terrabytes: Not yet● Do I need a relational data model?● Is SQL an important feature?

Page 30: Blackray @ SAPO CodeBits 2009

30

BlackRay will fit you well when...

… searches outweigh updates

… data is primarily updated in bulk

… fulltext search is important

… you have Gigabytes (not Terabytes) of data

… index functions (phonetic …) are required

… SQL is necessary

… a relational data model is required (JOIN)

… source code must be available

… high availability/clustering is required

Page 31: Blackray @ SAPO CodeBits 2009

Project Roadmap

Page 32: Blackray @ SAPO CodeBits 2009

32

Immediate Roadmap

● Upcoming 0.10.0 – Due in December 2009● Complete rewrite of SQL Parser (boost::spirit2)● PostgreSQL client compatibility (via network protocol) to

allow JDBC/ODBC... via PostgreSQL driver● Rewritten CLI tools● Major bugfixes (potential memory leaks)● Authentication supported for Instances

● BlackRay Admin Console (Remora) 0.10

Page 33: Blackray @ SAPO CodeBits 2009

33

Immediate Roadmap

● Planned 0.11.0 – Due in February 2010● Support for ad-hoc data manilulation via SQL

(INSERT/UPDATE/DELETE) ● Aggregate functions for SELECT● Make all index functions available in SQL● Support for Prepared Statements (ODBC/JDBC) ● Improved thread and memory management (Perftools?)

● BlackRay Admin Console (Remora) 0.11● Engine Statistics via GUI● Cluster Node management

Page 34: Blackray @ SAPO CodeBits 2009

34

Midterm Roadmap

● Scalability Features● Sharding & Partitioning Options● Federated Search

● Fully portable snapshot format (across platforms)● Query Performance Analyzer● Improved Statistics Module with GUI● Removal of ICE as a core component● BlackRay as a Storage Backend for SUN OpenDS

LDAP Engine

Page 35: Blackray @ SAPO CodeBits 2009

35

Midterm Roadmap

● Security Features● Improved User and Access Control concepts● SSL for all connections● External User Store (LDAP/OpenSSO/PAM...)

● Increased Platform support● Windows 7 and Windows Server platforms● Embedded platforms

● Other, random features by popular request.

Page 36: Blackray @ SAPO CodeBits 2009

36

Longterm Roadmap

● Integration with other DBMSs● Storage Engine for other DBMSs

MariaDB/MySQL: → Depends on potential modification needs of the storage engine interfaces

● Trigger-based updates to support BlackRay as a Query cache instance over a regular DBMS

● Update: The Storage Engine Interface Issue was discussed at the OpenSQL Camp 2009 in Portland, Tokutek suggested a portable and more efficient Universal Storage Engine Interface

Page 37: Blackray @ SAPO CodeBits 2009

37

Longterm Roadmap

● Standalone Engine Improvements● SQL92 compliance: SUBSELECT/UNION support● Triggers in BlackRay ● Full ACID Transaction support

Page 38: Blackray @ SAPO CodeBits 2009

The Team behind BlackRay

Page 39: Blackray @ SAPO CodeBits 2009

39

SoftMethod GmbH

● SoftMethod GmbH initiated the project in 2005 and ● Company was founded in 2004 and currently has

10 employees● Focus of SoftMethod is high performance software

engineering● Product portfolio includes directory assistance and

LDAP enabled applications● SoftMethod also offers load testing and technical

software quality assurance support.

Page 40: Blackray @ SAPO CodeBits 2009

40

Development Team

● SoftMethod Core Team● Thomas Wunschel – Lead Developer● Felix Schupp – Project Sponsor● Andreas Meyer – Documentation, Porting● Frank Fiedler – PhD Student in Database Systems

● Key Contributors● Mike Alexeev – Senior Developer● Souvik Roy, Intern at SoftMethod GmbH● Andreas Strafner – Developer, Porting to AIX

Page 41: Blackray @ SAPO CodeBits 2009

41

Thomas Wunschel

● Director of Development, SoftMethod GmbH● Almost 10 years of development experience● Involved with BlackRay and its applications since

2005● Currently involved in the Network Protocol Stack● Lead Designer and Decision Lead for new Features

Page 42: Blackray @ SAPO CodeBits 2009

42

Felix Schupp

● Managing Director, SoftMethod GmbH● Over 10 years of commercial software development● Designer of first BlackRay predecessor in 1999● Project sponsor and spokesperson● Responsible for funding and applications● Guide and coach in the development process

Page 43: Blackray @ SAPO CodeBits 2009

43

Mike Alexeev

● Senior Software Developer● Over 10 years of C++ experience● First outside committer to BlackRay● Currently involved in rewriting the SQL grammar to

support all index features available via the APIs

Page 44: Blackray @ SAPO CodeBits 2009

44

Souvik Roy

● Computer Science Student and Software Developer ● Seceral years of C++ and Java experience● Current Intern at SoftMethod GmbH● Google Summer of Code participant● Working on reliable performance comparison with

other Engines● Provides additional sample applications

Page 45: Blackray @ SAPO CodeBits 2009

Wrap-Up

Page 46: Blackray @ SAPO CodeBits 2009

46

What to do next

● Get BlackRay:● Register yourself on http://forge.softmethod.de● SVN checkout available at

http://svn.softmethod.de/opensource/blackray/trunk● Get Involved

● Anyone can register and create tickets, news etc● We have an active mailing list for discussion as well

● Contribute● We require a signed Contributor agreement before being

allowed commit access to the repository

Page 47: Blackray @ SAPO CodeBits 2009

47

Contact Us

● Website: http://www.blackray.org● Twitter: http://twitter.com/dataengine● Facebook http://facebook.com/dataengine● Mailing List: http://lists.softmethod.de● Download: http://sourceforge.net/projects/blackray

● Felix: [email protected]● Thomas: [email protected]