Post on 31-Oct-2014
Which Freaking Database Should I Use?
Andrew C. Oliver
@acoliver
{Great Wide Open | Atlanta}
{Open Software Integrators} { www.osintegrators.com} {@osintegrators}
Andrew C. Oliver
• Programming since I was about 8
• Java since ~1997
• Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
o Former member Jakarta PMC
o Emeritus member of Apache Software Foundation
• Joined JBoss ~2002
• Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
• Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
o I make fanboys cry.
Open Software Integrators
• Founded Nov 2007 by Andrew C. Oliver (me)
o in Durham, NC
• Revenue and staff have at least doubled every year since 2009.
• New office (2012) in Chicago, IL
o we're hiring mid to senior level as well as UI Developers (jQuery, JavaScript, HTML, CSS)
o up to 25% travel, salary + bonus, 401k, health, etc.
o preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery
o nice to have: Hadoop, Neo4j, Couchbase, Ruby, at least one Cloud platform
Overview
• Why not just use the RDBMS for everything?
• Operational vs Analytical
• Key Value
• Column Family
• Document
• Graph
• Hadoop?
• Convergence of "clustered filesystems" and "databases"
• Conclusions
{2014 Great Wide Open | Atlanta}
Why Not "Just Use" RDBMS for Everything?
Before we begin...
• Let's handle the elephant, or rather the teddy bears, in the room:
http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html/
The CAP theorem
RDBMS CAP characteristics
• Great at consistency
• Okay at availability
• Not so great at partition tolerance...
Single process model
• Lots of servers with many connections to few servers.
Multiprocess Model
[Diagram: four Data Manager / Cluster Manager pairs]
Historical Scalability
• 10 MB disks were "big"
• Scalability meant more disks, controllers and possibly CPUs
• CPUs went from 4.77 MHz to 3.4 GHz
• Disks went from 64 kps at 70 ms to 6 Gb/s
• Network speeds went from under 4 Mb/s to gigabit to bonded gigabit and beyond.
• Disk speeds for a long time didn't keep up with CPU...
The Mathematical model
• RDBMS is based on "Relational Algebra" which is just an extension of basic "set theory"
• Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (friend of a friend, FOAF)
• Sometimes relationships are as important as the data
• Sometimes data is even simpler than the relational model but needs higher levels of availability, etc.
• One size never really did fit all
Data Complexity
Datarrhea
• Yes I've already registered that ;-)
• The cheapness of storing data has yielded more demand
o economics predicted this
• Moore's law ended while you slept
o Intel says next year (but when did CPU speeds last double?)
• Massive parallelization is the most feasible way to get at it (counter trended with an explosion in disk speeds)
...but
• If
o your data is tabular;
o fits cleanly in a relational model;
o you aren't having scalability issues;
o you don't have a large dataset; or
o a dataset/problem that lends itself to massive parallelization...
• you can probably stick with your RDBMS for now
o ...and probably aren't at this conference anyhow.
JPA/RDBMS Tables Example

PersonID | Firstname | Lastname | CompanyID
2 | Andy | Oliver | 3

CompanyID | Name | City | State
3 | Open Software Integrators | Durham | NC

PhoneNumber | Type | PersonID
919.627.1236 | google | 2
919.321.0119 | work | 2
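To make the three tables above concrete, here is a minimal sketch that loads the same rows into an in-memory SQLite database and joins them back together. The slide mentions JPA (Java); Python's built-in sqlite3 module is used here only as a stand-in, and the table and column names (person, company, phone, and so on) are assumptions for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT);
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
    company_id INTEGER REFERENCES company(company_id));
CREATE TABLE phone (
    phone_number TEXT PRIMARY KEY, type TEXT,
    person_id INTEGER REFERENCES person(person_id));
INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC');
INSERT INTO person VALUES (2, 'Andy', 'Oliver', 3);
INSERT INTO phone VALUES ('919.627.1236', 'google', 2);
INSERT INTO phone VALUES ('919.321.0119', 'work', 2);
""")

# Reassembling one "person" means joining all three tables.
rows = conn.execute("""
SELECT p.firstname, p.lastname, c.name, ph.phone_number, ph.type
FROM person p
JOIN company c ON c.company_id = p.company_id
JOIN phone ph ON ph.person_id = p.person_id
ORDER BY ph.type
""").fetchall()

for row in rows:
    print(row)
```

The point of the sketch is the shape of the data: one logical person is spread across three tables and needs two joins to read back.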
Operational vs Analytical
• One DB type is unlikely to be well suited for all of your problems.
• The system doing "short and sweet" "lightweight" transactions is your operational system.
• The system doing long running reports and generating charts and graphs and statistics is your analytical system.
• There is also search. There are recommendation engines, etc.
Other Types of Databases
Key-Value Stores
• Examples: Couchbase 1.8, Cassandra
o also: Gemfire, Infinispan (distributed caches)
• Constant Time O(1) - Lookup by key
• Good enough for "right now" stock quotes
• Usually combined with an index for search, but the structure isn't inherently indexed.
• Generally works well with Map Reduce.
• Extremely scalable, easy to partition
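As a rough illustration of why key-value stores are easy to partition, the sketch below hashes each key to pick one of several in-memory partitions. It is a toy under stated assumptions, not how any of the products above are implemented: the class name PartitionedKVStore, the partition count, and the "quote:ACME" key are all hypothetical.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store: each key hashes to exactly one partition."""

    def __init__(self, num_partitions=4):
        # Each dict stands in for a separate server or shard.
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Stable hash of the key decides which partition owns it.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key, default=None):
        # Average-case O(1): one hash, one dict lookup, no scan.
        return self.partitions[self._partition_for(key)].get(key, default)


store = PartitionedKVStore()
store.put("quote:ACME", {"price": 101.25})
store.put("quote:ACME", {"price": 101.40})  # newer quote overwrites the old one
latest = store.get("quote:ACME")
print(latest)
```

Note that there is no way to ask "which keys have a price above 100" without scanning every partition, which is why these stores are usually paired with a separate index for search.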
Column Family / Big Table
• Many key-value stores support "column families"
o Cassandra
• Some were designed this way
o HBase
• Keys and values become composite
• essentially a hashmap with a multi-dimensional array
o each column is a row of data
• map-reduce friendly
• Stock quote with time ranges
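The "hashmap with a multi-dimensional array" idea can be sketched with nested dictionaries: row key, then column family, then column qualifier. This is only a mental model, not the HBase or Cassandra API; the put and scan_range helpers, the "ACME" row key, and the timestamps are hypothetical. Real column stores keep columns sorted on disk, whereas this sketch sorts on read.

```python
from collections import defaultdict

# row key -> column family -> {column qualifier: value}
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    table[row_key][family][qualifier] = value


def scan_range(row_key, family, start, end):
    """Return (qualifier, value) pairs with start <= qualifier < end, in order."""
    columns = table[row_key][family]
    return [(q, columns[q]) for q in sorted(columns) if start <= q < end]


# One row per ticker; each timestamped quote becomes its own column.
put("ACME", "price", "2014-04-02T09:30", 101.25)
put("ACME", "price", "2014-04-02T09:31", 101.40)
put("ACME", "price", "2014-04-02T09:32", 101.10)
put("ACME", "info", "name", "Acme Corp")

morning = scan_range("ACME", "price", "2014-04-02T09:30", "2014-04-02T09:32")
print(morning)
```

Using the timestamp as the column qualifier is what makes the "stock quote with time ranges" case a range scan within a single row.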
HBase Example

Row key | First name | Last name | Company | City | State | Phone number | Phone type
5bfbd4a0-d02a-11e1-9b23-0800200c9a66 | Andy | Oliver | Open Software Integrators | Durham | NC | 919-627-1236 |
7b2435c0-d02a-11e1-9b23-0800200c9a66 | Andy | Oliver | Open Software Integrators | Durham | NC | 919-321-0119 | work
Document databases
• Many developers think these are the "holy grail" since they fit nicely with object-oriented programming.
• Couchbase 2.0, CouchDB, MongoDB
• JSON documents
• One way to think of this is a Key-Value store that understands the values.
• Not as map-reduce friendly, larger datasets require indexes.
• Clear fit for REST services and as an operational store
• JSON document:
{
  "firstname" : "Andy",
  "lastname" : "Oliver",
  "company" : "Open Software Integrators",
  "location" : { "city" : "Durham", "state" : "NC" },
  "phone" : [
    { "number" : "123 456 7890", "type" : "mobile" },
    { "number" : "123 654 1234", "type" : "work" }
  ]
}
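A small sketch of what "a key-value store that understands the values" can mean in practice: the document is stored under a key, but the store can also look inside it. The find_by_path helper and the "person::andy" key are hypothetical, and a real document database would use an index rather than the full scan shown here.

```python
import json

doc_json = """
{
  "firstname": "Andy",
  "lastname": "Oliver",
  "company": "Open Software Integrators",
  "location": {"city": "Durham", "state": "NC"},
  "phone": [
    {"number": "123 456 7890", "type": "mobile"},
    {"number": "123 654 1234", "type": "work"}
  ]
}
"""

# Documents are still stored by key, as in a key-value store.
documents = {"person::andy": json.loads(doc_json)}


def find_by_path(docs, path, expected):
    """Return keys of documents whose nested field at `path` equals `expected`."""
    matches = []
    for key, doc in docs.items():
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node == expected:
            matches.append(key)
    return matches


in_durham = find_by_path(documents, "location.city", "Durham")
work_numbers = [
    p["number"] for p in documents["person::andy"]["phone"] if p["type"] == "work"
]
print(in_durham, work_numbers)
```

The whole person, including the nested location and the list of phones, comes back in one read with no joins, which is why this shape maps so directly onto objects and REST payloads.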
Graph Databases
• Based on Graph Theory
• Less about volume of the data and more about complexity
• Many are transactional
o often the transactions are "more correct" than those offered by a relational database.
• FOAF, direct path operations are easy
o very complicated/inefficient in RDBMS
• Usually paired with an index for search
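To show why FOAF and direct path queries are natural on a graph, here is a minimal sketch using a plain adjacency map and breadth-first search. It is not the Neo4j API; the knows graph and the names in it are made up for illustration.

```python
from collections import deque

# Hypothetical "knows" relationships, stored as an adjacency map.
knows = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph[person]
    foaf = set()
    for friend in direct:
        foaf |= graph[friend]
    return foaf - direct - {person}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(friends_of_friends(knows, "alice"))
print(shortest_path(knows, "alice", "erin"))
```

Each hop here is a direct neighbor lookup. In a relational schema the same questions typically become one self-join per hop, and the number of hops is not known in advance for a path query.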
Design: RDBMS vs Graph
Neo4j Graph Example
[Diagram. Nodes:]
• Firstname: Andrew, Lastname: Oliver
• Company: Open Software Integrators
• City: Durham, State: NC
• City: Chicago, State: IL
• Phone Number: 919.627.1236, Type: googlevoice
• Phone Number: 919.321.0119, Type: work
[Relationship labels: HAS, LOCATED, FOUNDED, WORKS FOR, RESIDES]
Note the extra relationships and details here - graph databases are just fun and easy to understand.
Where does Hadoop fit?
• NoSQL
• Software Framework (lots of pieces/lots of choices):
o Pig - scripting language used to quickly write MapReduce code to handle unstructured sources
o Hive - facilitates structure for the data
o HCatalog - provides inter-operability between these internal systems
o HBase - Bigtable-type database
o HDFS - Hadoop file system
• Excellent choice for data processing and data analysis
• MapReduce
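For readers who have not seen MapReduce, the following single-process sketch shows the three phases (map, shuffle, reduce) on a word count. It illustrates the programming model only and does not use Hadoop; the function names and the input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["graph document graph", "document key value", "graph"]

# In a real cluster each line (or file block) would be mapped on a different node.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees only its own line and each reduce call sees only its own key, both phases can be spread across many machines, which is the property the key-value and column family stores above are described as being friendly to.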
Convergence of Filesystems and Databases
• Hadoop HDFS is...a distributed filesystem
• So are Gluster, Ceph, GFS, etc.
• Hadoop can use Ceph or Gluster in place of HDFS
Other Derivatives
• Triplestores
o Apache Jena
• OODBMS / ORDBMS
o Caché
Things you may consider
• Persistence
o Asynch / Synch
• Replication
• Availability
• Transactions / Consistency
• "Locality"
• Language
• Resources
o http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software
o http://sevenweeks.org/
Conclusions
• RDBMS may not scale to your needs
• Your data may not map efficiently to tables
• Key Value Store - data by key, fast, scalable, can't handle complex data
• Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data
• Document - a good operational system, not your analytical, moderately scalable, matches OO
• Graph - great for complex data, transactional, less scalable
• Filesystems and "databases" are converging
Thank you for attending!