Red Stack Tech Ltd James Anthony Technology Director

NoSQL… a view from the top

Part 1

Contents

Introduction
Key Value Data Stores
Column Family Data Stores
Document Data Stores
Graph Data Stores
Contact Red Stack Tech

Introduction

The NoSQL bandwagon is really rolling right now, but it always strikes me that there is a lot of (mostly understandable) confusion about exactly what NoSQL is, where it’s used and how it replaces (or otherwise) the traditional RDBMS.

A lot of the press coverage of NoSQL databases seems to focus on the threat they pose to the RDBMS. Indeed, I think it’s fair to say a few obituaries have already been written… all of which I think will come back to haunt the authors, just like those who predicted the death of the mainframe in the 90s, 00s etc.

This article will explain what NoSQL is, and hopefully we’ll set a few records straight. However, don’t take this as gospel; do your own research, create your own use cases and see how NoSQL can benefit you (and I’m pretty sure it will).

So first let’s get a few things straight, and first off:

NoSQL ≠ BigData

So that’s that stated very clearly! NoSQL databases (perhaps better called Data Stores in many cases) are often linked to BigData processing, perhaps as the store for MapReduce data or as the store of the BigData itself, but the two aren’t the same thing. Indeed, one of our own use cases for NoSQL is definitively small data (in the range of a few 10s of GB); the technology just suited our use case down to the ground, but more of that later. Still, “what’s the difference between NoSQL and BigData?” is the question I’m asked the most at places like Oracle OpenWorld and the OUG.

The second thing I’d like to clear up: NoSQL isn’t a single thing!

Now this is going to be more confusing, and even as I write this I suspect someone, somewhere, is adding to the list; but right now I’ll generalise NoSQL data stores into 4 categories:

1) Key Value
2) Column Family
3) Document
4) Graph

I will discuss (briefly) each of these types to give you the grounding and allow you to see how Oracle’s offering differs from some of the other databases out there. Finally, before we start, just a quick note: I will discuss a few NoSQL products in here. This shouldn’t be considered an exhaustive list, nor does it indicate any preference on my part; it simply shows the most common ones we come across in our work. If you come from one of the other products don’t shoot me on this one, and it certainly doesn’t mean I’ve forgotten about you!

In this article I’m not going to cover the underlying technical architecture. Forgive me if you want to dive straight into ring spaces, compaction, Avro, sharding etc! I think it’s important to recognise that most NoSQL databases were created to address a perceived inability of the RDBMS to provide certain capabilities, be that multi-geography active-active configurations, very large data volumes or horizontal scale out.

Anyway, for now let’s get back to the different types of NoSQL database/data store.

Key Value Data Stores

Key Value (KV) NoSQL databases (of which the Oracle NoSQL database is one) are in some ways the easiest to discuss, so we can start with this type. KV is exactly what it says… you have a key and a value. You store “stuff” by key and you retrieve stuff by key… as simple as that! Let’s look at some code (bear with me!)

Putting Data into the NoSQL Database: NoSQLDataAccess.getDB().put(myKey, value);

Getting data out of the NoSQL Database: ValueVersion vv = NoSQLDataAccess.getDB().get(myKey);

So ignore everything apart from the put(myKey, value) and get(myKey) calls; for now the rest isn’t as relevant. See how simple that is? You store stuff with a key and you get stuff with a key! Ok, it’s got to be more complex than that? Well obviously, otherwise I could teach my 5 year old to do it, so let’s deal with the obvious questions first:

Question 1: What does the key look like? For those of us from an RDBMS background, the key isn’t like a primary key: it’s probably not a single value. A key is much more likely to be a multipart string, perhaps with both a primary and a secondary component, and we refer to this as the Key Path. Let’s take a simple example first, as we’ll explore this type specifically when we look at the Oracle NoSQL Database.

/stocktick/symbol/time

So the key is firstly a string, and contains multiple parts. The first part identifies the key as a stock tick. This isn’t actually anything other than an identifier: it allows us to identify the data type we want when we’re using the Data Store for multiple different types of data, so if I stored session data in there my key path might look like /sessiondata/sessionid. The next two entries denote the stock symbol (ORCL for instance) and the time at which that tick occurred.
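To make the key path idea concrete, here is a minimal sketch in plain Java. It is not the Oracle NoSQL API: the class KeyPathSketch, its keyPath and scan helpers, the sample values and the session id abc123 are all made up for illustration, and an in-memory sorted map stands in for the data store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a key value store (not the Oracle NoSQL API).
class KeyPathSketch {
    // A sorted map keeps keys that share a prefix next to each other.
    private final TreeMap<String, String> store = new TreeMap<>();

    // Join the parts into a key path, e.g. keyPath("stocktick", "ORCL", "093000").
    static String keyPath(String... parts) {
        return "/" + String.join("/", parts);
    }

    void put(String key, String value) {
        store.put(key, value);
    }

    // Returns null when the key is absent.
    String get(String key) {
        return store.get(key);
    }

    // Values of every key that starts with the prefix, in key order.
    List<String> scan(String prefix) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> entry : store.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break; // past the last key with this prefix
            }
            values.add(entry.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        KeyPathSketch kv = new KeyPathSketch();
        kv.put(keyPath("stocktick", "ORCL", "093000"), "bid=40.10");
        kv.put(keyPath("stocktick", "ORCL", "093001"), "bid=40.12");
        kv.put(keyPath("sessiondata", "abc123"), "user=james");
        System.out.println(kv.get("/sessiondata/abc123"));
        System.out.println(kv.scan("/stocktick/ORCL/"));
    }
}
```

The leading part of the path keeps stock ticks and session data apart in the same store, and the trailing parts let us fetch either one tick or every tick for a symbol.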

Question 2: What’s the value? This might sound like an obvious question, but actually it’s very relevant. The value is whatever you want it to be, and as simple or as complex as you care to make it. It could be a simple String, but more likely it’s going to be an object of some type, perhaps the serialisation of internet session data or a JSON document containing a lot more information in a structured format.

Taking my previous example of a stock tick, the value might look like

Value: {
  "name" : "TickData",
  "namespace" : "com.companyX.stockticker.avro",
  "type" : "record",
  "fields": [
    {"name": "currBid", "type": "double", "default": 0.0},
    {"name": "currAsk", "type": "double", "default": 0.0},
    {"name": "currVolume", "type": "long", "default": 0}
  ]
}

Again, focus on the record name and its fields and ignore the rest. We have a value defined that is called TickData and contains a record with three fields: bid, ask and volume. The value isn’t just a simple string but allows for complexity.
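As a rough illustration of a structured value, the sketch below mirrors the three fields of the TickData record in a plain Java class and turns it into the bytes a key value store would hold. The fixed-width encoding used here is an assumption chosen for brevity; it is not Avro's binary format and not what any particular product does.

```java
import java.nio.ByteBuffer;

// Hypothetical value class mirroring the TickData record; the encoding is NOT Avro.
class TickData {
    final double currBid;
    final double currAsk;
    final long currVolume;

    TickData(double currBid, double currAsk, long currVolume) {
        this.currBid = currBid;
        this.currAsk = currAsk;
        this.currVolume = currVolume;
    }

    // Serialise the three fields, in declaration order, into a fixed 24-byte value.
    byte[] toBytes() {
        return ByteBuffer.allocate(2 * Double.BYTES + Long.BYTES)
                .putDouble(currBid)
                .putDouble(currAsk)
                .putLong(currVolume)
                .array();
    }

    // Rebuild the object from bytes written by toBytes().
    static TickData fromBytes(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        return new TickData(in.getDouble(), in.getDouble(), in.getLong());
    }

    public static void main(String[] args) {
        TickData tick = new TickData(40.10, 40.12, 1500L);
        TickData copy = TickData.fromBytes(tick.toBytes());
        System.out.println(copy.currBid + " / " + copy.currAsk + " / " + copy.currVolume);
    }
}
```

The store itself only sees the key and an opaque value; it is the application that agrees on what the bytes mean.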

Column Family Data Stores

Perhaps the most “famous” type of NoSQL database (although MongoDB, from our next category, has to be up there) comes in the form of HBase, Google BigTable and Cassandra, and it is certainly the type most synonymous with BigData. Column family data stores (not to be confused with columnar store databases!) are formally defined as a “sparse, distributed, persistent, multi-dimensional sorted map”… and I’m pretty sure that makes it entirely clear for everyone and I can leave it at that? No? Alright then, let’s try and explain….

I generally find when trying to explain this type of data store within Red Stack Tech it’s best to use an example, so let’s for one moment imagine we are a newly formed company looking to index web pages and provide a brand new web search facility at blazing speed, let’s just call ourselves Goggle!

What we need to do is index data in rows and columns but we also want to add an extra dimension in time (because web pages and their embedded links change over time).

So now we have the following: Index : (row, column, time)

And we’ve also crossed off the first part of that fairly long-winded formal definition, “sparse, distributed, persistent, multi-dimensional sorted map”, as we now know where our multi-dimensional comes from.

Let’s look at what a row might look like... and this time I’ll use an example

For now ignore the way the URL has been represented; I’ll explain that shortly. Notice how we allow multiple versions of the data for a URL to be stored (multi-dimensional). Think also how each URL is going to have a totally different number of embedded links that we want to store in columns, so each row can have an arbitrary number of columns; this forms the sparse part of our definition.

Figure 1
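The (row, column, time) index can be sketched as nested sorted maps. This is only an in-memory model in plain Java, not HBase, BigTable or Cassandra code; the class name, the column names (contents, anchor:...) and the sample values are assumptions made for the example.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the (row, column, time) index.
class SparseSortedMapSketch {
    // row key -> column name -> timestamp -> value, each level kept sorted
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Rows and columns are created on first write, so a row only holds the columns it needs.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>())
            .put(timestamp, value);
    }

    // Newest version of a cell at or before the timestamp, or null if there is none.
    String getAsOf(String row, String column, long timestamp) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, String> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    String getLatest(String row, String column) {
        return getAsOf(row, column, Long.MAX_VALUE);
    }

    int columnCount(String row) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        return columns == null ? 0 : columns.size();
    }

    public static void main(String[] args) {
        SparseSortedMapSketch index = new SparseSortedMapSketch();
        index.put("com.e-dba.www", "contents", 1L, "<html>v1</html>");
        index.put("com.e-dba.www", "contents", 2L, "<html>v2</html>");
        index.put("com.e-dba.www", "anchor:blogs.e-dba.com", 2L, "Blogs");
        index.put("com.e-dba.blogs", "contents", 2L, "<html>blog</html>");
        System.out.println(index.getLatest("com.e-dba.www", "contents"));
        System.out.println(index.getAsOf("com.e-dba.www", "contents", 1L));
        System.out.println(index.columnCount("com.e-dba.www") + " vs " + index.columnCount("com.e-dba.blogs"));
    }
}
```

Each row carries only the columns that were written to it, which is the sparse part, and each cell keeps its older versions, which is the time dimension.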

What next? Well, we can deal with the distributed and persistent parts in a single explanation. Databases such as HBase use an underlying data store, HDFS (Hadoop Distributed File System), to persist the data, and they actually persist to immutable files in this case (changes are dealt with by creating new copies). These file systems are designed not just to persist the data but also to distribute it across multiple locations, both for data protection and to allow processing to be moved to the locality of the data and parallelised across many nodes.

Ok, so at this point we have covered the sparse, distributed, persistent and multi-dimensional parts of our “sparse, distributed, persistent, multi-dimensional sorted map”. Hopefully from the above image you can see the map portion too, and I promised to come back to the way the URL was portrayed. This is a good example of why sorting and locality work, and the reason I left this until after the discussion of distribution.

Let me illustrate… let’s say I’m indexing www.e-dba.com as in the example above. The e-DBA Tech domain has a bunch of subdomains hanging off it, not just www; perhaps something like this:

blogs.e-dba.com
demos.e-dba.com
mobile.e-dba.com
www.e-dba.com
etc…

Now add into the mix the millions (billions?) of other domains out there. If I don’t reverse the order and just work off the lexicographical order of the host names, then re-indexing the whole of e-dba.com and all its subdomains means I’ve got a lot of places to store this in. I’ve also got a lot of places (and by places I mean disk locations or different servers) to hit to service a search on all of the e-dba.com domains. Reversing the order gives:

com.e-dba.blogs
com.e-dba.demos
com.e-dba.mobile
com.e-dba.www

This means that my key values will all be stored in the same location providing greater locality of data and allowing me to record and retrieve results much faster. So there we have an example of sorting and we’ve finally covered the definition! Phew!
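The effect of reversing the host name is easy to check with a few lines of plain Java. This is a sketch only: the class ReversedDomainKeys and the example.org hosts are invented for illustration, and a simple sort stands in for the sorted storage of a real column family store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper showing why reversed host names sort next to each other.
class ReversedDomainKeys {
    // "www.e-dba.com" becomes "com.e-dba.www".
    static String reverse(String host) {
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    // Sort the hosts lexicographically, optionally reversing each one into a row key first.
    static List<String> sortedKeys(List<String> hosts, boolean reversed) {
        List<String> keys = new ArrayList<>();
        for (String host : hosts) {
            keys.add(reversed ? reverse(host) : host);
        }
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "www.e-dba.com", "demos.example.org", "blogs.e-dba.com",
                "www.example.org", "mobile.e-dba.com", "demos.e-dba.com");
        System.out.println(sortedKeys(hosts, false)); // e-dba.com hosts interleaved with others
        System.out.println(sortedKeys(hosts, true));  // e-dba.com hosts grouped together
    }
}
```

With plain host names the e-dba.com entries are scattered among everyone else's; with reversed keys they form one contiguous run that a single range scan can cover.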

Finally, before we leave Column family data stores, a couple of points. Firstly, lots of other stuff happens inside these databases, such as compression (allowing faster scanning), MapReduce integration etc., but in general this form of database offers limited queries, with no joins etc. (although mechanisms exist to work around this). If you expect to just fire up one of these and have the same sort of analytics as your Oracle RDBMS you’ll probably be disappointed. As always though, that’s not to say things aren’t changing rapidly, or that it won’t fit your needs.

Document Data Stores

Recently we’ve seen a huge rise in the number of people exploring Document data stores, and in particular MongoDB. Much of this is fuelled by developers, who see the product as a very attractive, developer-led solution. Document databases such as MongoDB and CouchDB are actually among the easiest to explain, in that they are similar to Key Value stores but the value is always a document, most often in the JSON or BSON format.

JSON document storage is a huge benefit for many developers in that it offers a “schema-less” design. This isn’t to say there is no structure to the data, quite the opposite, but rather that the schema is flexible and can be modified by the developer. Again, perhaps a small example illustrates best. Let’s start with a simple JSON document for storing customer information.

{
  "firstName": "James",
  "lastName": "Anthony",
  "age": 38,
  "address": {
    "streetAddress": "Farr House",
    "city": "Chelmsford",
    "county": "Essex",
    "postCode": "CM1 1QS"
  }
}

The first thing you’ll probably realise from this, if you’re from an RDBMS background, is that it’s very much denormalised. That’s a key thing to remember: document stores are typically denormalised, with the document providing all of the data about the entity you’re interested in. Clearly this has advantages and disadvantages, and we’ll talk about some of these shortly.

Now let’s say the developer has stored this information, but then the application scope changes and we also want to capture phone number information. In JSON based development that’s easy: we just change the document... no underlying “fixed” tables/columns to deal with...

{
  "firstName": "James",
  "lastName": "Anthony",
  "age": 38,
  "address": {
    "streetAddress": "Farr House",
    "city": "Chelmsford",
    "county": "Essex",
    "postCode": "CM1 1QS"
  },
  "phoneNumber": [
    {
      "type": "work",
      "number": "01245200510"
    }
  ]
}

Extending this further, one of our other customers provides two phone numbers, and we store these as separate entries in the phoneNumber array, each with its own type...

{
  "firstName": "Alex",
  "lastName": "Louth",
  "age": 37,
  "address": {
    "streetAddress": "Farr House",
    "city": "Chelmsford",
    "county": "Essex",
    "postCode": "CM1 1QS"
  },
  "phoneNumber": [
    {
      "type": "work",
      "number": "01245200510"
    },
    {
      "type": "mobile",
      "number": "0777 111 222"
    }
  ]
}

Hopefully you can see how this flexibility is something developers love: no need to keep going back to the design phase, no need to get DBAs to modify the structure and no ORM layers to deal with. Indeed, so popular is this model that at OOW2013 I attended a great session showing the upcoming JSON storage facilities within the Oracle 12c database, which will provide exactly this sort of functionality but with all the benefits of the RDBMS behind it and access to the data through both SQL and RESTful services. Personally I think this will change the game somewhat: the “flexibility” and developer-led drive for document databases will be more of a level playing field between the Oracle RDBMS and NoSQL, with Oracle offering all of the functionality, plus arguably more.
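The “just change the document” workflow might be sketched as follows, with each document held as a nested map in memory. This is plain Java for illustration only, not the MongoDB or CouchDB API; the class DocumentStoreSketch, its methods and the id cust-1 are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory document store: each document is a free-form nested map.
class DocumentStoreSketch {
    private final Map<String, Map<String, Object>> documents = new LinkedHashMap<>();

    void save(String id, Map<String, Object> document) {
        documents.put(id, document);
    }

    // Returns null when no document has this id.
    Map<String, Object> find(String id) {
        return documents.get(id);
    }

    // Add a phone number, creating the phoneNumber array the first time it is needed.
    @SuppressWarnings("unchecked")
    void addPhone(String id, String type, String number) {
        Map<String, Object> document = documents.get(id);
        if (document == null) {
            throw new IllegalArgumentException("No document with id " + id);
        }
        List<Map<String, Object>> phones = (List<Map<String, Object>>) document
                .computeIfAbsent("phoneNumber", key -> new ArrayList<Map<String, Object>>());
        Map<String, Object> phone = new LinkedHashMap<>();
        phone.put("type", type);
        phone.put("number", number);
        phones.add(phone);
    }

    public static void main(String[] args) {
        DocumentStoreSketch store = new DocumentStoreSketch();
        Map<String, Object> customer = new LinkedHashMap<>();
        customer.put("firstName", "James");
        customer.put("lastName", "Anthony");
        store.save("cust-1", customer);

        // The scope changes: no table to alter, the document simply gains a field.
        store.addPhone("cust-1", "work", "01245200510");
        System.out.println(store.find("cust-1"));
    }
}
```

Documents saved before the change simply lack the phoneNumber field, so the application has to cope with both shapes; that is the price of not having a fixed schema.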

So what are some of the drawbacks of the JSON model? Well, denormalisation clearly increases the storage requirements, and you don’t (easily) get the ability to do other things such as scan for all customers with a given record type (and clearly you’d have to retrieve a LOT of data to do so). MongoDB and others now provide secondary indexing to address some of these issues, but it doesn’t take much to realise that this denormalised, read-everything-about-an-entity model is somewhat at odds with other databases, such as columnar storage databases (and the Oracle InMemory database coming in 12c), which show how reading individual columns when performing analytics provides massive performance gains.
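The difference between scanning every document and using a secondary index can be sketched like this, using a lookup by city as a stand-in for any attribute. It is a plain Java illustration with made-up names (SecondaryIndexSketch, idsByCity), not how MongoDB or any other product implements its indexes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical store comparing a full scan with a hand-maintained secondary index.
class SecondaryIndexSketch {
    private final Map<String, Map<String, Object>> documents = new LinkedHashMap<>();
    // city -> ids of the documents with that city, updated on every save
    private final Map<String, TreeSet<String>> idsByCity = new HashMap<>();

    void save(String id, String lastName, String city) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("lastName", lastName);
        document.put("city", city);
        Map<String, Object> previous = documents.put(id, document);
        if (previous != null) {
            idsByCity.get(previous.get("city")).remove(id); // drop the stale index entry
        }
        idsByCity.computeIfAbsent(city, c -> new TreeSet<String>()).add(id);
    }

    // Without an index: read every document and test the field.
    List<String> findByCityScan(String city) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : documents.entrySet()) {
            if (city.equals(entry.getValue().get("city"))) {
                ids.add(entry.getKey());
            }
        }
        Collections.sort(ids);
        return ids;
    }

    // With the index: one lookup, no documents read.
    List<String> findByCityIndexed(String city) {
        TreeSet<String> ids = idsByCity.get(city);
        return ids == null ? new ArrayList<String>() : new ArrayList<String>(ids);
    }

    public static void main(String[] args) {
        SecondaryIndexSketch store = new SecondaryIndexSketch();
        store.save("cust-1", "Anthony", "Chelmsford");
        store.save("cust-2", "Louth", "Chelmsford");
        store.save("cust-3", "Smith", "Brisbane");
        System.out.println(store.findByCityScan("Chelmsford"));
        System.out.println(store.findByCityIndexed("Chelmsford"));
    }
}
```

Both methods return the same ids, but the scan touches every document while the index touches none; the cost moves to write time, where every save has to keep the index in step.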

Graph Data Stores

Graph databases are the final type of NoSQL database I’ll cover, and probably the most niche. Having said that these are niche, they are becoming more prevalent with people now using Facebook Graph search and the release of the RDF Graph for Oracle NoSQL Database! So what is a Graph data model and how does it differ?

Graph databases are all about the relationships between entities rather than the entities themselves, and schemas evolve by adding new relationships. At this point you’re probably thinking “but my RDBMS does this with Foreign Keys”. Well sort of yes, but it is much more about the type of questions you ask of the database.

Graph databases support query and discovery using Graph patterns and traversals, meaning we ask questions about reachability, connectivity, “same as” and proximity. A classic example of this might be “Who is part of this group”, and extending this out “Who is a friend of all the people within this group.”

Figure 2

The basic structure of graph storage is the triple

Figure 3

Triples are connected to form the Graph. Just like JSON, Key Value and Column Family databases, the schema doesn’t have to be defined up front and is flexible in its implementation, with new triples being added and the relationships between entities defined as we go. In Graph databases it is the “edges” (the connections) between vertices (the entities/nodes) that we are interested in using for traversal.
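A toy version of triple storage and the “friend of everyone in this group” question might look like the sketch below. It is plain Java over an in-memory list, not an RDF or graph database API; the class TripleStoreSketch, the predicate names memberOf and friendOf, and the people are all invented for the example, and friendOf is treated as a one-way edge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory triple store: each triple is one edge between two vertices.
class TripleStoreSketch {
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    private final List<Triple> triples = new ArrayList<>();

    // New relationships are just new triples; nothing is declared up front.
    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Every subject with an edge of this type pointing at the object.
    Set<String> subjectsOf(String predicate, String object) {
        Set<String> subjects = new TreeSet<>();
        for (Triple triple : triples) {
            if (triple.predicate.equals(predicate) && triple.object.equals(object)) {
                subjects.add(triple.subject);
            }
        }
        return subjects;
    }

    // "Who is a friend of all the people within this group?"
    Set<String> friendsOfAll(String group) {
        Set<String> result = null;
        for (String member : subjectsOf("memberOf", group)) {
            Set<String> friends = subjectsOf("friendOf", member);
            if (result == null) {
                result = friends;
            } else {
                result.retainAll(friends); // keep only friends shared with earlier members
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        TripleStoreSketch graph = new TripleStoreSketch();
        graph.add("alice", "memberOf", "dbas");
        graph.add("bob", "memberOf", "dbas");
        graph.add("carol", "friendOf", "alice");
        graph.add("carol", "friendOf", "bob");
        graph.add("dave", "friendOf", "alice");
        System.out.println(graph.subjectsOf("memberOf", "dbas"));
        System.out.println(graph.friendsOfAll("dbas"));
    }
}
```

A real graph database would index the edges rather than loop over a list, but the shape of the question is the same: follow one kind of edge to find the group, then another to find who connects to all of it.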

Figure 4

Figure 5

Personally I can see Graph databases becoming more popular over time, as we move toward modelling relationships between entities.

Contact Red Stack Tech for more information…

UK Headquarters:

3rd Floor Farr House 27 – 30 Railway Street Chelmsford Essex England CM1 1QS

Main: 0844 811 3600 Direct: 01245 200 510

Australia Headquarters: Suite 3 Level 19 141 Queen Street, Brisbane, QLD 4000

Main: +61 (0) 7 3210 0132

Email: [email protected] Web: www.redstk.com

Follow Red Stack Tech on Twitter: @redstacktech

Media Enquiries:

Elizabeth Spencer [email protected] 01245 200 532
