New Databases That Scale High
William El Kaim
Oct. 2016 - V 2.1
This Presentation is part of the
Enterprise Architecture Digital Codex
http://www.eacodex.com/
Copyright © William El Kaim 2016
Plan
Why the Need for Databases That Scale High?
• What is NoSQL?
• NoSQL Database Taxonomy
• From CAP to PACELC
• NoSQL Database Comparisons and Decision Tree
• NoSQL Data Modeling Techniques
• Resources
The Need For Scalability
• Today, most structured data storage is managed in a relational database.
• Relational databases enforce a set of rules to ensure that data is consistent
and to ensure transactions are atomic (they either succeeded or failed).
• With these rules in place, it becomes much harder to guarantee transaction consistency
across multiple database servers while also spreading data out to multiple nodes to
increase retrieval speed and therefore processing speed.
• This is because each participating database server must durably store the data, limiting
vertical scaling to the storage speed of the slowest server.
• While transaction consistency may be critical for some systems, when
datasets reach extreme scale, traditional databases often cannot keep up
and require alternative approaches to data storage and retrieval.
• Newer databases are offering different approaches to overcome these
limitations, as data grows beyond the reach of a single server or database
cluster.
Source: James Higginbotham
Issues with scaling up
• When the dataset is too big, “vertical” scaling (increasing the compute power of a
single machine) is not enough!
• Need to scale out and put in place “horizontal” scaling by adding more servers.
• Two main approaches for horizontal scaling (multi-node databases)
• Master/Slave
• All writes go to the master
• All reads are performed against the replicated slave databases
• Critical reads may be stale, as writes may not have been propagated down yet
• Large datasets can pose problems, as the master needs to duplicate data to the slaves
• Master is a Single Point of Failure!
• Sharding (Partitioning)
• Scales well for both “read” and “write”
• Application needs to be partition-aware, no transparency
• Can no longer have relationships/joins across partitions
• Loss of referential integrity across shards
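The partition-awareness point above can be sketched in a few lines: the application routes every read and write to a shard derived from the record's key. This is only an illustrative sketch; the node names and the choice of MD5 hashing are assumptions, not any particular product's scheme.

```python
import hashlib

# Hypothetical shard catalog: four node names, invented for illustration.
SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(key: str) -> str:
    """Route a record to a shard by hashing its partition key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The application must be partition-aware: every read and write first
# resolves the owning shard from the record's key. Note that a query
# spanning two keys on different shards cannot use a single local join.
print(shard_for("user:42"))  # deterministically one of the four node names
```

The same key always maps to the same shard, which is exactly why cross-shard relationships and referential integrity are lost: related keys may land on different nodes.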
Other Ways To Scale Out
• Multi-Master replication
• Data is written by a group of computers and can be updated by any member of the group.
• All members can respond to reads.
• The multi-master replication system is responsible for propagating the data modifications
made by each member to the rest of the group, and resolving any conflicts that might
arise between concurrent changes made by different members.
• Immutable, append only data store (Big Data)
• Do INSERT only, no UPDATE or DELETE
• Keep the full history of information instead of overwriting it
• In-memory databases
• No JOIN
• This involves de-normalizing data
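The append-only idea above can be sketched as an insert-only log, where an "update" is simply a newer immutable fact and nothing is ever overwritten. This is a toy illustration, not the behavior of any specific product.

```python
import time

# Append-only log: facts are only ever INSERTed; no UPDATE or DELETE.
log = []

def insert(key, value):
    """Append a new immutable fact with a timestamp."""
    log.append({"key": key, "value": value, "ts": time.time()})

def current_value(key):
    """The latest appended fact for a key wins; history is preserved."""
    for entry in reversed(log):
        if entry["key"] == key:
            return entry["value"]
    return None

insert("price", 10)
insert("price", 12)            # an "update" is just a newer fact
print(current_value("price"))  # 12 -- and the old value 10 stays in the log
```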
Techniques to Scale Out
Source: Felix Gessert
Plan
• Why the Need for Databases That Scale High?
• What is NoSQL?
• NoSQL Database Taxonomy
• From CAP to PACELC
• NoSQL Database Comparisons and Decision Tree
• NoSQL Data Modeling Techniques
• Resources
What is NoSQL?
• NoSQL
• Stands for Not Only SQL
• The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-based
database
• It was again re-introduced by Eric Evans when an event was organized to discuss open
source distributed databases
• Eric states that “… but the whole point of seeking alternatives is that you need to solve a
problem that relational databases are a bad fit for. …”
• Is a non-relational approach to data management that supports dynamic and
flexible schemas, storage optimized for web scale, and extreme performance, while
making semi-structured and unstructured data easier to use and access.
• Three major papers were the “seeds” of the NOSQL movement:
• BigTable (Google): Article / Web Site
• Dynamo (Amazon): Article / Web Site
• CAP Theorem
What is NoSQL?
• Class of non-relational data storage systems
• Usually do not require a fixed table schema nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore identical and fault-tolerant)
and can be partitioned
• Down nodes are easily replaced; no single point of failure
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement
Plan
• Why the Need for Databases That Scale High?
• What is NoSQL?
NoSQL Database Taxonomy
• From CAP to PACELC
• NoSQL Database Comparisons and Decision Tree
• NoSQL Data Modeling Techniques
• Resources
NoSQL Database Taxonomy
Source: Highly Scalable
Key-value Store
• Key-value store consists of a set of key-value pairs with unique keys.
• Due to this simple structure, it only supports get and put operations.
• The stored value is opaque to the database; pure key-value stores do not support
operations beyond simple CRUD (Create, Read, Update, Delete).
• Key-value stores are often referred to as schemaless
• Any assumptions about the structure of stored data are implicitly encoded in the
application logic (schema-on-read) and not explicitly defined through a data definition
language (schema-on-write).
• The obvious advantages of this data model lie in its simplicity.
• The very simple abstraction makes it easy to partition and query the data, so that the
database system can achieve low latency as well as high throughput.
• However, if an application demands more complex operations, this data model is not
powerful enough.
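The model above can be sketched in a few lines: the store sees only opaque values and offers nothing beyond CRUD. A minimal in-memory sketch, assuming no persistence or partitioning:

```python
# Minimal key-value store: the value is opaque to the store itself --
# no secondary indexes, no queries inside the value, only CRUD by key.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):          # Create / Update
        self._data[key] = value

    def get(self, key, default=None):   # Read
        return self._data.get(key, default)

    def delete(self, key):              # Delete
        self._data.pop(key, None)

store = KVStore()
store.put("user:1", b'{"name": "Ada"}')  # to the store, this is just bytes
print(store.get("user:1"))
```

Because the store never interprets the value, any structure inside it (here, JSON) exists only in the application's code: schema-on-read, exactly as the slide describes.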
Source: Felix Gessert
Key-value Store
• Properties
• Focus on scaling to huge amounts of
data
• Designed to handle massive load
• Based on Amazon’s Dynamo paper
• Examples
• (AP): DynamoDB, Riak, Voldemort
• (CP): Redis, Scalaris
Source: Felix Gessert
Wide-Column Store
• Wide-column stores inherit their name from the image that is often used to explain the underlying data model: a relational table with many sparse columns.
• Technically, however, a wide-column store is closer to a distributed multi-level sorted map:
• The first-level keys identify rows, which themselves consist of key-value pairs, and are called row keys.
• The second-level keys are called column keys.
• This storage scheme makes tables with arbitrarily many columns feasible, because there is no column key without a corresponding value.
• The set of all columns is partitioned into so-called column families to colocate columns on disk that are usually accessed together.
• On disk, wide-column stores do not colocate all data from each row, but instead values of the same column family and from the same row.
• Hence, an entity (a row) cannot be retrieved by one single lookup as in a document store, but has to be joined together from the columns of all column families.
• However, this storage layout usually enables highly efficient data compression and makes retrieving only a portion of an entity very efficient.
• The data are stored in lexicographic order of their keys, so that data that are accessed together are physically co-located, given a careful key design.
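The multi-level sorted map described above can be sketched with nested dictionaries. This is a toy illustration only; the row key, family, and column names are invented, and real systems add timestamps, sorting on disk, and distribution.

```python
# Two-level map sketch: row_key -> column_family -> column_key -> value.
table = {}

def put(row_key, family, column, value):
    """Write one cell; absent columns simply don't exist (sparse rows)."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get_row(row_key):
    """Reassemble an entity by joining the cells of all column families."""
    families = table.get(row_key, {})
    return {f"{fam}:{col}": v
            for fam, cols in families.items()
            for col, v in sorted(cols.items())}

put("user#42", "profile", "name", "Ada")
put("user#42", "profile", "email", "ada@example.com")
put("user#42", "stats", "logins", 17)
print(get_row("user#42"))
```

Note how `get_row` has to visit every column family to rebuild the entity, while reading just the `stats` family alone would be cheap: the trade-off the slide describes.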
Source: Felix Gessert
Wide-Column Store
• Properties
• Also called extensible record stores
• Store data in records with an ability to hold very large numbers of dynamic columns
• Name and format of the columns can vary from row to row in the same table
• Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.
• Examples
• (AP): Apache Cassandra
• (CP): Apache HBase, Apache Accumulo, Google Bigtable, Hypertable
• (AC): Vertica
Source: Felix Gessert
Document Store
• Definition
• A document store is a key-value store that restricts values to semi-structured formats such as JSON documents.
• Properties
• This restriction, in comparison to key-value stores, brings great flexibility in accessing the data. It is not only possible to fetch an entire document by its ID, but also to retrieve only parts of a document, and to execute queries like aggregation, query-by-example or even full-text search.
• Can model more complex objects
• Data model: collection of documents
• Document: JSON, XML, other semi-structured formats.
• Examples
• (AP): Apache CouchDB, Riak, SimpleDB, Couchbase
• (CP): MongoDB
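The extra flexibility over a key-value store can be sketched with plain dicts standing in for JSON documents. The helper names `find` and `project` are invented for illustration; they mimic query-by-example and partial retrieval, not any specific product's API.

```python
# Documents are semi-structured; unlike a pure key-value store,
# the database can look inside the stored value.
docs = {
    "u1": {"name": "Ada", "city": "London", "tags": ["math"]},
    "u2": {"name": "Alan", "city": "London", "tags": ["crypto"]},
}

def find(collection, example):
    """Query-by-example: return documents matching all given fields."""
    return [doc for doc in collection.values()
            if all(doc.get(k) == v for k, v in example.items())]

def project(doc, fields):
    """Retrieve only part of a document instead of the whole value."""
    return {k: doc[k] for k in fields if k in doc}

print(find(docs, {"city": "London"}))  # matches both documents
print(project(docs["u1"], ["name"]))   # {'name': 'Ada'}
```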
Source: Felix Gessert
Graph Databases
• Properties
• Focus on modeling the structure of data
(interconnectivity)
• Inspired by mathematical Graph Theory
• Nodes and edges and key-value pairs on
both
• Nodes may have properties (including ID)
• Edges may have labels or roles
• Examples
• Apache Hama, FlockDB, InfoGrid, Neo4j,
Pregel, Titan
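The property-graph model above (nodes and edges, both carrying key-value pairs, edges with labels) can be sketched with plain structures. All names and values here are invented for illustration.

```python
# Nodes have an ID and properties; edges have a label/role plus properties.
nodes = {
    "n1": {"name": "Ada"},
    "n2": {"name": "Alan"},
}
edges = [
    {"from": "n1", "to": "n2", "label": "KNOWS", "since": 1936},
]

def neighbors(node_id, label=None):
    """Traverse outgoing edges, optionally filtered by edge label."""
    return [e["to"] for e in edges
            if e["from"] == node_id and (label is None or e["label"] == label)]

print(neighbors("n1", "KNOWS"))  # ['n2']
```

Graph databases make such traversals first-class operations, whereas a relational model would express the same hop as a self-join over a link table.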
Source: Felix Gessert
Plan
• Why the Need for Databases That Scale High?
• What is NoSQL?
• NoSQL Database Taxonomy
From CAP to PACELC
• NoSQL Database Comparisons and Decision Tree
• NoSQL Data Modeling Techniques
• Resources
Brewer’s CAP Theorem
• For any system sharing data (or multi-node database), it is “impossible” to
guarantee all three of these properties simultaneously:
• Consistency: all copies have the same value
1. Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
• Atomicity: either the whole process is done or none of it is
• Consistency: only valid data are written
• Isolation: one operation at a time
• Durability: once committed, it stays that way
2. Weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
• Availability: reads and writes always succeed
• Partition tolerance: system properties (consistency and/or availability) hold even when network failures prevent some machines from communicating with others
• But to scale out, you need to partition!
• That leaves either consistency or availability to choose from. In almost all cases,
availability is chosen over consistency!
BASE vs. ACID
• Rise of the BASE (Basically Available, Soft state, Eventual consistency)
model
• Basically Available - system seems to work all the time
• Soft State - it doesn't have to be consistent all the time
• Eventually Consistent - becomes consistent at some later time
• In other words:
• When no updates occur for a long period of time, eventually all updates will propagate
through the system and all the nodes will be consistent.
• For a given accepted update and a given node, eventually either the update reaches the
node or the node is removed from service.
• Google, Yahoo, Facebook, Amazon, eBay all adopted CAP and BASE
principles!
• Read the blog post from the CTO of Amazon to learn more on the subject
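The "eventually all nodes will be consistent" behavior can be simulated in a few lines. This is a toy model under strong assumptions (one accepting replica, a single pending queue, no conflicts), purely to make the stale-then-converged sequence concrete.

```python
# Toy eventual consistency: a write lands on one replica and propagates
# asynchronously; until propagation finishes, reads may return stale data.
replicas = [{"x": 0}, {"x": 0}, {"x": 0}]
pending = []  # updates accepted but not yet propagated

def write(key, value):
    replicas[0][key] = value      # basically available: accepted immediately
    pending.append((key, value))  # soft state: to be gossiped later

def anti_entropy():
    """Propagate pending updates; once drained, all replicas agree."""
    while pending:
        key, value = pending.pop()
        for r in replicas:
            r[key] = value

write("x", 1)
print([r["x"] for r in replicas])  # [1, 0, 0] -- stale reads are possible
anti_entropy()
print([r["x"] for r in replicas])  # [1, 1, 1] -- eventually consistent
```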
Another NoSQL Taxonomy
CAP Based Taxonomy
1. Relational databases choose
Consistency and Availability (CA),
ensuring writes are consistent
and immediately available across
all instances.
2. Many new database vendors are
opting for Availability and Partition
Tolerance (AP), accepting
new/updated records without
immediate confirmation (aka:
“eventually consistent”).
3. Other database vendors are
opting for Consistency and
Partition Tolerance (CP), allowing
arbitrary loss of messages to
some instances, while the system
continues to be available.
Source: Nathan Hurst
Criticisms of CAP Theorem
• The first confusion is about the existence of CA systems, which pretend that
partition tolerance is optional, or claim that partitions don't happen.
• In reality, you can't sacrifice partition tolerance, because partitions happen in real large-scale
systems all the time.
• The second misconception is that the CAP theorem means you can't be
consistent and available during partitions. That's not true.
• Specifically, the CAP theorem only prevents everybody from being consistent and available,
not anybody (some literature calls this always available). It doesn't prevent clients and replicas
on the majority side of simple partitions from making progress, and experiencing both
consistency and availability.
• The third misconception is that the consistency in CAP is all or nothing, and that
you can't offer any consistency guarantees at all during partitions. In reality, many
very useful consistency models can be offered on all sides of a partition.
• Implementation tricks like session stickiness and client-side caching can allow systems to offer
useful models like read your writes, monotonic reads and even causal consistency. Bernstein
and Das, and Bailis et al have good overviews of some of the possibilities.
Source: Marc Brooker
Criticisms of CAP Theorem
• The fourth misconception is that eventual consistency is all about CAP, and
that everybody would choose strong consistency for every application if it
wasn't for the CAP theorem.
• Another source of confusion is the different versions of the CAP theorem,
from Brewer's original version, to his later writings to Gilbert and Lynch's
proof.
• Some seem to call the former Brewer's Conjecture and the latter the CAP theorem, but
this usage is far from universal. Typically, they're both just called CAP or the CAP
theorem.
Source: Marc Brooker
PACELC: An alternative CAP formulation
• Daniel Abadi's Consistency Tradeoffs in Modern Distributed Database
System Design proposes an alternative: PACELC.
• Idea: Classify systems according to their behavior during network partitions
• if there is a partition (P), how does the system trade off availability and consistency (A
and C); else (E), when the system is running normally in the absence of partitions, how
does the system trade off latency (L) and consistency (C)?
Source: Felix Gessert
Plan
• Why the Need for Databases That Scale High?
• What is NoSQL?
• NoSQL Database Taxonomy
• From CAP to PACELC
NoSQL Database Comparisons and Decision Tree
• NoSQL Data Modeling Techniques
• Resources
NoSQL Comparisons
Source: Ben Scofield
Source: PwC
Form of Data Normalization
Source: Lost … sorry
Database Evolution History
Source: Robin Purohit
NoSQL Database Evolution History
Source: Felix Gessert
2016 Forrester Waves
Source: Forrester
2015 Magic Quadrant for Operational Database
Management Systems
NoSQL Decision Tree
Source: Felix Gessert
Plan
• Why the Need for Databases That Scale High?
• What is NoSQL?
• NoSQL Database Taxonomy
• From CAP to PACELC
• NoSQL Database Comparisons and Decision Tree
NoSQL Data Modeling Techniques
• Resources
NoSQL Data Modeling Techniques: Introduction
• NoSQL data modeling often starts from the application-specific queries as
opposed to relational modeling:
• Relational modeling is typically driven by the structure of available data. The main
design theme is “What answers do I have?”
• NoSQL data modeling is typically driven by application-specific access patterns, i.e. the
types of queries to be supported. The main design theme is “What questions do I
have?”
• NoSQL data modeling often requires a deeper understanding of
data structures and algorithms than relational database modeling does.
• Data duplication and denormalization are first-class citizens.
• Relational databases are not very convenient for hierarchical or graph-like
data modeling and processing.
• Graph databases are obviously a perfect solution for this area, but actually most of
NoSQL solutions are surprisingly strong for such problems.
Source: Highly Scalable Blog
NoSQL Data Modeling Techniques: Denormalization
• Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify/optimize query processing or to fit the user’s data into a particular data model.
• In general, denormalization is helpful for the following trade-offs:
• Query data volume or IO per query vs. total data volume.
• Using denormalization, one can group all data that is needed to process a query in one place.
• This often means that for different query flows the same data will be accessed in different combinations. Hence we need to duplicate data, which increases total data volume.
• Processing complexity vs. total data volume.
• Modeling-time normalization and consequent query-time joins obviously increase the complexity of the query processor, especially in distributed systems.
• Denormalization allows one to store data in a query-friendly structure to simplify query processing.
• Applicability
• Key-Value Stores, Document Databases, BigTable-style Databases
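The trade-off above can be made concrete with a small sketch. The order/customer structures are invented for illustration: the denormalized form answers the query with one lookup, at the cost of duplicating customer data in every order.

```python
# Normalized form: two separate "tables"; showing an order with its
# customer name requires a query-time join.
customers = {"c1": {"name": "Ada"}}
orders = [{"id": "o1", "customer_id": "c1", "total": 40}]

# Denormalized form: customer data is copied into each order document,
# so the query is a single lookup -- but the duplicated data must be
# kept in sync whenever the customer record changes.
orders_denorm = [
    {"id": "o1", "total": 40,
     "customer": {"id": "c1", "name": "Ada"}},
]

print(orders_denorm[0]["customer"]["name"])  # Ada -- no join required
```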
Source: Highly Scalable Blog
NoSQL Data Modeling Techniques: Aggregates
• All major genres of NoSQL provide soft schema capabilities:
• Key-Value Stores and Graph Databases typically do not place constraints on values, so values can have an arbitrary format.
• BigTable models support soft schema via a variable set of columns within a column family and a variable number of versions for one cell.
• Document databases are inherently schema-less, although some of them allow one to validate incoming data using a user-defined schema.
• The objective is to form classes of entities with complex internal structures (nested entities) and vary the structure of particular entities in order to:
• Minimize one-to-many relationships by means of nested entities and, consequently, reduce joins.
• Mask “technical” differences between business entities and model heterogeneous business entities using one collection of documents or one table.
• Applicability
• Key-Value Stores, Document Databases, BigTable-style Databases
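A sketch of an aggregate with nested entities follows; the product/variant structure is invented for illustration. The one-to-many relation (product to variants) is folded into a single document, so no join is needed to answer questions about a product.

```python
# One aggregate: a product with its variants nested inside a single
# document, instead of a one-to-many relation across two tables.
product = {
    "id": "p1",
    "name": "T-shirt",
    # Nested entities replace the join to a separate "variants" table.
    "variants": [
        {"sku": "p1-S", "size": "S", "stock": 4},
        {"sku": "p1-M", "size": "M", "stock": 0},
    ],
}

# "Which variants are in stock?" needs only this one document.
in_stock = [v["sku"] for v in product["variants"] if v["stock"] > 0]
print(in_stock)  # ['p1-S']
```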
Source: Highly Scalable Blog
NoSQL Data Modeling Techniques: Aggregates Example
Source: Highly Scalable Blog
NoSQL Data Modeling Techniques: Application-Side Joins
• Joins are rarely supported in NoSQL solutions.
• Joins are then often handled at design time as opposed to relational models where joins
are handled at query execution time.
• Query time joins almost always mean a performance penalty, but in many cases one can
avoid joins using Denormalization and Aggregates, i.e. embedding nested entities.
• Of course, in many cases joins are inevitable and must be handled by the
application:
• Many-to-many relationships are often modeled by links and require joins.
• Aggregates are often inapplicable when entity internals are the subject of frequent
modifications. It is usually better to keep a record that something happened and join the
records at query time, as opposed to changing a value.
• Applicability
• Key-Value Stores, Document Databases, BigTable-style Databases, Graph Databases
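An application-side join can be sketched as two fetches stitched together in code; the user/order data and helper name are invented for illustration. The database offers no JOIN, so the application resolves each foreign key itself.

```python
# Two "collections" with a link between them (orders reference users).
users = {"u1": {"name": "Ada"}, "u2": {"name": "Alan"}}
orders = [
    {"id": "o1", "user_id": "u1", "total": 40},
    {"id": "o2", "user_id": "u2", "total": 15},
]

def orders_with_user_names():
    """Application-side join: one user lookup per order's foreign key."""
    return [dict(order, user_name=users[order["user_id"]]["name"])
            for order in orders]

for row in orders_with_user_names():
    print(row["id"], row["user_name"])
```

Note the cost the slide warns about: the join happens at query time in application code, so every such query pays the extra lookups that denormalization or aggregates would have avoided.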
Source: Highly Scalable Blog
NoSQL Data Modeling Techniques: Others
• Atomic Aggregates
• Many, although not all, NoSQL solutions have limited transaction support. It is common to model data using the Aggregates technique to guarantee some of the ACID properties.
• Aggregates allow one to store a single business entity as one document, row or key-value pair and update it atomically (instead of normalized data that typically requires multi-place updates).
• Applicability
• Key-Value Stores, Document Databases, BigTable-style Databases
• Enumerable Keys
• Perhaps the greatest benefit of an unordered Key-Value data model is that entries can be partitioned across multiple servers by just hashing the key.
• Sorting makes things more complex, but sometimes an application is able to take advantage of ordered keys even if the storage doesn’t offer such a feature.
• Applicability
• Key-Value Stores
• More Here
Source: Highly Scalable Blog
Plan
• Why the Need for Databases That Scale High?
• What is NoSQL?
• NoSQL Database Taxonomy
• From CAP to PACELC
• NoSQL Database Comparisons and Decision Tree
• NoSQL Data Modeling Techniques
Resources
Key Papers
1. The Amazon Dynamo paper is classic. Almost everyone in the NoSQL world
has read this paper.
2. Google's Bigtable paper.
3. Werner Vogels's "Eventually Consistent" (originally published in ACM
Queue)
4. Brewer's CAP Theorem (a foundational bit of scalability theory) is well-
explained here. Also see Brewer's original slides from his famous July 2000
PODC keynote.
5. The slideshows from the June 11, 2009 NoSQL meetup in SFO.
Resources
• Excellent presentation from Felix Gessert
• Martin Fowler NoSQL dedicated site
• PWC Technology Forecast: “Remapping the database landscape”
• Highly Scalable Blog: “NoSQL data modeling techniques”
• Introduction NoSQL by Laurent Broudoux (French)
• NoSQLDatabase.org: "Your Ultimate Guide to the Non-Relational Universe!"
Copyright © William El Kaim 2016
Twitter: http://www.twitter.com/welkaim
SlideShare: http://www.slideshare.net/welkaim
EA Digital Codex: http://www.eacodex.com/
LinkedIn: http://fr.linkedin.com/in/williamelkaim
Claudine O'Sullivan