Intro to Big Data - cibtrc.com
NoSQL Database
By: Shahab Safaee & Morteza Zahedi
Software Engineering PhD
cibtrc.ir
Agenda
• Some history
• Relational databases
• ACID Properties
• Scaling Up
• Distributed Database Systems
• CAP Theorem
• What is NoSQL?
• BASE Transactions
• NoSQL Types
• Some Statistics
• NoSQL vs. SQL Summary
A brief history of databases
Relational databases
• Benefits of relational databases:
▫ Designed for all purposes
▫ ACID
▫ Strong consistency, concurrency, recovery
▫ Mathematical background
▫ Standard query language (SQL)
▫ Lots of tools to use with it, e.g. reporting, services, entity frameworks, ...
▫ Vertical scaling (up scaling)
ACID Properties
• Atomic:
▫ All of the work in a transaction completes (commit) or none of it completes.
▫ All operations in a transaction succeed or every operation is rolled back.
• Consistent:
▫ A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
▫ On the completion of a transaction, the database is structurally sound.
• Isolated:
▫ The results of any changes made during a transaction are not visible until the transaction has committed.
▫ Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.
• Durable:
▫ The results of a committed transaction survive failures.
▫ The results of applying a transaction are permanent, even in the presence of failures.
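As an illustration of atomicity and constraint-based consistency, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are assumptions made up for this example; they are not part of the slides.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            # Credit first, so a failed debit leaves work that must be undone.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit was rolled back.
        return False


def balances(conn):
    """Return {account name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical in-memory database with a consistency constraint (balance >= 0).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
```

Calling transfer(conn, "alice", "bob", 500) returns False and leaves both balances untouched, because the failed debit causes the already-applied credit to be rolled back.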
Era of Distributed Computing
But...
• Relational databases were not built for distributed applications.
Because...
• Joins are expensive
• Hard to scale horizontally
• Impedance mismatch occurs
• Expensive (product cost, hardware, maintenance)
And relational databases are weak in:
• Speed (performance)
• High availability
• Partition tolerance
Scaling Up
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database solutions
• Known as 'scaling out' or 'horizontal scaling'
• Different approaches include:
▫ Master-slave
▫ Sharding
Scaling RDBMS – Master/Slave
• Master-Slave
▫ All writes go to the master. All reads are performed against the replicated slave databases
▫ Critical reads may be incorrect, as writes may not yet have been propagated to the slaves
▫ Large data sets can pose problems, as the master needs to duplicate data to the slaves
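The stale-read problem described above can be sketched with a toy in-memory model. The class and method names below are hypothetical, and replication is simulated by an explicit replicate() call rather than a real replication protocol.

```python
class MasterSlaveStore:
    """Toy model: writes go to the master, reads go to replicas,
    and replication happens later (asynchronously)."""

    def __init__(self, n_replicas=2):
        self.master = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self._pending = []  # writes not yet propagated to the replicas
        self._next = 0      # round-robin pointer for read load balancing

    def write(self, key, value):
        """Apply the write on the master and queue it for replication."""
        self.master[key] = value
        self._pending.append((key, value))

    def read(self, key):
        """Read from one replica; may return stale data."""
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.get(key)

    def read_critical(self, key):
        """Read from the master when staleness is not acceptable."""
        return self.master.get(key)

    def replicate(self):
        """Propagate every pending write to every replica."""
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()
```

Until replicate() runs, read() does not see a value that write() has already stored on the master, which is the "critical reads may be incorrect" case.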
Scaling RDBMS - Sharding
• Partition or sharding
▫ Scales well for both reads and writes
▫ Not transparent, application needs to be partition-aware
▫ Can no longer have relationships/joins across partitions
▫ Loss of referential integrity across shards
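A minimal sketch of why the application must be partition-aware, assuming simple hash-based routing. ShardedStore and its methods are hypothetical names, and real systems often use range-based or directory-based routing instead.

```python
import hashlib


class ShardedStore:
    """Toy sharded key-value table: each key lives on exactly one shard."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def shard_for(self, key):
        """Map a string key to a shard index with a stable hash."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        # Single-key access touches one shard only, so it scales well.
        return self.shards[self.shard_for(key)].get(key)

    def scan(self, predicate):
        # Anything that is not a key lookup must fan out to every shard;
        # this is why joins across partitions are impractical.
        return [
            row
            for shard in self.shards
            for row in shard.values()
            if predicate(row)
        ]
```

Note that with modulo routing, changing the number of shards remaps most keys; consistent hashing is one way to limit that.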
Sharding Advantages
• Tables are divided and distributed into multiple servers
• Reduces index size, which generally improves search performance
• A database shard can be placed on separate hardware, greatly improving performance
• If the database shard is based on some real-world segmentation of the data, then it may be possible to infer the appropriate shard membership easily and automatically
Other ways to scale RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
▫ This involves de-normalizing data
• In-memory databases
What do we need?
We need a distributed database system with these features:
• High concurrency
• High availability
• Fault tolerance
• High scalability
• Low latency
• Efficient storage
• Reduced management and operation cost
Having all of these at once is impossible, according to the CAP theorem.
Distributed Database Systems
• Data is stored across several sites that share no physical component.
• Systems that run on each site are independent of each other.
• Appears to user as a single system.
Distributed Data Storage
• Partitioning:
▫ Data is partitioned into several fragments and stored at different sites.
▫ Horizontal – by rows.
▫ Vertical – by columns.
• Replication:
▫ The system maintains multiple copies of data, stored at different sites.
Replication and partitioning can be combined!
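The difference between horizontal and vertical partitioning can be sketched on a small in-memory table. The sample rows, column names, and helper functions are invented for illustration only.

```python
# Hypothetical sample table: one dict per row.
rows = [
    {"id": 1, "name": "Ana", "city": "Tehran", "balance": 120},
    {"id": 2, "name": "Ben", "city": "Shiraz", "balance": 75},
    {"id": 3, "name": "Cyd", "city": "Tabriz", "balance": 300},
    {"id": 4, "name": "Dov", "city": "Tehran", "balance": 10},
]


def horizontal_partition(rows, key, n_sites):
    """Split by rows: every fragment keeps all columns of some rows."""
    sites = [[] for _ in range(n_sites)]
    for row in rows:
        sites[row[key] % n_sites].append(row)
    return sites


def vertical_partition(rows, key, column_groups):
    """Split by columns: every fragment keeps some columns of all rows.
    The key column is copied into each fragment so rows can be rejoined."""
    return [
        [{column: row[column] for column in (key, *group)} for row in rows]
        for group in column_groups
    ]


def rejoin(fragments, key):
    """Rebuild full rows from vertical fragments by matching on the key."""
    merged = {}
    for fragment in fragments:
        for partial in fragment:
            merged.setdefault(partial[key], {}).update(partial)
    return list(merged.values())
```

For example, vertical_partition(rows, "id", [("name", "city"), ("balance",)]) yields one fragment with the descriptive columns and one with the balance, and rejoin(...) on those fragments reconstructs the original rows.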
Partitioning
• Locality of reference – data is most likely to be updated and queried locally.
Replication
• Pros – Increased availability of data and faster query evaluation.
• Cons – Increased cost of updates and complexity of concurrency control.
CAP Theorem
• In 2000, UC Berkeley researcher Eric Brewer presented his now-foundational CAP theorem (consistency, availability and partition tolerance), which states that it is impossible for a distributed computer system to simultaneously provide all three CAP guarantees.
• In May 2012, Brewer clarified some of his positions on the oft-used “two out of three” concept.
CAP Theorem
• Consistency:
▫ All nodes see the same data at the same time
• Availability:
▫ A guarantee that every request receives a response about whether it succeeded or failed
• Partition tolerance:
▫ The system continues to operate despite arbitrary message loss or failure of part of the system
CAP Theorem
CAP – 2 of 3: consistency and availability
• If there are no partitions, it is clearly possible to provide consistent, available data (e.g. read-any write-all).
• Examples:
▫ RDBMSs
CAP – 2 of 3: consistency and partition tolerance
• Trivial:
▫ The trivial system that ignores all requests meets these requirements.
• Best-effort availability:
▫ Read-any write-all systems will become unavailable only when messages are lost.
• Examples:
▫ Distributed database systems, BigTable
CAP – 2 of 3: availability and partition tolerance
• Trivial:
▫ The service can trivially return the initial value in response to every request.
• Best-effort consistency:
▫ A quorum-based system, modified to time out lost messages, will only return inconsistent (and, in particular, stale) data when messages are lost.
• Examples:
▫ Web caches, Dynamo
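The quorum idea can be sketched as follows. This is a toy model under stated assumptions: N replicas, write quorum W and read quorum R with R + W > N and W > N/2, a per-key version number, and a down set that simulates unreachable nodes. The best_effort flag is a hypothetical stand-in for "time out lost messages and answer anyway".

```python
class QuorumStore:
    """Toy quorum-replicated register store (not a real protocol)."""

    def __init__(self, n=3, w=2, r=2):
        if r + w <= n or 2 * w <= n:
            raise ValueError("need R + W > N and W > N/2")
        self.n, self.w, self.r = n, w, r
        self.replicas = [{} for _ in range(n)]  # key -> (version, value)
        self.down = set()                        # indexes of unreachable replicas

    def _alive(self):
        return [i for i in range(self.n) if i not in self.down]

    def write(self, key, value):
        alive = self._alive()
        if len(alive) < self.w:
            raise RuntimeError("write quorum unavailable")
        # Next version is one more than the newest version any reachable replica holds.
        newest = max(self.replicas[i].get(key, (0, None))[0] for i in alive)
        for i in alive[: self.w]:  # only W replicas are updated before returning
            self.replicas[i][key] = (newest + 1, value)
        return newest + 1

    def read(self, key, best_effort=False):
        alive = self._alive()
        if not alive or (len(alive) < self.r and not best_effort):
            raise RuntimeError("read quorum unavailable")
        # Ask up to R replicas and return the value with the highest version.
        answers = [self.replicas[i].get(key, (0, None)) for i in alive[-self.r:]]
        return max(answers, key=lambda answer: answer[0])[1]
```

With a full quorum the read always overlaps the latest write. With best_effort=True the store stays available during a partition but may return a stale value, which is the trade-off this slide describes.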
What is NoSQL?
• Stands for Not Only SQL
▫ The term was first used by Carlo Strozzi and later redefined by Eric Evans.
• Class of non-relational data storage systems
• Usually do not require a fixed table schema, nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties (based on the CAP theorem)
NoSQL Definition
From www.nosql-database.org:
• Next-generation databases mostly addressing some of the points:
▫ being non-relational
▫ distributed
▫ open-source
▫ horizontally scalable.
• The original intention has been modern web-scale databases. The movement began in early 2009 and is growing rapidly.
• Often more characteristics apply, such as:
▫ schema-free
▫ easy replication support
▫ simple API
▫ eventually consistent / BASE (not ACID)
▫ a huge amount of data, and more.
NOSQL Common concepts
• Map-reduce
• Sharding
• MVCC
• Consistent hashing
• Vector clocks
Consistent hashing
• The idea behind consistent hashing is to use the same hash function for both the object hashing and the node hashing.
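A minimal sketch of a consistent hash ring, assuming MD5 as the shared hash function and one ring position per node (production systems usually add virtual nodes). HashRing and ring_hash are hypothetical names.

```python
import bisect
import hashlib


def ring_hash(name):
    """One hash function for both node names and object keys."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Nodes and keys share one circular hash space; a key belongs to the
    first node found clockwise from the key's position."""

    def __init__(self, nodes=()):
        self._points = []  # sorted ring positions of the nodes
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        point = ring_hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def remove(self, node):
        point = ring_hash(node)
        self._points.remove(point)
        del self._owner[point]

    def node_for(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First node position after the key's position, wrapping around the ring.
        index = bisect.bisect_right(self._points, ring_hash(key))
        return self._owner[self._points[index % len(self._points)]]
```

The design benefit is that removing a node only reassigns the keys that node owned; every other key keeps its owner, unlike plain modulo hashing.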
Vector clocks
• Alice, Ben, Cathy, and Dave are planning to meet next week
date = Wednesday vclock = Alice:1
date = Tuesday vclock = Alice:1, Ben:1, Dave:1
date = Thursday vclock = Alice:1, Ben:1, Cathy:1, Dave:2
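The clocks above can be compared mechanically. The sketch below represents a vector clock as a dict from actor to counter. The three clocks are the ones shown on the slide; cathy_thursday is an assumption added here to illustrate a conflicting update and does not appear in the slide.

```python
def descends(a, b):
    """True if clock `a` has seen every event that clock `b` has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def concurrent(a, b):
    """True if neither clock descends from the other (a conflict)."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Entry-wise maximum: the smallest clock that descends from both."""
    return {
        actor: max(a.get(actor, 0), b.get(actor, 0))
        for actor in a.keys() | b.keys()
    }


def increment(clock, actor):
    """Return a copy of `clock` with `actor`'s counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


# Clocks taken from the slide.
wednesday = {"Alice": 1}
tuesday = {"Alice": 1, "Ben": 1, "Dave": 1}
thursday = {"Alice": 1, "Ben": 1, "Cathy": 1, "Dave": 2}

# Assumption (not on the slide): Cathy replied to Alice directly,
# without having seen the Tuesday proposal.
cathy_thursday = {"Alice": 1, "Cathy": 1}
```

Under that assumption, tuesday and cathy_thursday are concurrent, and the slide's final clock equals increment(merge(tuesday, cathy_thursday), "Dave"), that is, Dave resolving the conflict.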
NoSQL Distinguishing Characteristics
• Large data volumes
▫ Google's “big data”
• Scalable replication and distribution
▫ Potentially thousands of machines
▫ Potentially distributed around the world
• Queries need to return answers quickly
• Mostly query, few updates
• Asynchronous inserts and updates
• Schema-less
• ACID transaction properties are not needed – BASE
• Open source development
• Weaker concurrency model than ACID
• Efficient use of distributed indexes and RAM
BASE Transactions
• Acronym contrived to be the opposite of ACID
▫ Basically Available: the database appears to work most of the time (replication and sharding mechanisms).
▫ Soft state: stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time. Guaranteeing consistency is left to the application developer.
▫ Eventually Consistent: stores exhibit consistency at some later point (e.g., lazily at read time). Consistency is guaranteed only at some undefined future time.
• Characteristics
▫ Weak consistency
▫ Availability first
▫ Optimistic
▫ Simpler and faster
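"Consistency lazily at read time" is often implemented as read repair. The sketch below is a toy model of that idea. The class name, the single global version counter, and the explicit replica argument are simplifying assumptions; real systems typically use timestamps or vector clocks.

```python
class EventuallyConsistentStore:
    """Toy BASE-style store: a write lands on one replica and returns
    immediately; other replicas converge later via read repair."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]  # key -> (version, value)
        self._version = 0  # assumption: one global counter orders all writes

    def write(self, key, value, replica=0):
        """Availability first: accept the write at a single replica."""
        self._version += 1
        self.replicas[replica][key] = (self._version, value)

    def read_local(self, key, replica):
        """Read one replica only; may be stale (soft state)."""
        return self.replicas[replica].get(key, (0, None))[1]

    def read_repair(self, key):
        """Read all replicas, pick the newest version, and copy it to
        every replica so they become consistent at read time."""
        newest = max(
            (replica.get(key, (0, None)) for replica in self.replicas),
            key=lambda entry: entry[0],
        )
        if newest[0] > 0:
            for replica in self.replicas:
                replica[key] = newest
        return newest[1]
```

Before any repair, read_local on another replica returns None for a freshly written key; after one read_repair call, every replica returns the same value.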
BASE vs ACID
• ACID:
▫ Strong consistency
▫ Less availability
▫ Pessimistic concurrency
▫ Complex
• BASE:
▫ Availability is the most important thing
▫ Weaker (eventual) consistency
▫ Simple and fast
▫ Optimistic
CAP Theorem with ACID and BASE Visualized
How did we get here?
• Explosion of social media sites (Facebook, Twitter) with large data needs
• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
• Just as development moved to dynamically-typed languages (Ruby/Groovy), data is shifting to dynamically-typed data with frequent schema changes
• Open-source community
NoSQL Types
NoSQL databases are classified into four types:
• Key Value pair based
• Column based
• Document based
• Graph based
Key Value Pair Based
• Designed for processing dictionaries. Dictionaries contain a collection of records having fields containing data.
• Records are stored and retrieved using a key that uniquely identifies the record and is used to quickly find the data within the database.
Example:
CouchDB, Oracle NoSQL Database, Riak, etc.
• We use it for storing session information, user profiles, shopping cart data.
• We would avoid it when we need to query data having relationships between entities.
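A minimal sketch of the key-value model for session data, assuming an in-memory store whose values are opaque bytes. KeyValueStore and the session fields are hypothetical and not tied to any product listed above.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes looked up by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


def save_session(store, session_id, session):
    """The application, not the store, decides how to serialize the value."""
    store.put(f"session:{session_id}", json.dumps(session).encode("utf-8"))


def load_session(store, session_id):
    raw = store.get(f"session:{session_id}")
    return None if raw is None else json.loads(raw.decode("utf-8"))
```

Because the store cannot look inside the value, a question such as "which sessions contain a given product" would require reading and decoding every record, which is why this model is a poor fit for relationship queries.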
Column based
• Stores data as column families containing rows that have many columns associated with a row key. Each row can have different columns.
• Column families are groups of related data that are accessed together.
Example:
Cassandra, HBase, Hypertable, and Amazon DynamoDB.
• We use it for content management systems, blogging platforms, log aggregation.
• We would avoid it for systems that are in early development or have changing query patterns.
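The column-family layout can be pictured as a map from row key to a per-row map of columns. The sketch below is only a data-structure illustration with invented names and sample data; it does not reflect the storage engine or API of any product listed above.

```python
class ColumnFamily:
    """Toy column family: row key -> {column name: value}.
    Rows in the same family may have different columns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, columns):
        """Add or overwrite columns of one row without touching the others."""
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, column=None):
        """Return one column value, or a copy of the whole row."""
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


# Hypothetical blog data: the two rows do not share the same columns.
posts = ColumnFamily("posts")
posts.insert("post:1", {"title": "Intro to NoSQL", "body": "..."})
posts.insert("post:2", {"title": "CAP in practice", "tags": "cap,base"})
```

Here posts.get("post:1", "tags") simply returns None: a missing column is not a schema violation, it is just absent from that row.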
Document Based
• The database stores and retrieves documents. It stores documents in the value part of the key-value store.
• Documents are self-describing, hierarchical tree data structures consisting of maps, collections, and scalar values.
Example:
Lotus Notes, MongoDB, CouchDB, OrientDB, RavenDB.
• We use it for content management systems, blogging platforms, web analytics, real-time analytics, e-commerce applications.
• We would avoid it for systems that need complex transactions spanning multiple operations or queries against varying aggregate structures.
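Unlike a plain key-value store, a document store can look inside the value. The sketch below models documents as nested Python dicts and supports equality queries on dotted paths. DocumentStore, get_path, and the sample fields are hypothetical and do not mirror the query API of any product listed above.

```python
def get_path(document, path):
    """Follow a dotted path such as 'author.name' through nested maps.
    Returns None if any step is missing."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


class DocumentStore:
    """Toy document store: documents are stored by id and queried by field."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def find(self, path, value):
        """Return every document whose field at `path` equals `value`."""
        return [
            doc for doc in self._docs.values() if get_path(doc, path) == value
        ]
```

For example, after inserting {"title": "Hello", "author": {"name": "alice"}, "tags": ["intro"]}, the call find("author.name", "alice") returns that document even though other documents may lack an author field entirely.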
Graph Based
• Stores entities and the relationships between them as nodes and edges of a graph, respectively. Entities have properties.
• Traversing the relationships is very fast, as the relationship between nodes is not calculated at query time but is actually persisted as a relationship.
Example:
Neo4j, InfiniteGraph, OrientDB, FlockDB.
• It is well suited for connected data, such as social networks, spatial data, routing information for goods and supply.
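A minimal sketch of why traversal is cheap when relationships are stored explicitly: each node keeps its outgoing edges, so following a relationship is a direct lookup rather than a join. The Graph class, the FRIEND relationship, and the sample people are invented for illustration.

```python
from collections import deque


class Graph:
    """Toy property graph with persisted, typed, directed edges."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (relationship, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, relationship, target):
        self.edges.setdefault(source, []).append((relationship, target))

    def neighbors(self, node_id, relationship):
        """Follow stored edges of one type; no join is computed."""
        return [
            target
            for rel, target in self.edges.get(node_id, [])
            if rel == relationship
        ]

    def within(self, start, relationship, max_hops):
        """Breadth-first search: nodes reachable in at most `max_hops` steps."""
        seen = {start}
        frontier = deque([(start, 0)])
        found = []
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for nxt in self.neighbors(node, relationship):
                if nxt not in seen:
                    seen.add(nxt)
                    found.append(nxt)
                    frontier.append((nxt, depth + 1))
        return found
```

A "friends of friends" query is then within("alice", "FRIEND", 2), which touches only the nodes actually reachable from alice.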
Top 10 NoSQL Databases with Data Models
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
▫ Down nodes easily replaced
▫ No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
What is not provided by NoSQL
• Joins
• Group by
• ACID transactions
• SQL
• Integration with applications that are based on SQL
Some Statistics
• Facebook Search
• MySQL > 50 GB Data
▫ Writes Average : ~300 ms
▫ Reads Average : ~350 ms
• Rewritten with Cassandra > 50 GB Data
▫ Writes Average : 0.12 ms
▫ Reads Average : 15 ms
Don’t forget about the DBA
• It does not matter if the data is deployed on a NoSQL platform instead of an RDBMS.
• Still need to address:
▫ Backups & recovery
▫ Capacity planning
▫ Performance monitoring
▫ Data integration
▫ Tuning & optimization
• What happens when things don't work as expected and nodes are out of sync, or data corruption occurs at 2am?
• Who you gonna call?
▫ DBA and SysAdmin need to be on board
NoSQL vs. SQL Summary
NoSQL vs. SQL Summary Features
Visual Guide to NoSQL Systems
Most popular DBMS
Trend Popularity
Ranking of Key-value Stores
Ranking of Document Stores
Ranking of Graph DBMS
Ranking scores per category in percent, July 2018
Popularity changes per category, July 2018
Critical Reception?
• Some people consider that companies do not miss anything if they do not switch to NoSQL databases, as long as a relational DBMS does its job.
Blinding performance depends on removing overheads in database systems (logging/locking/latching/buffer management).
• Some critics see NoSQL databases as nothing new compared to other attempts, like object databases, which have been around for decades.
Summary
References
• http://nosql-database.org/
• http://wikibon.org/wiki/v/21_NoSQL_Innovators_to_Look_for_in_2020#Introduction
• https://db-engines.com
• http://basho.com/posts/technical/why-vector-clocks-are-easy/
• …