Intro to Big Data - cibtrc.com

Post on 27-May-2020


NoSQL Database

By: Shahab Safaee & Morteza Zahedi

Software Engineering PhD

Email: safaee.shx@gmail.com , morteza.zahedi.a@gmail.com

cibtrc.ir


Agenda • Some history

• Relational databases

• ACID Properties

• Scaling Up

• Distributed Database Systems

• CAP Theorem

• What is NoSQL?

• BASE Transactions

• NoSQL Types

• Some Statistics

• NoSQL vs. SQL Summary

A brief history of databases


Relational databases


• Benefits of Relational databases: ▫ Designed for all purposes

▫ ACID

▫ Strong consistency, concurrency, recovery

▫ Mathematical background

▫ Standard Query language (SQL)

▫ Lots of supporting tools, e.g. reporting services, entity frameworks, ...

▫ Vertical scaling (up scaling)

ACID Properties

• Atomic:

▫ All of the work in a transaction completes (commit) or none of it completes.

▫ All operations in a transaction succeed or every operation is rolled back.

• Consistent:

▫ A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.

▫ On the completion of a transaction, the database is structurally sound.

• Isolated:

▫ The results of any changes made during a transaction are not visible until the transaction has committed.

▫ Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.

• Durable:

▫ The results of a committed transaction survive failures.

▫ The results of applying a transaction are permanent, even in the presence of failures.
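
To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper and the balances are hypothetical and only serve as an illustration: a transfer that would violate a constraint is rolled back as a whole, so the credit that was already applied disappears again.

```python
import sqlite3

# Hypothetical schema for illustration: balances may never become negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst was undone as well.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` fails the constraint on the second statement, and the first statement is rolled back with it: either both updates are visible or neither is.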

Era of Distributed Computing


But...

• Relational databases were not built for distributed applications.

Because...

• Joins are expensive

• Hard to scale horizontally

• Impedance mismatch occurs

• Expensive (product cost, hardware, maintenance)

And they are weak in:

• Speed (performance)

• High availability

• Partition tolerance

Scaling Up


• Issues with scaling up when the dataset is just too big

• RDBMS were not designed to be distributed

• Began to look at multi-node database solutions

• Known as 'scaling out' or 'horizontal scaling'

• Different approaches include:

▫ Master-slave

▫ Sharding

Scaling RDBMS – Master/Slave


• Master-Slave

▫ All writes go to the master; all reads are performed against the replicated slave databases

▫ Critical reads may be incorrect, as writes may not yet have been propagated to the slaves

▫ Large data sets can pose problems, as the master needs to duplicate data to the slaves
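
The stale-read problem can be sketched in a few lines. This is a toy in-memory model, not how any particular RDBMS implements replication; the `Master`, `Replica` and `replicate` names are made up for the illustration, and replication is assumed to be asynchronous.

```python
class Replica:
    """Read-only copy that only changes when the master ships its writes."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master:
    """Accepts all writes and propagates them to the replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []  # writes that have not been shipped yet

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Asynchronous step: until this runs, replicas serve stale data.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


replica = Replica()
master = Master([replica])
master.write("stock:book", 3)
stale = replica.read("stock:book")   # write has not been propagated yet
master.replicate()
fresh = replica.read("stock:book")   # now the replica has caught up
```

Between `write` and `replicate` the replica still answers with the old state, which is exactly why critical reads may have to go to the master.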

Scaling RDBMS - Sharding


• Partition or sharding

▫ Scales well for both reads and writes

▫ Not transparent, application needs to be partition-aware

▫ Can no longer have relationships/joins across partitions

▫ Loss of referential integrity across shards

Sharding Advantages

• Tables are divided and distributed into multiple servers

• Reduces index size, which generally improves search performance

• A database shard can be placed on separate hardware, greatly improving performance

• If the database shard is based on some real-world segmentation of the data, it may be possible to infer the appropriate shard membership easily and automatically
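
As a rough sketch of what "partition-aware" means for the application, the following toy store routes each key to a shard by hashing it. The `ShardedStore` class and the key names are hypothetical; a real deployment would put each shard on its own server, and might shard by a real-world attribute (such as region) instead of a hash.

```python
import hashlib


class ShardedStore:
    """Toy sharded store: each shard is a plain dict standing in for a server."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for(self, key):
        # Stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        # A lookup by key touches exactly one shard.
        return self.shards[self.shard_for(key)].get(key)

    def count_where(self, predicate):
        # Anything that is not a key lookup has to fan out to every shard.
        return sum(1 for shard in self.shards for v in shard.values() if predicate(v))


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", {"id": i, "active": i % 2 == 0})
```

Key lookups scale with the number of shards, while the fan-out in `count_where` hints at why joins and referential integrity across shards are lost.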


Other ways to scale RDBMS


• Multi-Master replication

• INSERT only, no UPDATEs/DELETEs

• No JOINs, thereby reducing query time

▫ This involves de-normalizing data

• In-memory databases

What do we need?

We need a distributed database system with the following features:

• High concurrency

• High availability

• Fault tolerance

• High scalability

• Low latency

• Efficient storage

• Reduced management and operating cost

Which is impossible, according to the CAP theorem!

Distributed Database Systems


• Data is stored across several sites that share no physical component.

• Systems that run on each site are independent of each other.

• Appears to user as a single system.

Distributed Data Storage


• Partitioning :

▫ Data is partitioned into several fragments and stored in different sites.

▫ Horizontal – by rows.

▫ Vertical – by columns.

• Replication :

▫ System maintains multiple copies of data, stored in different sites.

Replication and Partitioning can be combined !
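
A small sketch of the two partitioning styles, using a made-up list of rows. The `rows` data and both helper functions are illustrative assumptions; the modulo rule for choosing a site is just one possible placement policy.

```python
rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Linus", "city": "Helsinki"},
    {"id": 3, "name": "Grace", "city": "New York"},
    {"id": 4, "name": "Alan", "city": "London"},
]


def horizontal_partition(rows, num_sites):
    """Split by rows: every site holds complete rows for a subset of ids."""
    sites = [[] for _ in range(num_sites)]
    for row in rows:
        sites[row["id"] % num_sites].append(row)
    return sites


def vertical_partition(rows, column_groups):
    """Split by columns: every fragment keeps the key 'id' so rows can be rejoined."""
    return [
        [{column: row[column] for column in ("id", *group)} for row in rows]
        for group in column_groups
    ]


by_rows = horizontal_partition(rows, num_sites=2)
by_columns = vertical_partition(rows, [("name",), ("city",)])
```

Replicating each resulting fragment to more than one site would combine partitioning with replication.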

Partitioning


• Locality of reference – data is most likely to be updated and queried locally.

Replication

• Pros – Increased availability of data and faster query evaluation.

• Cons – Increased cost of updates and complexity of concurrency control.

CAP Theorem

• In 2000, UC Berkeley researcher Eric Brewer presented his now foundational CAP theorem (consistency, availability and partition tolerance), which states that it is impossible for a distributed computer system to simultaneously provide all three CAP guarantees.

• In May 2012, Brewer clarified some of his positions on the oft-used "two out of three" concept.

CAP Theorem

• Consistency:

▫ all nodes see the same data at the same time

• Availability:

▫ a guarantee that every request receives a response about whether it was successful or failed

• Partition tolerance:

▫ the system continues to operate despite arbitrary message loss or failure of part of the system

CAP Theorem


CAP – 2 of 3

• If there are no partitions, it is clearly possible to provide consistent, available data (e.g. read-any write-all).

• Examples:

▫ RDBMSs

CAP – 2 of 3


• Trivial:

▫ The trivial system that ignores all requests meets these requirements.

• Best-effort availability:

▫ Read-any write-all systems will become unavailable only when messages are lost.

• Examples:

▫ Distributed database systems, BigTable

CAP – 2 of 3


• Trivial:

▫ The service can trivially return the initial value in response to every request.

• Best-effort consistency:

▫ A quorum-based system, modified to time out lost messages, will only return inconsistent (and, in particular, stale) data when messages are lost.

• Examples:

▫ Web caches, Dynamo
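
The quorum idea mentioned above can be sketched as follows. This is a deliberately simplified model and not Dynamo's actual protocol: `QuorumRegister` is a hypothetical class, writes always go to the first W replicas and reads always ask the last R replicas, which is the worst case for overlap. With N replicas, a read is certain to see the latest write only when R + W > N.

```python
def quorums_overlap(n, r, w):
    """A read quorum and a write quorum must share a replica when r + w > n."""
    return r + w > n


class QuorumRegister:
    """Single value replicated on n nodes, stored as (version, value) pairs."""

    def __init__(self, n, r, w):
        self.replicas = [(0, None)] * n
        self.r = r
        self.w = w
        self.version = 0

    def write(self, value):
        self.version += 1
        for i in range(self.w):  # only the first w replicas acknowledge
            self.replicas[i] = (self.version, value)

    def read(self):
        contacted = self.replicas[-self.r:]  # only the last r replicas answer
        return max(contacted, key=lambda pair: pair[0])[1]


strict = QuorumRegister(n=3, r=2, w=2)   # 2 + 2 > 3: quorums overlap
strict.write("v1")
sloppy = QuorumRegister(n=3, r=1, w=1)   # 1 + 1 <= 3: reads may be stale
sloppy.write("v1")
```

Here `strict.read()` returns the written value, while `sloppy.read()` still returns the initial state; lost messages in a real system have the same effect as contacting too few replicas.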

What is NoSQL?


• Stands for Not Only SQL

▫ The term was coined by Carlo Strozzi and later redefined by Eric Evans.

• Class of non-relational data storage systems

• Usually do not require a fixed table schema nor do they use the concept of joins

• All NoSQL offerings relax one or more of the ACID properties (Based on the CAP theorem)

NoSQL Definition


From www.nosql-database.org:

• Next Generation Databases mostly addressing some of the points: ▫ being non-relational

▫ distributed

▫ open-source

▫ horizontally scalable.

• The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly.

• Often more characteristics apply, such as:

▫ schema-free

▫ easy replication support

▫ simple API

▫ eventually consistent / BASE (not ACID)

▫ huge amounts of data, and more.

NoSQL Common Concepts

• Map-reduce

• Sharding

• MVCC

• Consistent hashing

• Vector clocks


Consistent hashing

• The idea behind consistent hashing is to use the same hash function for both the object hashing and the node hashing.
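
A minimal sketch of that idea: nodes and keys are hashed with the same function onto one ring, and a key belongs to the first node found clockwise from its position. The `ConsistentHashRing` class and node names are hypothetical, and the sketch leaves out virtual nodes, which real systems usually add to even out the load.

```python
import bisect
import hashlib


def ring_hash(value):
    """Same hash function for node names and object keys."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted node positions on the ring
        self._owners = {}   # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        point = ring_hash(node)
        bisect.insort(self._points, point)
        self._owners[point] = node

    def remove_node(self, node):
        point = ring_hash(node)
        self._points.remove(point)
        del self._owners[point]

    def node_for(self, key):
        # First node clockwise from the key; wrap around at the end of the ring.
        index = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owners[self._points[index]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(200)]
before = {key: ring.node_for(key) for key in keys}
ring.add_node("node-d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now belongs to `node-d`; all other keys stay where they were, whereas a plain `hash(key) % number_of_nodes` scheme would reshuffle most keys when a node is added.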


Vector clocks

• Alice, Ben, Cathy, and Dave are planning to meet next week

▫ date = Wednesday, vclock = Alice:1

▫ date = Tuesday, vclock = Alice:1, Ben:1, Dave:1

▫ date = Thursday, vclock = Alice:1, Ben:1, Cathy:1, Dave:2
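
The three clocks above can be compared mechanically. In the sketch below the helper functions are illustrative, and `cathy_proposal` (Alice:1, Cathy:1) is an assumed intermediate state that is not shown on the slide: it stands for a Thursday proposal Cathy makes without having seen the Tuesday update, which Dave then resolves.

```python
def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def concurrent(a, b):
    """Neither clock descends from the other: the versions conflict."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Element-wise maximum of two clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


wednesday = {"Alice": 1}
tuesday = {"Alice": 1, "Ben": 1, "Dave": 1}
thursday = {"Alice": 1, "Ben": 1, "Cathy": 1, "Dave": 2}

# Assumed intermediate state, not on the slide: Cathy proposes Thursday
# after seeing only Alice's original message.
cathy_proposal = {"Alice": 1, "Cathy": 1}

conflict = concurrent(tuesday, cathy_proposal)
resolved = increment(merge(tuesday, cathy_proposal), "Dave")
```

Under that assumption, `conflict` is true and `resolved` equals the Thursday clock from the slide: Dave merges both histories and bumps his own counter to record the resolution.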

NoSQL Distinguishing Characteristics

• Large data volumes

▫ Google's "big data"

• Scalable replication and distribution

▫ Potentially thousands of machines

▫ Potentially distributed around the world

• Queries need to return answers quickly

• Mostly query, few updates

• Asynchronous Inserts & Updates

• Schema-less

• ACID transaction properties are not needed – BASE

• Open source development

• Weaker Concurrency Model than ACID

• Efficient use of distributed indexes and RAM

BASE Transactions

• Acronym contrived to be the opposite of ACID

▫ Basically Available: the database appears to work most of the time (replication and sharding mechanisms).

▫ Soft state: stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time. Guaranteeing consistency is left to the application developer.

▫ Eventually Consistent: stores exhibit consistency at some later point (e.g., lazily at read time). Consistency is guaranteed only at some undefined future time.

• Characteristics

▫ Weak consistency

▫ Availability first

▫ Optimistic

▫ Simpler and faster

BASE vs ACID

• ACID:

▫ Strong Consistency

▫ Less Availability

▫ Pessimistic Concurrency

▫ Complex

• BASE:

▫ Availability is the most important thing.

▫ Weaker consistency (Eventual)

▫ Simple and Fast

▫ Optimistic

CAP Theorem with ACID and BASE Visualized

How did we get here?


• Explosion of social media sites (Facebook, Twitter) with large data needs

• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)

• Just as with the move to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes

• Open-source community

NoSQL Types

NoSQL databases are classified into four types:

• Key Value pair based

• Column based

• Document based

• Graph based

Key Value Pair Based

• Designed for processing dictionaries. A dictionary contains a collection of records whose fields contain data.

• Records are stored and retrieved using a key that uniquely identifies the record and is used to quickly find the data within the database.

Example:

CouchDB, Oracle NoSQL Database, Riak, etc.

• We use it for storing session information, user profiles, shopping cart data.

• We would avoid it when we need to query data having relationships between entities.
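
As a minimal sketch of the access pattern, the toy store below supports only put, get and delete by key. The `KeyValueStore` class and the `session:42` key are made up for the illustration and do not mirror the API of any product listed above.

```python
class KeyValueStore:
    """Toy key-value store: the value is opaque, only the key is indexed."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


sessions = KeyValueStore()
sessions.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
cart = sessions.get("session:42")["cart"]
```

Everything goes through the key, which suits session or shopping-cart data; a question such as "which sessions contain a pen?" would require reading every value, which is why relationship-heavy queries are a poor fit.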

Column based

• Stores data as column families containing rows that have many columns associated with a row key. Each row can have different columns.

• Column families are groups of related data that is accessed together.

Example:

Cassandra, HBase, Hypertable, and Amazon DynamoDB.

• We use it for content management systems, blogging platforms, log aggregation.

• We would avoid it for systems that are in early development or that have changing query patterns.
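
The data model can be pictured as nested maps: column family, then row key, then columns. The sketch below is only a mental model with a hypothetical `ColumnFamily` class; it says nothing about how Cassandra or HBase lay the data out on disk.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy column family: row key -> {column name: value}."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)

    def insert(self, row_key, columns):
        # Rows are sparse: each row stores only the columns it actually has.
        self.rows[row_key].update(columns)

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


posts = ColumnFamily("posts")
posts.insert("post:1", {"title": "Hello", "body": "First post"})
posts.insert("post:2", {"title": "NoSQL", "tags": "db,intro", "views": 10})
```

The two rows share a row-key space but not a fixed set of columns, which is the "each row can have different columns" property in practice.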

Document Based


• The database stores and retrieves documents. It stores documents in the value part of the key-value store.

• Self-describing, hierarchical tree data structures consisting of maps, collections, and scalar values.

Example:

Lotus Notes, MongoDB, CouchDB, OrientDB, RavenDB.

• We use it for content management systems, blogging platforms, web analytics, real-time analytics, e-commerce applications.

• We would avoid it for systems that need complex transactions spanning multiple operations or queries against varying aggregate structures.
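
A rough sketch of the document model, using JSON text as the stored value. The `save`, `load` and `find` helpers and the sample documents are hypothetical; real document databases offer indexes and far richer query languages than this linear scan.

```python
import json

documents = {}  # document id -> JSON text (the "value part" of a key-value store)


def save(doc_id, doc):
    documents[doc_id] = json.dumps(doc)


def load(doc_id):
    return json.loads(documents[doc_id])


def find(path, expected):
    """Return ids of documents whose value at the dotted `path` equals `expected`."""
    matches = []
    for doc_id, raw in documents.items():
        node = json.loads(raw)
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
        if node == expected:
            matches.append(doc_id)
    return matches


save("a", {"title": "Intro", "author": {"name": "Ada"}, "tags": ["nosql"]})
save("b", {"title": "Graphs", "author": {"name": "Linus"}})
```

Unlike a plain key-value store, the database understands the structure of the value, so `find("author.name", "Ada")` can look inside the documents even though the two documents do not share the same fields.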

Graph Based


• Store entities and relationships between these entities as nodes and edges of a graph respectively. Entities have properties.

• Traversing the relationships is very fast because the relationship between nodes is not calculated at query time but is actually persisted as a relationship.

Example:

Neo4j, InfiniteGraph, OrientDB, FlockDB.

• It is well suited for connected data, such as social networks, spatial data, routing information for goods and supply.
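
To illustrate why traversals are cheap when relationships are stored rather than computed, here is a small in-memory sketch. The `Graph` class, the `within` method and the sample people are assumptions made for the example; it is not the API of any graph database named above.

```python
from collections import deque


class Graph:
    """Toy property graph with undirected edges stored per node."""

    def __init__(self):
        self.properties = {}  # node -> properties of the entity
        self.edges = {}       # node -> set of neighbours (persisted, not computed)

    def add_node(self, node, **props):
        self.properties[node] = props
        self.edges.setdefault(node, set())

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def within(self, start, max_hops):
        """Breadth-first traversal: all nodes reachable in at most max_hops steps."""
        hops = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if hops[node] == max_hops:
                continue
            for neighbour in self.edges[node]:
                if neighbour not in hops:
                    hops[neighbour] = hops[node] + 1
                    queue.append(neighbour)
        return {node for node in hops if node != start}


social = Graph()
for person in ("alice", "ben", "cathy", "dave"):
    social.add_node(person, kind="person")
social.add_edge("alice", "ben")
social.add_edge("ben", "cathy")
social.add_edge("cathy", "dave")
friends_of_friends = social.within("alice", max_hops=2)
```

Each hop is a direct lookup of stored neighbours, so a friends-of-friends query follows edges instead of joining tables.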

Top 10 NoSQL Databases with Data Models

Common Advantages

• Cheap, easy to implement (open source)

• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

▫ Down nodes easily replaced

▫ No single point of failure

• Easy to distribute

• Don't require a schema

• Can scale up and down

• Relax the data consistency requirement (CAP)

What is not provided by NoSQL


• Joins

• Group by

• ACID transactions

• SQL

• Integration with applications that are based on SQL

Some Statistics


• Facebook Search

• MySQL > 50 GB Data

▫ Writes Average : ~300 ms

▫ Reads Average : ~350 ms

• Rewritten with Cassandra > 50 GB Data

▫ Writes Average : 0.12 ms

▫ Reads Average : 15 ms

Don’t forget about the DBA


• It does not matter if the data is deployed on a NoSQL platform instead of an RDBMS.

• Still need to address:

▫ Backups & recovery

▫ Capacity planning

▫ Performance monitoring

▫ Data integration

▫ Tuning & optimization

• What happens when things don't work as expected and nodes are out of sync, or data corruption occurs at 2 am?

• Who you gonna call?

▫ DBA and SysAdmin need to be on board

NoSQL vs. SQL Summary

NoSQL vs. SQL Summary Features

Visual Guide to NoSQL Systems


Most popular DBMS


Trend Popularity


Ranking of Key-value Stores


Ranking of Document Stores


Ranking of Graph DBMS


Ranking scores per category in percent, July 2018

Popularity changes per category, July 2018

Critical Reception?

• Some people consider that companies do not miss anything if they do not switch to NoSQL databases, as long as a relational DBMS does its job.

• Blinding performance depends on removing overheads in database systems (logging/locking/latching/buffer management).

• Some critics see NoSQL databases as nothing new compared to other attempts, such as object databases, which have been around for decades.

Summary
