Intro to Big Data - cibtrc.com

NoSQL Database

By: Shahab Safaee & Morteza Zahedi

Software Engineering PhD

Email: [email protected] , [email protected]

cibtrc.ir


Agenda

• Some history

• Relational databases

• ACID Properties

• Scaling Up

• Distributed Database Systems

• CAP Theorem

• What is NoSQL?

• BASE Transactions

• NoSQL Types

• Some Statistics

• NoSQL vs. SQL Summary

A brief history of databases


Relational databases

• Benefits of relational databases:

▫ Designed for all purposes

▫ ACID

▫ Strong consistency, concurrency, recovery

▫ Mathematical background

▫ Standard Query language (SQL)

▫ Lots of tools to use with it, e.g. reporting services, entity frameworks, ...

▫ Vertical scaling (up scaling)

ACID Properties

• Atomic:

▫ All of the work in a transaction completes (commit) or none of it completes.

▫ All operations in a transaction succeed or every operation is rolled back.

• Consistent:

▫ A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.

▫ On the completion of a transaction, the database is structurally sound.

• Isolated:

▫ The results of any changes made during a transaction are not visible until the transaction has committed.

▫ Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.

• Durable:

▫ The results of a committed transaction survive failures.

▫ The results of applying a transaction are permanent, even in the presence of failures.
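Atomicity and consistency can be illustrated with a small sketch. The example below assumes Python's built-in sqlite3 module and a hypothetical accounts table with a CHECK constraint; the table, names, and amounts are made up for illustration and are not part of the original slides.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; returns True only if the whole transfer commits."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            # Credit first, so a failing debit leaves a partial change to undo.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint was violated; nothing was applied

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates succeed and commit
failed = transfer(conn, "bob", "alice", 500)  # debit breaks the constraint, credit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to alice is applied first and then undone by the rollback, so the database moves only between consistent states.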

Era of Distributed Computing


But...

• Relational databases were not built for distributed applications.

Because...

• Joins are expensive

• Hard to scale horizontally

• Impedance mismatch occurs

• Expensive (product cost, hardware, maintenance)

And it is weak in:

• Speed (performance)

• High availability

• Partition tolerance

Scaling Up


• Issues with scaling up when the dataset is just too big

• RDBMS were not designed to be distributed

• Began to look at multi-node database solutions

• Known as "scaling out" or "horizontal scaling"

• Different approaches include:

▫ Master-slave

▫ Sharding

Scaling RDBMS – Master/Slave


• Master-Slave

▫ All writes are written to the master. All reads are performed against the replicated slave databases

▫ Critical reads may be incorrect as writes may not have been propagated down

▫ Large data sets can pose problems as master needs to duplicate data to slaves
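The stale-read problem above can be sketched in a few lines. This is a toy in-memory model, not a real replication protocol: the Master and Replica classes, the log-shipping approach, and the key name are all assumptions made for illustration.

```python
class Master:
    """Accepts all writes and keeps an ordered log for replicas to replay."""
    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads; only sees writes after it has replayed the master's log."""
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def read(self, key):
        return self.data.get(key)

    def catch_up(self, master):
        for key, value in master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(master.log)


master, replica = Master(), Replica()
master.write("stock:42", 0)
stale = replica.read("stock:42")  # write not propagated yet, replica has no value
replica.catch_up(master)
fresh = replica.read("stock:42")  # replica now agrees with the master
```

Between the write and catch_up, a critical read served by the replica returns out-of-date data, which is the risk named in the slide.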

Scaling RDBMS - Sharding


• Partition or sharding

▫ Scales well for both reads and writes

▫ Not transparent, application needs to be partition-aware

▫ Can no longer have relationships/joins across partitions

▫ Loss of referential integrity across shards

Sharding Advantages

• Tables are divided and distributed into multiple servers

• Reduces index size, which generally improves search performance

• A database shard can be placed on separate hardware, greatly improving performance

• If the database shard is based on some real-world segmentation of the data, then it may be possible to infer the appropriate shard membership easily and automatically
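As a rough sketch of how an application might pick a shard, the snippet below shows two routing strategies: hashing the key, and inferring the shard from a real-world segmentation (here, a customer's region). The shard names, the region mapping, and the use of MD5 are assumptions for illustration only.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical server names

def shard_by_hash(key: str) -> str:
    """Spread keys evenly: the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Hypothetical real-world segmentation: one shard per sales region.
REGION_TO_SHARD = {"EU": "shard-0", "US": "shard-1", "ASIA": "shard-2"}

def shard_by_region(customer: dict) -> str:
    """Infer shard membership directly from an attribute of the record."""
    return REGION_TO_SHARD[customer["region"]]

hashed = shard_by_hash("user:42")
regional = shard_by_region({"id": 1, "region": "EU"})
```

Note that the application has to call one of these functions before every query, which is what "partition-aware" means in practice.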


https://en.wikipedia.org/wiki/Index_(database)

Other ways to scale RDBMS


• Multi-Master replication

• INSERT only, no UPDATEs/DELETEs

• No JOINs, thereby reducing query time

▫ This involves de-normalizing data

• In-memory databases

What do we need?

We need a distributed database system with the following features:

• High concurrency

• High availability

• Fault tolerance

• High scalability

• Low latency

• Efficient storage

• Reduced management and operational cost

Which is impossible, according to the CAP theorem!

Distributed Database Systems


• Data is stored across several sites that share no physical component.

• Systems that run on each site are independent of each other.

• Appears to user as a single system.

Distributed Data Storage


• Partitioning:

▫ Data is partitioned into several fragments and stored at different sites.

▫ Horizontal – by rows.

▫ Vertical – by columns.

• Replication:

▫ The system maintains multiple copies of data, stored at different sites.

Replication and partitioning can be combined!
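The difference between horizontal and vertical partitioning can be sketched on a small in-memory table. The customer records, the modulo placement rule, and the column groups below are made-up assumptions, chosen only to make the two splits visible.

```python
customers = [
    {"id": 1, "name": "Ana", "city": "Tehran", "balance": 120},
    {"id": 2, "name": "Ben", "city": "Shiraz", "balance": 75},
    {"id": 3, "name": "Cy", "city": "Tabriz", "balance": 310},
    {"id": 4, "name": "Dee", "city": "Tehran", "balance": 5},
]

def horizontal_partition(rows, n_sites, key="id"):
    """Split by rows: each site holds complete records for a subset of keys."""
    sites = [[] for _ in range(n_sites)]
    for row in rows:
        sites[row[key] % n_sites].append(row)
    return sites

def vertical_partition(rows, column_groups, key="id"):
    """Split by columns: each site holds all rows, but only some columns plus the key."""
    return [
        [{key: row[key], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]

by_rows = horizontal_partition(customers, 2)
by_cols = vertical_partition(customers, [["name", "city"], ["balance"]])
```

Every vertical fragment keeps the key column so that the original record can be reassembled by joining the fragments on it.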

Partitioning


• Locality of reference – data is most likely to be updated and queried locally.

Replication

• Pros – Increased availability of data and faster query evaluation.

• Cons – Increased cost of updates and complexity of concurrency control.

CAP Theorem

• In 2000, UC Berkeley researcher Eric Brewer presented his now-foundational CAP theorem (consistency, availability and partition tolerance), first as a conjecture that Seth Gilbert and Nancy Lynch formally proved in 2002.

• It states that it is impossible for a distributed computer system to simultaneously provide all three CAP guarantees.

• In May 2012, Brewer clarified some of his positions on the oft-used “two out of three” concept.

http://en.wikipedia.org/wiki/CAP_theorem

CAP Theorem

• Consistency:

▫ all nodes see the same data at the same time

• Availability:

▫ a guarantee that every request receives a response about whether it was successful or failed

• Partition tolerance:

▫ the system continues to operate despite arbitrary message loss or failure of part of the system

CAP Theorem


CAP – 2 of 3: Consistency + Availability (CA)

• If there are no partitions, it is clearly possible to provide consistent, available data (e.g. read-any write-all).

• Examples:

▫ RDBMSs

CAP – 2 of 3: Consistency + Partition tolerance (CP)

• Trivial:

▫ The trivial system that ignores all requests meets these requirements.

• Best-effort availability:

▫ Read-any write-all systems will become unavailable only when messages are lost.

• Examples: ▫ Distributed database systems, BigTable

CAP – 2 of 3: Availability + Partition tolerance (AP)

• Trivial:

▫ The service can trivially return the initial value in response to every request.

• Best-effort consistency:

▫ A quorum-based system, modified to time out lost messages, will only return inconsistent (and, in particular, stale) data when messages are lost.

• Examples:

▫ Web caches, Dynamo
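To make the quorum idea concrete, here is a simplified model with N replicas, where a write reaches W of them and a read consults R of them. The overlap rule R + W > N is the usual quorum condition and is an assumption brought in here, not something stated on the slide; failures and time-outs are not modelled.

```python
from itertools import combinations

def write(replicas, targets, version, value):
    """Store a (version, value) pair on the chosen replicas only."""
    for i in targets:
        replicas[i] = (version, value)

def read(replicas, targets):
    """Ask the chosen replicas and keep the answer with the highest version."""
    return max(replicas[i] for i in targets)

def always_fresh(n, w, r):
    """Check every possible write set and read set: does a read always see the new value?"""
    for write_set in combinations(range(n), w):
        replicas = [(0, "old")] * n
        write(replicas, write_set, 1, "new")
        for read_set in combinations(range(n), r):
            if read(replicas, read_set)[1] != "new":
                return False  # some read set missed every updated replica
    return True

overlapping = always_fresh(3, 2, 2)  # R + W > N: read and write sets must share a replica
disjoint = always_fresh(3, 1, 1)     # R + W <= N: a read can land on a stale replica
```

With small quorums the system answers faster and stays available more easily, at the price of possibly stale reads, which matches the "best-effort consistency" label.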

What is NoSQL?


• Stands for Not Only SQL

▫ The term was first used by Carlo Strozzi and later redefined by Eric Evans.

• Class of non-relational data storage systems

• Usually do not require a fixed table schema nor do they use the concept of joins

• Most NoSQL offerings relax one or more of the ACID properties (a trade-off motivated by the CAP theorem)

NoSQL Definition


From www.nosql-database.org:

• Next Generation Databases mostly addressing some of the points: ▫ being non-relational

▫ distributed

▫ open-source

▫ horizontally scalable.

• The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly.

• Often more characteristics apply as: ▫ schema-free

▫ easy replication support

▫ simple API

▫ eventually consistent / BASE (not ACID)

▫ a huge data amount, and more.

NoSQL common concepts

• Map-reduce

• Sharding

• MVCC

• Consistent hashing

• Vector clocks


Consistent hashing

• The idea behind consistent hashing is to use the same hash

function for both the object hashing and the node hashing.
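A minimal sketch of that idea follows, assuming MD5 as the shared hash function and hypothetical node names (node-a, node-b, ...). Real systems usually add virtual nodes and replication, which are left out here.

```python
import bisect
import hashlib

def _hash(name: str) -> int:
    """One hash function for both objects and nodes, as the slide describes."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._ring, (_hash(node), node))

    def remove(self, node):
        self._ring.remove((_hash(node), node))

    def lookup(self, key):
        """Walk clockwise from the key's position to the first node at or after it."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around past the last node

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(100)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the design is that adding a node only takes keys from its clockwise neighbour: every key in moved now lives on node-d, and all other keys stay where they were. With plain modulo hashing, most keys would change servers instead.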


Vector clocks

• Alice, Ben, Cathy, and Dave are planning to meet next week

date = Wednesday, vclock = Alice:1

date = Tuesday, vclock = Alice:1, Ben:1, Dave:1

date = Thursday, vclock = Alice:1, Ben:1, Cathy:1, Dave:2
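The three clock values above can be reproduced with a small sketch. The slide shows only the resulting states, so the intermediate steps below (Dave proposing Tuesday, Cathy proposing Thursday from Alice's original message, Dave resolving the conflict) are an assumed reconstruction, and the helper names are hypothetical.

```python
def increment(clock, actor):
    """Return a copy of the clock with the actor's counter advanced by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

def merge(a, b):
    """Combine two clocks by taking the larger counter for every actor."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}

def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def concurrent(a, b):
    """Neither clock descends from the other, so the two versions conflict."""
    return not descends(a, b) and not descends(b, a)

wednesday = increment({}, "Alice")       # Alice:1
tuesday = increment(wednesday, "Dave")   # assumed step: Alice:1, Dave:1
tuesday = increment(tuesday, "Ben")      # Alice:1, Ben:1, Dave:1
thursday = increment(wednesday, "Cathy") # assumed step: Alice:1, Cathy:1
conflict = concurrent(tuesday, thursday) # Cathy never saw the Tuesday proposal
resolved = increment(merge(tuesday, thursday), "Dave")  # Dave settles on Thursday
```

Because the resolved clock descends from both conflicting versions, every participant can tell that Thursday supersedes the earlier proposals without relying on wall-clock timestamps.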

NoSQL Distinguishing Characteristics

• Large data volumes

▫ Google's "big data"

• Scalable replication and distribution

▫ Potentially thousands of machines

▫ Potentially distributed around the world

• Queries need to return answers quickly

• Mostly query, few updates

• Asynchronous inserts and updates

• Schema-less

• ACID transaction properties are not needed – BASE

• Open source development

• Weaker concurrency model than ACID

• Efficient use of distributed indexes and RAM

BASE Transactions

• Acronym contrived to be the opposite of ACID

▫ Basically Available: the database appears to work most of the time (through replication and sharding mechanisms).

▫ Soft state: stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time. Guaranteeing consistency is left to the application developer.

▫ Eventually Consistent: stores exhibit consistency at some later point (e.g., lazily at read time). Consistency is guaranteed only at some undefined future time.

• Characteristics

▫ Weak consistency

▫ Availability first

▫ Optimistic

▫ Simpler and faster

BASE vs ACID

• ACID:

▫ Strong consistency

▫ Less availability

▫ Pessimistic concurrency

▫ Complex

• BASE:

▫ Availability is the most important thing

▫ Weaker consistency (eventual)

▫ Simple and fast

▫ Optimistic

CAP Theorem with ACID and BASE Visualized

How did we get here?


• Explosion of social media sites (Facebook, Twitter) with large data needs

• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)

• Just as developers moved to dynamically-typed languages (Ruby/Groovy), there was a shift to dynamically-typed data with frequent schema changes

• Open-source community

NoSQL Types

NoSQL databases are classified into four types:

• Key Value pair based

• Column based

• Document based

• Graph based

Key Value Pair Based

• Designed for processing dictionaries. Dictionaries contain a collection of records having fields containing data.

• Records are stored and retrieved using a key that uniquely identifies the record and is used to quickly find the data within the database.

Example:

Riak, Oracle NoSQL Database, Amazon DynamoDB, etc.

• We use it for storing session information, user profiles, shopping cart data.

• We would avoid it when we need to query data having relationships between entities.

Column based

• Stores data as column families containing rows that have many columns associated with a row key. Each row can have different columns.

• Column families are groups of related data that are accessed together.

Example:

Cassandra, HBase, and Hypertable.

• We use it for content management systems, blogging platforms, log aggregation.

• We would avoid it for systems that are in early development or have changing query patterns.

Document Based


• The database stores and retrieves documents. It stores documents in the value part of the key-value store.

• Self-describing, hierarchical tree data structures consisting of maps, collections, and scalar values.

Example:

Lotus Notes, MongoDB, CouchDB, OrientDB, RavenDB.

• We use it for content management systems, blogging platforms, web analytics, real-time analytics, e-commerce applications.

• We would avoid it for systems that need complex transactions spanning multiple operations or queries against varying aggregate structures.
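A document in this sense can be pictured as a nested structure stored under a key. The sketch below uses plain Python dicts and a hypothetical find helper to stand in for a document store; the blog-post documents and field names are invented for illustration, and no real database API is implied.

```python
import json

# Each value is a self-describing document: maps, lists, and scalar values.
posts = {
    "post:1": {
        "title": "Intro to NoSQL",
        "tags": ["nosql", "cap"],
        "author": {"name": "Sara"},
        "comments": [{"user": "Ali", "text": "Nice overview"}],
    },
    "post:2": {
        "title": "Scaling out",
        "tags": ["sharding"],
        "author": {"name": "Reza"},  # no "comments" field: documents need not share a schema
    },
}

def find(collection, predicate):
    """Return the keys of all documents for which the predicate holds."""
    return [doc_id for doc_id, doc in collection.items() if predicate(doc)]

tagged = find(posts, lambda doc: "nosql" in doc.get("tags", []))
with_comments = find(posts, lambda doc: bool(doc.get("comments")))
round_trip = json.loads(json.dumps(posts))  # documents serialize naturally to JSON
```

Unlike a pure key-value store, the store can look inside the value, which is why queries on fields such as tags are possible.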

Graph Based


• Store entities and relationships between these entities as nodes and edges of a graph respectively. Entities have properties.

• Traversing the relationships is very fast as relationship between nodes is not calculated at query time but is actually persisted as a relationship.

Example:

Neo4j, InfiniteGraph, OrientDB, FlockDB.

• It is well suited for connected data, such as social networks, spatial data, routing information for goods and supply.
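The reason traversal is cheap can be sketched with an adjacency list, where each node already stores its outgoing relationships. The tiny "follows" graph and the within_hops helper below are assumptions for illustration, not the API of any graph database.

```python
from collections import deque

# Relationships are stored with the node, so no join is computed at query time.
follows = {
    "alice": ["ben", "cathy"],
    "ben": ["dave"],
    "cathy": ["dave"],
    "dave": [],
}

def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map each reachable node to its distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance

direct = within_hops(follows, "alice", 1)
friends_of_friends = within_hops(follows, "alice", 2)
```

In a relational schema the same "friends of friends" question would need one self-join per hop, which is where the graph model pays off for connected data.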

Top 10 of NoSQL DB with Data Models

Common Advantages

• Cheap, easy to implement (open source)

• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

▫ Down nodes easily replaced

▫ No single point of failure

• Easy to distribute

• Don't require a schema

• Can scale up and down

• Relax the data consistency requirement (CAP)

What is not provided by NoSQL


• Joins

• Group by

• ACID transactions

• SQL

• Integration with applications that are based on SQL

Some Statistics


• Facebook Search

• MySQL > 50 GB Data

▫ Writes Average : ~300 ms

▫ Reads Average : ~350 ms

• Rewritten with Cassandra > 50 GB Data

▫ Writes Average : 0.12 ms

▫ Reads Average : 15 ms

Don’t forget about the DBA


• It does not matter if the data is deployed on a NoSQL platform instead of an RDBMS.

• Still need to address: ▫ Backups & recovery

▫ Capacity planning

▫ Performance monitoring

▫ Data integration

▫ Tuning & optimization

• What happens when things don't work as expected and nodes are out of sync, or data corruption occurs at 2 am?

• Who you gonna call?

▫ DBA and SysAdmin need to be on board

NoSQL vs. SQL Summary

NoSQL vs. SQL Summary Features

Visual Guide to NoSQL Systems
