Intro to Big Data - cibtrc.com€¦ · ACID Theorem 5 • Atomic: All of the work in a transaction...

NoSQL Database

By: Shahab Safaee & Morteza Zahedi

Software Engineering PhD

Email: safaee.shx@gmail.com , morteza.zahedi.a@gmail.com

cibtrc.ir

cibtrc

Agenda • Some history

• Relational databases

• ACID Theorem

• Scaling Up

• Distributed Database Systems

• CAP Theorem

• What is NoSQL?

• BASE Transactions

• NoSQL Types

• Some Statistics

• NoSQL vs. SQL Summery

A brief history of databases

Relational databases

• Benefits of Relational databases: ▫ Designed for all purposes

▫ ACID

▫ Strong consistency, concurrency, recovery

▫ Mathematical background

▫ Standard Query language (SQL)

▫ Lots of tools to use with i.e: Reporting, services, entity frameworks, ...

▫ Vertical scaling (up scaling)

ACID Theorem

• Atomic: ▫ All of the work in a transaction completes (commit) or none of it completes ▫ All operations in a transaction succeed or every operation is rolled back.

• Consistent: ▫ A transaction transforms the database from one consistent state to another

consistent state. Consistency is defined in terms of constraints. ▫ On the completion of a transaction, the database is structurally sound.

• Isolated: ▫ The results of any changes made during a transaction are not visible until the

transaction has committed. ▫ Transactions do not contend with one another. Contentious access to data is

moderated by the database so that transactions appear to run sequentially.

• Durable: ▫ The results of a committed transaction survive failures ▫ The results of applying a transaction are permanent, even in the presence of

failures.

Era of Distributed Computing

But...

• Relational databases were not built for distributed applications.

Because...

• Joins are expensive

• Hard to scale horizontally

• Impedance mismatch occurs

• Expensive (product cost, hardware , Maintenance)

Era of Distributed Computing

But... • Relational databases were not built for

distributed applications.

Because... • Joins are expensive • Hard to scale horizontally • Impedance mismatch occurs • Expensive (product cost, hardware

, Maintenance) And … It‟s weak in: • Speed (performance) • High availability • Partition tolerance

Scaling Up

• Issues with scaling up when the dataset is just too big

• RDBMS were not designed to be distributed

• Began to look at multi-node database solutions

• Known as „scaling out‟ or „horizontal scaling‟

• Different approaches include:

▫ Master-slave

▫ Sharding

Scaling RDBMS – Master/Slave

• Master-Slave

▫ All writes are written to the master. All reads performed against the replicated slave databases

▫ Critical reads may be incorrect as writes may not have been propagated down

▫ Large data sets can pose problems as master needs to duplicate data to slaves

Scaling RDBMS - Sharding

• Partition or sharding

▫ Scales well for both reads and writes

▫ Not transparent, application needs to be partition-aware

▫ Can no longer have relationships/joins across partitions

▫ Loss of referential integrity across shards

Sharding Advantages

• Tables are divided and distributed into multiple servers

• Reduces index size, which generally improves search performance

• A database shard can be placed on separate hardware

• greatly improving performance

• if the database shard is based on some real-world segmentation of the data then it may be possible to infer the appropriate shard membership easily and automatically

Other ways to scale RDBMS

• Multi-Master replication

• INSERT only, not UPDATES/DELETES

• No JOINs, thereby reducing query time

▫ This involves de-normalizing data

• In-memory databases

What we need?

We need a distributed database system having such features: • High Concurrency • High Availability • Fault tolerance • High Scalability • Low latency • Efficient Storage • Reduce Manage and Operation Cost

Which is impossible!!!

According to CAP theorem

Distributed Database Systems

• Data is stored across several sites that share no physical component.

• Systems that run on each site are independent of each other.

• Appears to user as a single system.

Distributed Data Storage

• Partitioning :

▫ Data is partitioned into several fragments and stored in different sites.

▫ Horizontal – by rows.

▫ Vertical – by columns.

• Replication : ▫ System maintains multiple

copies of data, stored in different sites.

Replication and Partitioning can be combined !

Partitioning

• Locality of reference – data is most likely to be updated and queried locally.

Replication

• Pros – Increased availability of data and faster query evaluation. • Cons – Increased cost of updates and complexity of concurrency

control.

CAP Theorem

• In 2000, Berkeley, CA, researcher Eric Brewer published his now foundational CAP Theorem

▫ (consistency, availability and partition tolerance)

• which states that it is impossible for a distributed

computer system to simultaneously provide all three CAP guarantees.

• In May 2012, Brewer clarified some of his positions on the oft-used “two out of three” concept.

CAP Theorem

• Consistency: ▫ all nodes see the same data at the same time

• Availability:

▫ a guarantee that every request receives a response about whether it was successful or failed

• Partition tolerance: ▫ the system continues to operate despite arbitrary message loss or

failure of part of the system

CAP Theorem

CAP – 2 of 3

• If there are no partitions, it is clearly possible to provide consistent, available data (e.g. read-any write-all).Best-effort availability:

• Examples: ▫ RDBMs

CAP – 2 of 3

• Trivial:

▫ The trivial system that ignores all requests meets these requirements.

• Best-effort availability:

▫ Read-any write-all systems will become unavailable only when messages are lost.

• Examples: ▫ Distributed database systems, BigTable

CAP – 2 of 3

• Trivial:

▫ The service can trivially return the initial value in response to every request.

• Best-effort consistency:

▫ Quorum-based system, modified to time-out lost messages, will only return inconsistent(and, in particular, stale) data when messages are lost.

• Examples: ▫ Web cashes, Dynamo

What is NoSQL?

• Stands for Not Only SQL

▫ Term was redefined by Eric Evans after Carlo Strozzi.

• Class of non-relational data storage systems

• Usually do not require a fixed table schema nor do they use the concept of joins

• All NoSQL offerings relax one or more of the ACID properties (Based on the CAP theorem)

NoSQL Definition

From www.nosql-database.org:

• Next Generation Databases mostly addressing some of the points: ▫ being non-relational

▫ distributed

▫ open-source

▫ horizontal scalable.

• The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly.

• Often more characteristics apply as: ▫ schema-free

▫ easy replication support

▫ simple API

▫ eventually consistent / BASE (not ACID)

▫ a huge data amount, and more.

NOSQL Common concepts

• Map-reduce

• Sharding

• MVCC

• Consistent hashing

• Vector clocks

Consistent hashing

• The idea behind consistent hashing is to use the same hash

function for both the object hashing and the node hashing.

Vector clocks • Alice, Ben, Cathy, and Dave are planning to meet next week

date = Wednesday vclock = Alice:1

date = Tuesday vclock = Alice:1, Ben:1, Dave:1

date = Thursday vclock = Alice:1, Ben:1, Cathy:1, Dave:2

NoSQL Distinguishing

Characteristics • Large data volumes

▫ Google‟s “big data” • Scalable replication and distribution

▫ Potentially thousands of machines ▫ Potentially distributed around the world

• Queries need to return answers quickly • Mostly query, few updates • Asynchronous Inserts & Updates • Schema-less • ACID transaction properties are not needed – BASE • Open source development • Weaker Concurrency Model than ACID • Efficient use of distributed indexes and RAM

BASE Transactions

• Acronym contrived to be the opposite of ACID ▫ Basically Available

The database appears to work most of the time (Replication and Sharding Mechanisms).

▫ Soft state Stores don‟t have to be write-consistent, nor do different replicas have to

be mutually consistent all the time. Consistency guaranty with Application Developer.

▫ Eventually Consistent Stores exhibit consistency at some later point (e.g., lazily at read time). Guaranties consistency only at undefended future time.

• Characteristics ▫ Weak consistency ▫ Availability first ▫ Optimistic ▫ Simpler and faster

BASE vs ACID

• ACID: ▫ Strong Consistency ▫ Less Availability ▫ Pessimistic Concurrency ▫ Complex

• BASE ▫ Availability is the most important thing. ▫ Weaker consistency (Eventual) ▫ Simple and Fast ▫ Optimistic

CAP Theorem with ACID and BASE

Visualized

How did we get here?

• Explosion of social media sites (Facebook, Twitter) with large data needs

• Rise of cloud-based solutions such as Amazon S3 (simple storage solution)

• Just as moving to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes

• Open-source community

NoSQL Types

No SQL database are classified into four types:

• Key Value pair based

• Column based

• Document based

• Graph based

Key Value Pair Based

• Designed for processing dictionary. Dictionaries contain a collection of records having fields containing data.

• Records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data with in the database.

Example:

CouchDB, Oracle NoSQL Database, Riak etc.

• We use it for storing session information, user profiles, shopping cart data.

• We would avoid it when we need to query data having relationships between entities.

Column based

• It store data as Column families containing rows that have many columns associated with a row key. Each row can have different columns.

• Column families are groups of related data that is accessed together.

Example:

Cassandra, HBase, Hypertable, and Amazon DynamoDB.

• We use it for content management systems, blogging platforms, log aggregation.

• We would avoid it for systems that are in early development, changing query patterns.

Document Based

• The database stores and retrieves documents. It stores documents in the value part of the key-value store.

• Self-describing, hierarchical tree data structures consisting of maps, collections, and scalar values.

Example:

LotusNotes, MongoDB, CouchDB, OrientDB, RavenDB.

• We use it for content management systems, blogging platforms, web analytics, real-time analytics, e-commerce applications.

• We would avoid it for systems that need complex transactions spanning multiple operations or queries against varying aggregate structures.

Graph Based

• Store entities and relationships between these entities as nodes and edges of a graph respectively. Entities have properties.

• Traversing the relationships is very fast as relationship between nodes is not calculated at query time but is actually persisted as a relationship.

Example:

Neo4J, Infinite Graph, OrientDB, FlockDB.

• It is well suited for connected data, such as social networks, spatial data, routing information for goods and supply.

Top 10 of NoSQL DB with Data

Models

Common Advantages

• Cheap, easy to implement (open source) • Data are replicated to multiple nodes (therefore

identical and fault-tolerant) and can be partitioned ▫ Down nodes easily replaced

▫ No single point of failure

• Easy to distribute • Don't require a schema • Can scale up and down • Relax the data consistency requirement (CAP)

What is not provided by NoSQL

• Joins

• Group by

• ACID transactions

• SQL

• Integration with applications that are based on SQL

Some Statistics

• Facebook Search

• MySQL > 50 GB Data

▫ Writes Average : ~300 ms

▫ Reads Average : ~350 ms

• Rewritten with Cassandra > 50 GB Data

▫ Writes Average : 0.12 ms

▫ Reads Average : 15 ms

Don’t forget about the DBA

• It does not matter if the data is deployed on a NoSQL platform instead of an RDBMS.

• Still need to address: ▫ Backups & recovery

▫ Capacity planning

▫ Performance monitoring

▫ Data integration

▫ Tuning & optimization

• What happens when things don‟t work as expected and nodes are out of sync or you have a data corruption occurring at 2am?

• Who you gonna call? ▫ DBA and SysAdmin need to be on board

NoSQL vs. SQL Summery

NoSQL vs. SQL Summery Features

Visual Guide to NoSQL Systems

Intro to Big Data - cibtrc.com€¦ · ACID Theorem 5 • Atomic: All of the work in a transaction...

Documents

Transcript of Intro to Big Data - cibtrc.com€¦ · ACID Theorem 5 • Atomic: All of the work in a transaction...

All SAP CRM Transaction Codes

TRANSACTION COORDINATOR AGREEMENT SIMPLICLOSE(sm · TRANSACTION COORDINATOR AGREEMENT 5. Representations and Warranties. Simpliclose represents and warrants all work will be done

Business papers that show the nature of a transaction and provide all the information necessary to account for the transaction properly.

Transaction Management - Simon Fraser University · Transaction Management Concurrency Control (3) ... In every transaction, all lock actions precede all unlock actions. Growing phase:

Platon Oeuvres Completes

Foreign Exchange Transaction PDS - Westpac · Foreign Exchange Transaction ... Scenario 1 - Foreign Currency Payment ... Transactions for all such Value Dates (transactions with

DEALER SOLUTIONS COMPLETES ANOTHER TRANSACTION!files.constantcontact.com/5d4a732b601/e57fd550-fd8... · DEALER SOLUTIONS COMPLETES ANOTHER TRANSACTION! President/C.E.O. of Dealer

oeuvres completes

Transaction History - All Accounts view

COBRE COMPLETES ACQUISITION OF 100% OWNERSHIP OF … · Answer this question if your response to Q2.1 is “Being issued as part of a transaction or transactions previously announced

fitxes completes

Note: All transaction charges are subject to 15% ... - Centenary Bank G… · Note: All transaction charges are subject to 15% excise duty. TIERS CASH DEPOSIT CASH WITHDRAWAL BILL

Drug Transaction Data Reporting...4 of 8 Drug Transaction Data Reporting View transaction data Once you select an invoice number, a transaction report will appear showing all DSCSA-eligible

Blockchain Transaction Processing - ExpoLabBlockchain Transaction Processing 5 Blockchain Consensus The common denominator for all the blockchain applications is the underlying distributed

RSD All Market Segments_Insurer Confirmation and Transaction Diary_v1

Apollo and Athene to Merge in All Stock Transaction

HERMES completes HERMES completes …HERMES Hotspot Ecosystem Research on the Margins of European Seas News update Issue 9 Summer 2007 HERMES completes HERMES completes ... scientific

Fbm 2015 completes catalog

All Transaction Codes

Terminal Payment Services · the issuing bank all take a share of the transaction charge that is billed to the merchant to process a transaction typically all billing is performed