Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Presented by Athiq Ahamed and Supriya

Transcript of Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Page 1: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

WP-Benchmarking Top NoSQL Databases

Apache Cassandra, Apache HBase and MongoDB

Presented by Athiq Ahamed and Supriya

Page 2: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Introduction

Enormous amounts of data (Big Data)

Scalability issues in RDBMS

Rise of NoSQL databases

Amazon Dynamo

Google Bigtable

CAP theorem

BASE systems

Page 3: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

CAP Theorem

Consistency

Availability

Partition tolerance

The CAP theorem states that a distributed system can guarantee at most two of these three properties at the same time.

Page 4: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

RDBMS vs NoSQL

RDBMS | NoSQL

Supports a powerful query language | Supports a very simple query language

Has a fixed schema | No fixed schema

Follows ACID (Atomicity, Consistency, Isolation and Durability) | Often only eventually consistent

Supports transactions | Often offers limited or no support for transactions

Content: tutorialspoint.com

Page 5: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

BASE

Basically available: The system guarantees availability, in the sense of the CAP theorem.

Soft state: The state of the system may change over time, even without new input, because of the eventual consistency model.

Eventual consistency: The system will become consistent over time.

Content: www.edureka.in

Page 6: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Selection of NoSQL

Fast performance is the key.

Proof-of-concept (POC) processes should include the right benchmarks:

Configurations

Parameters

Workloads

Making the right choice!

Page 7: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Benchmark configuration

Yahoo Cloud Serving Benchmark (YCSB)

Top 3 NoSQL databases: Apache Cassandra, Apache HBase and MongoDB

Amazon Web Services EC2 instances hosted the tests

Tests were performed 3 times on 3 different days

Page 8: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Benchmark configuration

The tests ran on large instances (15 GB RAM and 4 CPU cores).

Instances used a customized Ubuntu image with Oracle Java 1.6 installed as a base.

A custom script was written to drive the benchmark processes.
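
The driver script itself is not shown in the slides. A minimal sketch of what such a script might do is given below, assuming the standard YCSB launcher (`bin/ycsb`) with its `-P`, `-p` and `-threads` options. The binding names, workload file, thread count and property values are placeholders, since YCSB binding names differ between versions. The sketch only builds the command lines and does not execute anything.

```python
from typing import Dict, List, Tuple


def build_ycsb_command(phase: str, binding: str, workload_file: str,
                       threads: int, extra_props: Dict[str, object]) -> List[str]:
    """Build one YCSB command line as an argument list (nothing is executed)."""
    if phase not in ("load", "run"):
        raise ValueError("phase must be 'load' or 'run'")
    cmd = ["bin/ycsb", phase, binding, "-P", workload_file, "-threads", str(threads)]
    # Extra properties are passed as repeated "-p key=value" pairs.
    for key, value in sorted(extra_props.items()):
        cmd += ["-p", f"{key}={value}"]
    return cmd


def plan_runs(bindings: List[str], workload_file: str, threads: int,
              repetitions: int = 3) -> List[Tuple[int, List[str]]]:
    """Plan a load phase and a run phase per database, repeated several times."""
    plan = []
    for repetition in range(1, repetitions + 1):
        for binding in bindings:
            for phase in ("load", "run"):
                cmd = build_ycsb_command(phase, binding, workload_file, threads, {})
                plan.append((repetition, cmd))
    return plan


if __name__ == "__main__":
    # Hypothetical binding names; check the names shipped with your YCSB version.
    for repetition, cmd in plan_runs(["cassandra-cql", "hbase", "mongodb"],
                                     "workloads/workloada", threads=8):
        print(repetition, " ".join(cmd))
```

Each planned command could then be handed to `subprocess.run` on the client machine, once per test day.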

Page 9: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Understanding NoSQL Databases

Each NoSQL system performs differently.

The differences come from their components and internal workings.

Apache Cassandra: Columnar (wide-column) database model

Apache HBase: Columnar (wide-column) database model

MongoDB: Document storage database model

Page 10: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Cassandra is scalable, fault-tolerant, and consistent. All nodes are equal.

Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.

Key components: Node, Cluster, Commit log, Mem-table, SSTable and Bloom filter

Content: http://www.tutorialspoint.com/cassandra/cassandra_architecture.htm

Page 11: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Ring structure, peer-to-peer architecture

All nodes are equal

This improves general database availability

Scaling up and scaling down is easier

Cassandra is a key-value, column-oriented database
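
The ring idea can be illustrated with a small consistent-hashing sketch. This is a simplification under stated assumptions: it uses MD5 tokens and one token per node, whereas a real Cassandra cluster uses its configured partitioner and usually many virtual nodes per host. The class name `TokenRing` and its methods are made up for this example.

```python
import bisect
import hashlib
from typing import List


class TokenRing:
    """Toy token ring: every node owns the range ending at its token."""

    def __init__(self, nodes: List[str]):
        self._ring = sorted((self._token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value: str) -> int:
        # Assumption: MD5 is used here only as a convenient, stable hash.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key: str, replication_factor: int = 1) -> List[str]:
        """Return the nodes that would store `key`, walking clockwise on the ring."""
        start = bisect.bisect_right(self._tokens, self._token(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c"])
    print(ring.replicas("user:42", replication_factor=2))
```

Because no node is special, adding or removing a node only changes ownership of the token ranges next to it, which is why scaling up and down is comparatively easy.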

Page 12: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Content:http://demoiselle.sourceforge.net/component/demoiselle-cassandra/1.0.0/images/datamodel1.png

Page 13: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Cassandra has an internal keyspace called system, which stores metadata about the cluster.

Metadata:

The node's token

The cluster name

Keyspace and schema definitions (dynamic loading)

Whether or not the node is bootstrapped

Content: https://www.edureka.co/blog/category/apache-cassandra/

Page 14: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Commit log: Crash recovery mechanism. Every write operation is written to the commit log.

Mem-table: A memory-resident data structure that holds recent writes.

SSTable: A disk file to which the data is flushed from the mem-table.
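
How these three components cooperate on a write can be sketched in a few lines. This is an in-memory toy, not Cassandra's implementation: the commit log and SSTables are plain Python lists standing in for files, and the names `TinyStore` and `flush_threshold` are invented for the example.

```python
class TinyStore:
    """Toy write path: commit log first, then mem-table, then flush to an SSTable."""

    def __init__(self, flush_threshold: int = 2):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # each SSTable is an immutable, sorted list of pairs
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write for crash recovery
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                     # 3. flush when the mem-table is full

    def _flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # flushed data no longer needs the log for recovery

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None


if __name__ == "__main__":
    store = TinyStore()
    store.write("b", 1)
    store.write("a", 2)
    print(store.sstables, store.read("a"))
```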

Page 15: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Bloom filters are used as a performance booster.

Bloom filters are fast, probabilistic structures for testing whether an element is a member of a set.

Bloom filters serve as a special kind of cache: lookups are quick because the filters reside in memory.
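
A Bloom filter can answer "definitely not present" or "possibly present", which lets a read skip SSTables that cannot contain the requested key. The sketch below is a generic Bloom filter, not Cassandra's own; the bit-array size, the number of hash functions and the use of SHA-256 are arbitrary choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        # Derive several positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[position] for position in self._positions(item))


if __name__ == "__main__":
    bloom = BloomFilter()
    bloom.add("row-17")
    print(bloom.might_contain("row-17"))
```

If `might_contain` returns False, the key is certainly absent and the disk read can be skipped; if it returns True, the SSTable still has to be checked.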

Page 16: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra

Gossip protocol: Communication between nodes, coordination and failure detection

Anti-entropy protocol: Replica synchronization mechanism ensuring that data on different nodes is kept up to date (uses Merkle trees)

Snitches determine host proximity
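
The role of Merkle trees in anti-entropy can be shown with a small sketch: two replicas compare a single root hash and only need to exchange data when the roots differ. The row encoding and the helper name `merkle_root` are assumptions made for this example; real implementations hash token ranges and compare subtrees to narrow down the differing range.

```python
import hashlib
from typing import Dict


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(rows: Dict[str, str]) -> bytes:
    """Compute a Merkle root over the rows of one replica, in key order."""
    level = [_hash(f"{key}={value}".encode("utf-8")) for key, value in sorted(rows.items())]
    if not level:
        return _hash(b"")
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
    replica_b = {"k1": "v1", "k2": "stale", "k3": "v3"}
    print(merkle_root(replica_a) == merkle_root(replica_b))
```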

Page 17: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache Cassandra: Read/Write operation

Page 18: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache HBase

HBase is a sparse, distributed, consistent, multidimensional sorted map.

HBase is a key/value store.

A value is addressed by row key, column family, column and timestamp.
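
The "multidimensional sorted map" can be pictured as nested dictionaries. The sketch below is only a mental model, not the HBase client API: `TinyTable`, `put`, `get` and `scan` are hypothetical names, and timestamps are passed in explicitly.

```python
from collections import defaultdict


class TinyTable:
    """Toy model: row key -> "family:column" -> timestamp -> value."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, value, timestamp):
        self.rows[row][f"{family}:{column}"][timestamp] = value

    def get(self, row, family, column):
        """Return the newest version of a cell, or None if it does not exist."""
        # .get() avoids creating empty entries for rows that were never written.
        versions = self.rows.get(row, {}).get(f"{family}:{column}", {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Row keys come back in sorted order, as in a sorted map."""
        return sorted(self.rows)


if __name__ == "__main__":
    table = TinyTable()
    table.put("row2", "info", "name", "Bob", timestamp=1)
    table.put("row1", "info", "name", "Al", timestamp=1)
    table.put("row1", "info", "name", "Alice", timestamp=2)
    print(table.get("row1", "info", "name"), table.scan())
```

The table is sparse because a row only stores the columns that were actually written.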

Page 19: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache HBase

Content:http://zhangjunhd.github.io/assets/2013-02-25-apache-hbase/rowkey-columnkey.png

Page 20: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache HBase Components

Region: Contiguous rows form a region.

Region server (RS): Serves one or more regions.

Master server: Daemon responsible for managing the HBase cluster.

HDFS: Distributed, open-source file system containing HBase's data.

ZooKeeper: Distributed, open-source coordination service for coordinating the master and region servers.

Content: https://www.mapr.com/blog/in-depth-look-hbase-architecture

Page 21: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Apache HBase Architecture

Page 22: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

First Read/Write to HBase

The client obtains from ZooKeeper the region server (RS) that hosts the meta table.

The client asks that meta table server for the RS which holds the corresponding row key.

The client receives the row from the respective region server.

The client caches this information along with the location of the meta table server.
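
The lookup and caching steps above can be sketched as follows. Everything here is a simplified stand-in: ZooKeeper, the meta table and the region servers are plain dictionaries and lists, and `HBaseLikeClient` is a hypothetical class, not the HBase client library. Regions are `(start_key, end_key, server)` tuples with an exclusive end key, where `None` means "no upper bound".

```python
class HBaseLikeClient:
    """Toy client: ZooKeeper -> meta table -> region server, with caching."""

    def __init__(self, zookeeper, meta_by_server, region_servers):
        self.zookeeper = zookeeper            # {"meta_server": server name}
        self.meta_by_server = meta_by_server  # server name -> list of regions
        self.region_servers = region_servers  # server name -> {row key: value}
        self.meta_location = None             # cached after the first lookup
        self.region_cache = []                # regions already resolved
        self.zk_lookups = 0
        self.meta_lookups = 0

    @staticmethod
    def _in_region(region, row_key):
        start, end, _server = region
        return start <= row_key and (end is None or row_key < end)

    def _locate(self, row_key):
        for region in self.region_cache:      # cached answer: no extra round trips
            if self._in_region(region, row_key):
                return region[2]
        if self.meta_location is None:        # step 1: ask ZooKeeper once
            self.meta_location = self.zookeeper["meta_server"]
            self.zk_lookups += 1
        self.meta_lookups += 1                # step 2: ask the meta table
        for region in self.meta_by_server[self.meta_location]:
            if self._in_region(region, row_key):
                self.region_cache.append(region)
                return region[2]
        raise KeyError(row_key)

    def get(self, row_key):
        server = self._locate(row_key)        # step 3: read from the region server
        return self.region_servers[server].get(row_key)


if __name__ == "__main__":
    client = HBaseLikeClient(
        {"meta_server": "rs1"},
        {"rs1": [("", "m", "rs1"), ("m", None, "rs2")]},
        {"rs1": {"apple": 1}, "rs2": {"pear": 2}},
    )
    print(client.get("apple"), client.get("pear"), client.meta_lookups)
```

Later requests for rows in an already known region go straight to the region server, which is the point of the client-side cache.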

Page 23: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

HBase RS Components

WAL: The Write Ahead Log is a file on the distributed file system. It stores new data that has not yet been persisted to permanent storage.

Block Cache: The read cache. It stores frequently read data in memory.

MemStore: The write cache. It stores new data which has not yet been written to disk.

HFiles store the rows as sorted key values on disk.

Page 24: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

HBase Write steps (1)

The client's write is first appended to the WAL file stored on disk.

The WAL is used to recover data that has not yet been persisted in case a server crashes.

Once the data is written to the WAL, it is placed in the MemStore.

Page 25: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

HDFS Write steps (2)

All writes and reads go to and come from the primary node.

HDFS replicates the WAL and HFile blocks. Replication happens automatically.

When data is written to HDFS, one copy is written locally, then it is replicated to a secondary node and later to a tertiary node.

Page 26: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Cassandra and HBase

Cassandra use case: Availability and partition tolerance requirements.

Consistency is tunable: it can be raised by choosing a higher consistency level.

HBase use case: Consistency and scalability requirements. With a small number of nodes/threads, high availability is also achieved.
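
Tunable consistency in Dynamo-style systems is commonly explained with a simple rule: if the number of replicas that acknowledge a write (W) plus the number of replicas consulted on a read (R) exceeds the replication factor (N), every read overlaps with the latest successful write. The sketch below only encodes this arithmetic; the function names are made up for the example.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > N)."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    print(quorum(n))                            # majority of three replicas
    print(reads_see_latest_write(n, w=2, r=2))  # quorum writes and quorum reads
    print(reads_see_latest_write(n, w=1, r=1))  # fastest, but only eventually consistent
```

Raising W or R buys consistency at the cost of latency and availability, which is the trade-off the slide refers to.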

Page 27: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

MongoDB

Document-oriented database

High performance and automatic scaling

High consistency and partition tolerance

Replication and failover for high availability

Low latency

Flexible indexing

Page 28: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

MongoDB Concepts

A document is the basic unit of data in MongoDB (comparable to a row).

A collection is similar to a table.

A single instance can host multiple independent databases.

Every document has a special key, "_id".

Powerful JavaScript shell for administration

The config database contains the metadata of the cluster.

Page 29: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

MongoDB Simple Architecture

Page 30: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Read/Write MongoDB

A mongos receives queries from applications.

It uses metadata from the config server to locate the data.

Mongos directs write operations to a particular shard.

Mongos uses the cluster metadata from the config database.
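
The routing role of mongos can be sketched with range-based sharding. This is a toy model under stated assumptions: the "config metadata" is just a sorted list of split points, shards are dictionaries, and `ShardRouter` is a hypothetical class rather than anything from the MongoDB drivers.

```python
import bisect


class ShardRouter:
    """Toy router: picks the shard for a key from range metadata."""

    def __init__(self, split_points, shard_names):
        if len(shard_names) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = sorted(split_points)  # stands in for config metadata
        self.shard_names = list(shard_names)
        self.shards = {name: {} for name in shard_names}

    def shard_for(self, shard_key):
        # Keys below the first split point go to the first shard, and so on.
        return self.shard_names[bisect.bisect_right(self.split_points, shard_key)]

    def insert(self, shard_key, document):
        target = self.shard_for(shard_key)
        self.shards[target][shard_key] = document
        return target

    def find(self, shard_key):
        return self.shards[self.shard_for(shard_key)].get(shard_key)


if __name__ == "__main__":
    router = ShardRouter([100, 200], ["shardA", "shardB", "shardC"])
    print(router.insert(150, {"_id": 150, "name": "example"}))
    print(router.find(150))
```

The application only talks to the router; which shard holds a document is decided entirely from the metadata.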

Page 31: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Recap: Importance of Benchmark and Factors

Scalability

Availability

Partition tolerance

Consistency

Most important: performance

Yahoo Cloud Serving Benchmark (YCSB)

Page 32: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Results: Load Process

Page 33: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Results: Read/Write Mix Workload

Page 34: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Results: Read/Scan Mix Workload

Page 35: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Results: Read Latency across all workloads

Page 36: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Results: Insert Latency across all workloads

Page 37: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Live Demo

Let's migrate from a traditional database!

Page 38: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Discussion

Identify the data model for the application

The corresponding data sets have to be known

Determine whether the application requires replication

Identify the performance requirements

Prototype the application

Test the performance of the prototype

Page 39: Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Conclusion

NoSQL databases have replaced traditional relational databases in many applications

Performance is the key feature

Benchmarks are important

The performance of the top three NoSQL databases was tested

Cassandra outperformed the other NoSQL databases tested

Decide based on the application
