CASSANDRA - Next to RDBMS

CASSANDRA – An Open Source Data Storage system

Presented By: Vipul Kumar, Cr No. - 11/269

UNIVERSITY COLLEGE OF ENGINEERING, KOTA RAJASTHAN TECHNICAL UNIVERSITY

Presented To: Mr R K Banyal Sir, CSE Department

COMPUTER SCIENCE AND ENGINEERING DEPARTMENT

Contents

• What is Cassandra?
• History
• Data Model
• System architecture
• Key features and benefits
• Who is using Cassandra?
• Conclusion and future scope


Apache Cassandra™ is a free, open source NoSQL database that is

• Distributed
• High performance
• Extremely scalable
• Fault tolerant (i.e. no single point of failure)

Definition of Cassandra

• Cassandra builds on two earlier systems: Bigtable (data model) and Dynamo (distributed design).

The history of Cassandra

• A table is a multi-dimensional map indexed by a key (the row key).

• Columns are grouped into Column Families.
• 2 types of Column Families
– Simple
– Super (nested Column Families)

• Each Column has
– Name
– Value
– Timestamp
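The nested map described above can be sketched in a few lines of Python. This is only an illustration of a simple column family, not Cassandra's storage code: the names `table`, `put`, and `get` are hypothetical, and the rule that the newest timestamp wins on conflicting writes is an assumption added here to show what the timestamp is for.

```python
import time

# Hypothetical in-memory model (not Cassandra's API):
# row key -> column family -> column name -> (value, timestamp)
table = {}

def put(row_key, column_family, name, value, timestamp=None):
    """Store a column; assumed rule: the newest timestamp wins."""
    ts = time.time() if timestamp is None else timestamp
    columns = table.setdefault(row_key, {}).setdefault(column_family, {})
    current = columns.get(name)
    if current is None or ts >= current[1]:
        columns[name] = (value, ts)

def get(row_key, column_family, name):
    """Return the stored value, or None if the column is absent."""
    column = table.get(row_key, {}).get(column_family, {}).get(name)
    return None if column is None else column[0]

put("user42", "profile", "name", "Vipul", timestamp=1)
put("user42", "profile", "name", "Vipul Kumar", timestamp=2)
put("user42", "profile", "name", "stale write", timestamp=1)  # older, ignored
```

Reading `get("user42", "profile", "name")` then returns the value written with the highest timestamp, and a column that was never written reads as `None`.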

Data Model


• Partitioning: how data is partitioned across nodes

• Replication: how data is duplicated across nodes

System Architecture

• Nodes are logically arranged in a ring topology.

• The hashed value of the key associated with a data partition is used to assign it to a node in the ring.

• Hash values wrap around after a maximum value, which gives the ring its structure.

• Lightly loaded nodes move position on the ring to relieve heavily loaded nodes.
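The ring assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Cassandra's partitioner: the 32-bit token space (`RING_SIZE`), the use of SHA-256, the node names, and the `Ring` class are all hypothetical choices made for the example. Load-based node movement is not modelled.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed token space; positions wrap around at this value

def token(key):
    """Hash a key to a position on the ring (hash choice is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE

class Ring:
    def __init__(self, node_names):
        # Each node sits at the ring position given by the hash of its name.
        self.positions = sorted((token(name), name) for name in node_names)

    def owner(self, key):
        """Return the first node at or after the key's position.

        The modulo makes a key past the last node wrap to the first node.
        """
        tokens = [pos for pos, _ in self.positions]
        index = bisect.bisect_left(tokens, token(key)) % len(self.positions)
        return self.positions[index][1]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.owner("user42"))
```

Because only the hash of the key decides the owner, any node can compute where a key lives without asking a central directory.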

Partitioning

• Each data item is replicated at N (replication factor) nodes.

• Different replication policies
– Rack Unaware – replicate data at the N-1 successive nodes after its coordinator
– Rack Aware – uses ‘Zookeeper’ to choose a leader which tells nodes the range they are replicas for
– Datacenter Aware – similar to Rack Aware, but the leader is chosen at datacenter level instead of rack level
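As a rough sketch of the Rack Unaware policy only, the replica set is the coordinator plus the N-1 nodes that follow it on the ring. The function name `rack_unaware_replicas` and the five-node ring are hypothetical; the Rack Aware and Datacenter Aware policies need leader election and are not shown.

```python
def rack_unaware_replicas(ring_nodes, coordinator_index, n):
    """Return the coordinator plus the N-1 nodes that follow it on the ring."""
    count = min(n, len(ring_nodes))  # cannot have more replicas than nodes
    return [
        ring_nodes[(coordinator_index + i) % len(ring_nodes)]  # wrap around
        for i in range(count)
    ]

nodes = ["A", "B", "C", "D", "E"]  # nodes listed in ring order
replicas = rack_unaware_replicas(nodes, coordinator_index=3, n=3)
print(replicas)  # ['D', 'E', 'A']
```

With a replication factor of 3 and coordinator D, the copies land on D, E, and then A, since the ring wraps around after E.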

Replication


Gossip Protocol

• A network communication protocol inspired by real-life rumor spreading.

• Periodic, pairwise, inter-node communication.
• Low-frequency communication keeps the cost low.
• Random selection of peers.
• Example – Node A wishes to search for a pattern in data
– Round 1 – Node A searches locally and then gossips with node B.
– Round 2 – Nodes A and B gossip with C and D.
– Round 3 – Nodes A, B, C and D gossip with 4 other nodes ……

• Round-by-round doubling means the number of rounds grows only logarithmically with cluster size, which makes the protocol very robust.
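The round-by-round spread can be simulated with a small sketch. It assumes a simplified push model in which every informed node contacts one randomly chosen other node per round; the function name `gossip_rounds`, the fixed seed, and the round cap are hypothetical and not taken from Cassandra's gossiper.

```python
import random

def gossip_rounds(node_count, seed=0, max_rounds=1000):
    """Simulate push gossip; return (rounds used, nodes informed)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the information
    rounds = 0
    while len(informed) < node_count and rounds < max_rounds:
        contacted = set()
        for node in informed:
            # Pick a random peer other than the node itself.
            peer = rng.randrange(node_count - 1)
            if peer >= node:
                peer += 1
            contacted.add(peer)
        informed |= contacted
        rounds += 1
    return rounds, len(informed)

rounds, reached = gossip_rounds(64)
print(rounds, reached)
```

Since each informed node reaches at most one new node per round, the informed set can at most double each round, so 64 nodes need at least 6 rounds. Random peer choice sometimes hits an already informed node, so the simulated count is usually a little higher.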

Key features & benefits

• Gigabyte to petabyte scalability
• Big data scalability
• No single point of failure
• Easy replication / data distribution
• No need for caching software
• Flexible schema

Big Data Scalability

• Capable of comfortably scaling to petabytes
• New nodes = linear performance increases
• Add new nodes online

[Figure: a 2-node cluster grows to a 4-node cluster; double throughput capacity]

No single point of failure

• All nodes are the same
• Read/write from any node
• Can replicate data among different physical data center racks

Easy Replication

• Transparently handled by Cassandra
• Multi data center capable
• Exploits all the benefits of cloud computing

No need for caching layer

• The peer-to-peer design removes the need for a special caching layer and the programming that goes with it.

• The database uses the memory of all the participating nodes to cache the data assigned to them.

Flexible Schema

• Dynamic schema design allows for more flexible data storage than a rigid RDBMS schema

• Handles structured, semi-structured and unstructured data.
• No offline time / downtime for schema changes

Who uses Cassandra

Conclusion & future scope

• Cassandra is an open source storage system providing scalability, high performance, and wide applicability.

• Cassandra can support a very high update throughput while delivering low latency.

• Future work involves adding compression, support for atomicity across keys, and secondary index support.

Thank You
