Apache Cassandra

Apache Cassandra: a massively scalable, open-source, distributed NoSQL (“Not only SQL”) database

Outline

Overview

Data Model

Installation

CQL

Connecting nodes

Applications

Bibliography

Overview

How did Cassandra evolve?

Core technologies

Google BigTable

Amazon Dynamo

Rapid Evolution

Developed at Facebook for inbox search

Initial release in 2008.

Latest release: version 3.8, as of March 8, 2016.

A top-level Apache project since 2010.

Features

Well suited to managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud, so industrial-strength big data can be handled.

Delivers continuous availability, linear and elastic scalability across many commodity servers with no single point of failure, along with a dynamic data model designed for maximum flexibility and fast response times.

Provides the “AID” properties of ACID (atomicity, isolation, durability). The “C” (consistency) does not apply to Cassandra in the relational sense, so there is no concept of referential integrity or foreign keys.

Supports “always on” architecture and real time applications.

Interacting with Cassandra

➔ Authorised users can connect to any node in any data center and access data using CQL.

➔ For production, DataStax provides drivers that pass CQL statements from the client to the cluster and back (much like JDBC/ODBC).

➔ CQL shell (cqlsh) is used to interact with Cassandra.

➔ Developer and operator roles are merged into “DevOps”.

➔ Helenos provides a GUI for data exploration and schema management in Cassandra.

Automatic data distribution

Just make a ring (synonymous with a database cluster) and Cassandra will distribute data to all of the nodes participating in the ring.

No programming is required to distribute data across nodes: fragment and location transparency.

Network disruptions or node failures can leave replicas inconsistent; Cassandra mitigates this automatically. A repair operation (nodetool repair) can also be run to restore consistency.

All the nodes in a cluster play the same role. Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
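The ring idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the node names and keys are hypothetical, each node owns a single token, and SHA-256 stands in for Cassandra's real partitioner.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key onto the ring. SHA-256 is a stand-in, not Cassandra's partitioner."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns one token; a key goes to the next token clockwise."""

    def __init__(self, nodes):
        self._tokens = sorted((token_for(name), name) for name in nodes)

    def owner(self, partition_key: str) -> str:
        token = token_for(partition_key)
        tokens_only = [t for t, _ in self._tokens]
        # Wrap around to the first node when the key hashes past the last token.
        index = bisect.bisect_left(tokens_only, token) % len(self._tokens)
        return self._tokens[index][1]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
ring = Ring(nodes)
placement = {key: ring.owner(key) for key in ("user:1", "user:2", "user:3", "user:4")}
for key, node in placement.items():
    print(key, "->", node)
```

The client never names a node: the key alone decides where the row lives, which is what makes the distribution transparent.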

Built-in and customizable Replication

● Automatically stores redundant copies of data across the nodes that participate in the Cassandra ring.

● Replication across one data center or multiple data centers is enabled through configuration.

● If it is detected that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the client.

● After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.

● When a node goes down, read/write requests can be served from other nodes in the network.
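The “most recent value wins, then repair the stale copies” behaviour can be sketched as follows. This is a simplified model, not Cassandra's implementation: the replica names and values are hypothetical, and the repair runs inline here rather than in the background.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    timestamp: int  # write time; the larger timestamp is the newer write


def read_with_repair(replicas, key):
    """Return the newest value among the replicas and overwrite stale copies."""
    answers = [store[key] for store in replicas.values() if key in store]
    if not answers:
        return None
    newest = max(answers, key=lambda cell: cell.timestamp)
    for store in replicas.values():
        current = store.get(key)
        if current is None or current.timestamp < newest.timestamp:
            store[key] = newest  # Cassandra does this step in the background
    return newest.value


replicas = {
    "node-a": {"status": Versioned("v1", 1)},
    "node-b": {"status": Versioned("v2", 2)},  # the only up-to-date replica
    "node-c": {"status": Versioned("v1", 1)},
}
result = read_with_repair(replicas, "status")
print(result)
```

After the call, every replica holds the value with the highest timestamp, so a later read sees the same answer from any node.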

Linear Scalability

Horizontal scaling: add commodity hardware to a cluster.

Fig. 1: Linear scalability (courtesy: DataStax)

Components of Cassandra

Node: Place to store data.

Data Center: Collection of related nodes.

Cluster: One or more data centers.

Commit log: Crash recovery mechanism; stores every write operation.

Mem-table: Memory-resident data structure.

SSTable: Disk file to which data is flushed from the mem-table when its contents reach a threshold level.

Bloom Filter: Algorithm that tests whether an element is a member of a set. A special cache, accessed after every query.
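A bloom filter can be sketched in a few lines. This is a generic textbook version under assumed parameters (1024 bits, 3 hash positions derived from SHA-256), not the structure Cassandra actually stores on disk.

```python
import hashlib


class BloomFilter:
    """Set-membership test that may report false positives but never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))


row_keys = BloomFilter()
row_keys.add("row-1")
print(row_keys.might_contain("row-1"))
```

A “definitely absent” answer lets a read skip an SSTable without touching the disk, which is why the filter is consulted on every query.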

Operation mechanism

Write Operation

Every write activity of nodes is captured by the commit logs.

Later the data will be captured and stored in the mem-table.

Whenever the mem-table is full, data will be written into the SSTable data file.

All writes are automatically partitioned and replicated throughout the cluster.

Read Operation

During a read operation, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.
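The write and read paths can be modelled with a toy single node. The class name, the flush threshold of two entries, and the plain dictionaries are all assumptions made for illustration; replication, compaction, commit log cleanup, and the real bloom filter are left out.

```python
class ToyNode:
    """Single-node model of the commit log, mem-table, and SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory writes not yet flushed
        self.sstables = []    # flushed, immutable tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for crash recovery
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                     # 3. flush once the threshold is reached

    def _flush(self):
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest flushed data first
            # A real node would ask a bloom filter before reading the file.
            if key in sstable:
                return sstable[key]
        return None


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)  # second entry reaches the limit and triggers a flush
node.write("a", 3)  # newer value for "a" sits in the mem-table
print(node.read("a"), node.read("b"))
```

The read for "a" is answered from the mem-table, while "b" is found in the flushed table, mirroring the order described above.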

Data Model of Cassandra

Structure of a cluster

Data Model

A combination of key-value and tabular database models, which is schema-free.

A column family is equivalent to a table. Column families may be created, dropped, and altered at run-time without blocking updates and queries.

A row has multiple columns, each of which has a name, a value, and a timestamp.

Different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.
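One way to picture this model is as nested maps: column family, then row key, then column name, each column holding a value and a timestamp. The sketch below uses hypothetical row keys and column names and plain integers as timestamps.

```python
# column family: row key -> column name -> (value, timestamp)
users = {}


def put_column(column_family, row_key, name, value, timestamp):
    """Add or overwrite one column of one row."""
    column_family.setdefault(row_key, {})[name] = (value, timestamp)


put_column(users, "alice", "email", "alice@example.com", timestamp=1)
put_column(users, "alice", "city", "Pune", timestamp=2)
put_column(users, "bob", "email", "bob@example.com", timestamp=3)

for row_key, columns in users.items():
    print(row_key, sorted(columns))
```

The two rows live in the same column family yet carry different sets of columns, and a new column can be added to a single row at any time.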

Uses the gossip protocol for communication between nodes in a cluster.

Difference from relational database

● The concept of JOINs between tables does not exist, although client side joins are used in applications.

● Denormalization is encouraged, through features such as collections.

● The paradigm shift between relational and NoSQL means that a straight move of data from an RDBMS to Cassandra is doomed to failure. So, perform ETL.

Installation

Hardware requirements for Production

● Memory: at least 8 GB.

● CPU: at least a 4-core processor.

● SSDs (solid-state disks) are recommended for Cassandra.

● Cassandra is so efficient at writing that the CPU is a limiting factor.

● Deploying Cassandra on XFS or ext4 is recommended.

● Recommended network bandwidth is 1000 Mbit/s or greater.

Installing Cassandra

● Install the latest Java 7 release.

● Download Cassandra from http://cassandra.apache.org/download/ and untar the tarball.

● Cassandra includes important directories such as bin (executables, e.g. cassandra and cqlsh) and conf (configuration files, e.g. cassandra.yaml).

● Review cassandra.yaml and set the permissions on the folders it lists.

● Set the JAVA_HOME, CASSANDRA_HOME and PATH environment variables according to your installation folder.

● Verify ports 22 (SSH), 7000, 7001, 7199, 9042, 9160 are open and available.


Cassandra Query Language (CQL)

Keyspace operations

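● A keyspace is the outermost container of data in a cluster.

● A cluster typically contains one keyspace per application; a keyspace spans the nodes of the cluster.

● Created with properties such as replication and durable writes (whether updates go through the commit log, which is what allows recovery after a failure).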
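As an illustration of those two properties, the helper below only builds the text of a CREATE KEYSPACE statement; it does not connect to a cluster. The helper name, the keyspace name demo, and the choice of SimpleStrategy with a replication factor of 3 are assumptions for the example.

```python
def create_keyspace_cql(name: str, replication_factor: int, durable_writes: bool = True) -> str:
    """Build a CREATE KEYSPACE statement (hypothetical helper, text only)."""
    replication = "{'class': 'SimpleStrategy', 'replication_factor': %d}" % replication_factor
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {replication} "
        f"AND durable_writes = {str(durable_writes).lower()};"
    )


statement = create_keyspace_cql("demo", 3)
print(statement)
```

The printed statement can be pasted into cqlsh; for multiple data centers the replication class would be NetworkTopologyStrategy instead.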

Table/Column family Operations

Executing statements in Batch

Helps in executing multiple modification statements simultaneously.
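A batch wraps several modification statements between BEGIN BATCH and APPLY BATCH. The sketch below only assembles that text; the helper name and the users table with its columns are hypothetical.

```python
def batch_cql(statements):
    """Wrap modification statements in a CQL batch (hypothetical helper, text only)."""
    body = "\n".join("  " + statement.rstrip(";") + ";" for statement in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH;"


batch = batch_cql([
    "INSERT INTO users (id, name) VALUES (1, 'Ada')",
    "UPDATE users SET name = 'Grace' WHERE id = 2",
    "DELETE FROM users WHERE id = 3",
])
print(batch)
```

Only INSERT, UPDATE, and DELETE statements belong inside a batch; queries do not.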

Collection data type: list

Collection data type: Map

The map data type stores key-value pairs. Using a map, timestamp-related information in user profiles can be stored.

A table in Cassandra is a distributed multi-dimensional map indexed by a key.

Facts where CQL differs from SQL

Inserting a new record with an existing primary key will overwrite the old one, without any warning.

When inserting more than 1,000 records, cqlsh may ignore the rest. For bulk loads, the sstableloader tool is recommended.

There is no auto-increment option.

Field names are not case-sensitive (unless quoted).
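The silent overwrite and the case-insensitive names can be imitated with a small in-memory table. The class and its column names are hypothetical; this models the behaviour rather than calling Cassandra.

```python
class ToyTable:
    """In-memory model of INSERT acting as an upsert keyed by the primary key."""

    def __init__(self, primary_key: str):
        self.primary_key = primary_key
        self.rows = {}

    def insert(self, **columns):
        # Unquoted CQL identifiers are case-insensitive; model that by lowercasing names.
        row = {name.lower(): value for name, value in columns.items()}
        key = row[self.primary_key]
        # An existing key is written over instead of raising a duplicate-key error.
        self.rows.setdefault(key, {}).update(row)


table = ToyTable("id")
table.insert(id=1, Name="Ada")
table.insert(ID=1, name="Grace")  # same key, different capitalisation
print(table.rows)
```

A relational database would reject the second insert as a duplicate key; here it simply wins, so the application has to check for existing rows itself when that matters.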

Connecting nodes in a cluster

Setup for data distribution

To communicate from one server to another, Cassandra needs ports 7000, 7001 (SSL), 7199 (JMX), 9042, and 9160 to be open.

There is no master node, so failover is automatic.

Each node must specify a live seed node** to contact in its configuration.

To let nodes communicate, change the endpoint_snitch parameter from SimpleSnitch to RackInferringSnitch.

The node list can be viewed with \cassandra\bin\nodetool status.

** Seeds are used during startup to discover the cluster. Nodes in the ring tend to send gossip messages to seeds more often than to non-seeds. Any node can be a seed; there is nothing special about it.
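How a single seed lets every node discover the rest of the cluster can be sketched as a small simulation. It is a simplification under stated assumptions: node names are hypothetical, each node contacts one random known peer per round, and the two sides simply merge their membership lists, which is far less than the real gossip protocol exchanges.

```python
import random


def discover(node_names, seed, max_rounds=100, rng=None):
    """Spread cluster membership by pairwise exchanges, starting from one seed."""
    rng = rng or random.Random(0)
    # Every node starts out knowing only itself and the seed from its configuration.
    views = {name: {name, seed} for name in node_names}
    everyone = set(node_names)
    for round_number in range(1, max_rounds + 1):
        for name in node_names:
            peers = sorted(views[name] - {name})
            if not peers:
                continue  # the seed knows nobody until another node contacts it
            peer = rng.choice(peers)
            merged = views[name] | views[peer]
            views[name] = set(merged)
            views[peer] = set(merged)
        if all(view == everyone for view in views.values()):
            return round_number, views
    return None, views


names = ["seed-1", "n1", "n2", "n3", "n4"]
rounds, views = discover(names, seed="seed-1")
print("converged after", rounds, "rounds")
```

The seed is an ordinary node that happens to be the first contact; once membership has spread, any node can pass it on.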

Applications

Playlists and collections (Spotify)

Personalization and recommendation engine (eBay)

Messaging (Instagram)

Sensor data (Zonar)

Corporate users: Cassandra users include Facebook, Twitter, Cisco, Apple, eBay, Netflix, and Rackspace.

Trends

Bibliography

Wikipedia, “Apache Cassandra”, retrieved February 2016.

Tutorials Point, “Cassandra Tutorial”, retrieved March 2016.

Wikiversity, “Big Data: Cassandra”, retrieved March 2016.

DataStax Academy course, “DS201 Cassandra Core Concepts”, accessed March 2016.

Planet Cassandra documentation, “Apache Cassandra”, retrieved March 2016.

Stack Overflow Developer Survey 2016 report.

Thank You 😌

Find this presentation at: www.slideshare.net/VishalKumarJaiswal2

Feel free to contact me at: [email protected]

© Vishal Kumar Jaiswal
