Introduction to Cassandra

Presented on 26th Feb 2014


Scope

• Introduction to Cassandra and NoSQL
• Understanding the Cassandra data model
• Configuration, reading and writing data in Cassandra
• CQL


What is Cassandra

• A database
• Uses the fully distributed design of Amazon’s Dynamo
• Uses the column-family data model of Google’s Bigtable
• Developed at Facebook (the team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Ranganathan, and Prashant Malik of the Search Team)
• Open-sourced in 2008


Problems with RDBMS

• Joins: as the data in an RDBMS grows, joins become slow, so retrieval becomes slow.
• Vertical scaling: adding more memory, faster processors, or more disk space to a machine. Adding more machines instead (horizontal scaling) creates problems such as data replication, consistency, and failover.
• Caching layer in large systems: for example memcached, Ehcache, or Oracle Coherence. Keeping the cache and the database in sync becomes harder across a cluster.


Cassandra

• Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. It was created at Facebook.


Why Cassandra

• Fault tolerant
• Decentralized
• Eventually consistent
• Rich data model
• Elastic
• Highly available
• No SPOF (single point of failure)


CAP theorem

• Eric Brewer of the University of California at Berkeley posited his CAP theorem in 2000.

• The theorem states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency.

• Consistency: All database clients will read the same value for the same query, even given concurrent updates.

• Availability: All database clients will always be able to read and write data.

• Partition Tolerance: The database can be split into multiple machines; it can continue functioning in the face of network segmentation breaks.


CAP theorem (cont.)

• According to the theorem, a distributed data system can strongly support only two of the three.

• CA: the system will block when the network partitions, so such systems are often limited to a single data centre to mitigate this.

• CP: data is sharded in order to scale. The data will be consistent, but some data may become unavailable whenever a node goes down.

• AP: the system may return inaccurate data, but it will always be available, even in the face of network partitioning. DNS is perhaps the most popular example of a system that is massively scalable, highly available, and partition-tolerant.


Fault Tolerant

• Data is automatically replicated to multiple nodes, based on the replication factor.
• Replication across multiple data centers is supported.
• Failed nodes can be replaced with no downtime.
• Uses an accrual failure detector for fault detection.


Decentralization

• Every node in the cluster is identical (no client-server architecture).
• There is no single point of failure.


Eventual consistency

• Uses BASE (Basically Available, Soft state, Eventually consistent) consistency.
• As data is replicated, the latest version of a value sits on at least one node in the cluster, while older versions may still be on other nodes.
• Eventually all nodes will see the latest version.


Eventual consistency (Cont.)

• Tuneable consistency: the replication factor is the number of nodes in the cluster that you want each update to propagate to.
• The consistency level is a setting that clients must specify on every operation. It decides how many replicas in the cluster must acknowledge a write operation or respond to a read operation for the operation to be considered successful. In this way Cassandra pushes the decision about consistency out to the client. Strict consistency can be achieved by setting the consistency level equal to the replication factor; more generally, reads see the latest write whenever the read and write consistency levels together exceed the replication factor.
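The relationship between replication factor and consistency level can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and consistency levels are modelled as plain replica counts, not as Cassandra's named levels.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int, write_cl: int, read_cl: int) -> bool:
    """True when every set of read replicas must overlap every set of written replicas."""
    for level in (write_cl, read_cl):
        if not 1 <= level <= replication_factor:
            raise ValueError("consistency level must be between 1 and the replication factor")
    # Overlap is guaranteed when the two counts together exceed the replica count.
    return read_cl + write_cl > replication_factor


# Writing to all 3 replicas (consistency level == replication factor) is strict.
print(is_strongly_consistent(3, write_cl=3, read_cl=1))
# Quorum writes plus quorum reads also overlap.
print(is_strongly_consistent(3, write_cl=quorum(3), read_cl=quorum(3)))
# One replica for both reads and writes is only eventually consistent.
print(is_strongly_consistent(3, write_cl=1, read_cl=1))
```

Lower levels trade this guarantee for lower latency and higher availability, which is the choice the client makes on each operation.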


Rich Data Model

• Keyspace
• Column family
• Rows
• Column
• Super column


Column family

"ToyStore" : {
  "Toys" : {
    "GumDrop" : { "Price" : "0.25", "Section" : "Candy" },
    "Transformer" : { "Price" : "29.99", "Section" : "Action Figures" },
    "MatchboxCar" : { "Price" : "1.49", "Section" : "Vehicles" }
  }
},
"Keyspace1" : null,
"system" : null


Super column family

"ToyCorporation" : {
  "ToyStores" : {
    "Ohio Store" : {
      "Transformer" : { "Price" : "29.99", "Section" : "Action Figures" },
      "GumDrop" : { "Price" : "0.25", "Section" : "Candy" },
      "MatchboxCar" : { "Price" : "1.49", "Section" : "Vehicles" }
    },
    "New York Store" : {
      "JawBreaker" : { "Price" : "4.25", "Section" : "Candy" },
      "MatchboxCar" : { "Price" : "8.79", "Section" : "Vehicles" }
    }
  }
}


Keyspace

A keyspace is similar to a schema in an RDBMS: it has a name and a set of attributes that define keyspace-wide behaviour. The attributes are:

1. Replication factor: if it is set to 3, then 3 nodes will hold a copy of each row.

2. Replica placement strategy: SimpleStrategy (formerly RackUnawareStrategy), OldNetworkTopologyStrategy (formerly RackAwareStrategy), or NetworkTopologyStrategy (formerly DatacenterShardStrategy).

3. Column families: discussed next.
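To make the replication factor concrete, here is a rough Python sketch of how a SimpleStrategy-style placement could pick replicas: the first replica goes to the node that owns the key's token, and the rest go to the next nodes clockwise around the ring. The function name, the node names, and the small integer tokens are all assumptions for illustration; a real cluster derives tokens from its partitioner.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring, key_token, replication_factor):
    """Return the node names that hold a row, walking the ring clockwise.

    ring is a list of (token, node_name) pairs (hypothetical layout).
    """
    if not ring:
        raise ValueError("ring must contain at least one node")
    ordered = sorted(ring)
    tokens = [token for token, _ in ordered]
    # The owner is the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if there is none.
    start = bisect_left(tokens, key_token) % len(ordered)
    count = min(replication_factor, len(ordered))
    return [ordered[(start + i) % len(ordered)][1] for i in range(count)]


ring = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]
print(simple_strategy_replicas(ring, key_token=30, replication_factor=3))
```

With a replication factor of 3, a key whose token is 30 lands on node-c and is then copied to the next two nodes on the ring.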

Column family

• A column family is a container for columns, analogous to a table in a relational system.
• A column family holds an ordered list of columns, each of which is referenced by its column name.

• [Keyspace][ColumnFamily][Key][Column]


Column family (cont.)

A column family has two attributes: a name and a comparator. The comparator indicates the order in which columns are sorted when they are returned by a query. The comparator can be one of the following types: AsciiType, BytesType, LexicalUUIDType, IntegerType, LongType, TimeUUIDType, or UTF8Type. You can also plug in a custom class, which must extend org.apache.cassandra.db.marshal.AbstractType.
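The effect of the comparator can be imitated in plain Python. The sketch below does not use Cassandra at all; it only sorts the same three hypothetical column names once as text (roughly what a UTF8Type comparator would do) and once as integers (roughly what LongType would do).

```python
column_names = ["10", "9", "100"]

# Text ordering, similar in spirit to UTF8Type: compared character by character.
as_text = sorted(column_names)

# Numeric ordering, similar in spirit to LongType: compared as integers.
as_long = sorted(column_names, key=int)

print(as_text)  # ['10', '100', '9']
print(as_long)  # ['9', '10', '100']
```

The same column names come back in a different order depending on the comparator, so the comparator should match how you intend to slice the columns.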


Column family (cont.)

Hotel {
  key: AZC_043 { name: Cambria Suites Hayden, phone: 480-444-4444, address: 400 N. Hayden Rd., city: Scottsdale, state: AZ, zip: 85255 }
  key: AZS_011 { name: Clarion Scottsdale Peak, phone: 480-333-3333, address: 3000 N. Scottsdale Rd, city: Scottsdale, state: AZ, zip: 85255 }
  key: CAS_021 { name: W Hotel, phone: 415-222-2222, address: 181 3rd Street, city: San Francisco, state: CA, zip: 94103 }
  key: NYN_042 { name: Waldorf Hotel, phone: 212-555-5555, address: 301 Park Ave, city: New York, state: NY, zip: 10019 }
}


Rows

• Cassandra is a column-oriented database. Rows do not all need the same number of columns (as they do in a relational database). Each row has a unique key, which makes its data accessible.
• Each column family is stored in a separate file.


Columns

• A column is a name/value pair plus a client-supplied timestamp recording when it was last updated. A column family is a container for rows that have similar, but not identical, column sets.
• Every column carries its own timestamp; rows do not have timestamps.
• Columns are name/value pairs; a regular column stores a byte array as its value.
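A column can be pictured as a small record of name, value, and timestamp. The Python sketch below is an assumption-laden illustration, not Cassandra's internal representation: the Column class and the reconcile helper are hypothetical, and the tie-breaking rule is simplified. It shows the usual reason the timestamp exists, which is that the version with the newest timestamp wins when two replicas disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version with the newer timestamp (ties keep `a` in this sketch)."""
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    return b if b.timestamp > a.timestamp else a


old = Column(b"phone", b"480-444-4444", timestamp=1000)
new = Column(b"phone", b"480-555-0000", timestamp=2000)
print(reconcile(old, new).value)
```

Because the timestamp comes from the client, client clocks need to be reasonably well synchronised for this to behave as expected.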


Super column

• The value of a super column is a map of subcolumns (which store byte array values).
• It is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful for this.
• Super columns are not indexed.
• A regular column family looks like a four-dimensional hash table. With super columns, it becomes more like a five-dimensional hash: [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
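The four- and five-dimensional hash picture maps directly onto nested dictionaries. The following Python sketch reuses a few entries from the toy store slides; it is only a mental model, since Cassandra does not store data as Python dicts.

```python
# Regular column family: [Keyspace][ColumnFamily][Key][Column]
toy_store = {
    "ToyStore": {
        "Toys": {
            "GumDrop": {"Price": "0.25", "Section": "Candy"},
        }
    }
}
gumdrop_price = toy_store["ToyStore"]["Toys"]["GumDrop"]["Price"]

# Super column family: [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
toy_corporation = {
    "ToyCorporation": {
        "ToyStores": {
            "Ohio Store": {
                "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
            },
            "New York Store": {
                "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
            },
        }
    }
}
ny_car_price = toy_corporation["ToyCorporation"]["ToyStores"]["New York Store"]["MatchboxCar"]["Price"]

print(gumdrop_price, ny_car_price)
```

The extra level in the second lookup is the super column name, which groups the subcolumns for one toy within one store's row.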


Some points

• You cannot perform joins in Cassandra. If you have designed a data model and find that you need a join, you’ll have to either do the work on the client side, or create a denormalized second column family that represents the join results for you.
• It is not possible to sort by value. Columns are sorted only by column name, which lets Cassandra fetch individual columns from a row without pulling the entire row into memory.
• Column sorting is controllable, but key sorting isn’t: row keys always sort in byte order.
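A client-side join can be as simple as two lookups stitched together in application code. In the Python sketch below the two dictionaries stand in for two column families; the reservations data and the join_on_client function are hypothetical and exist only to illustrate the idea.

```python
# Stand-in for the Hotel column family: row key -> columns.
hotels = {
    "AZC_043": {"name": "Cambria Suites Hayden", "city": "Scottsdale"},
    "CAS_021": {"name": "W Hotel", "city": "San Francisco"},
}

# Hypothetical second column family that refers to hotels by row key.
reservations = {
    "res_1": {"hotel_key": "CAS_021", "guest": "alice"},
    "res_2": {"hotel_key": "AZC_043", "guest": "bob"},
}


def join_on_client(reservations, hotels):
    """Attach the hotel name to each reservation, as a SQL join would."""
    joined = {}
    for res_id, columns in reservations.items():
        hotel = hotels.get(columns["hotel_key"], {})
        joined[res_id] = {**columns, "hotel_name": hotel.get("name")}
    return joined


print(join_on_client(reservations, hotels)["res_1"])
```

Writing the joined rows into their own column family at insert time, instead of computing them on every read, is the denormalized alternative mentioned above.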


Elastic/Highly Available

• Read and write throughput both increase linearly as new machines are added.
• No downtime or interruption to applications.


Sharding basic strategies

• Feature-based (functional) segmentation: data is sharded by feature, so features that have nothing in common live on different shards. For example, user details and items for sale go on different shards, as do movie ratings and comments.
• Key-based sharding: find a key in the data that will distribute it evenly across shards. Instead of naively assigning, say, one letter of the alphabet to each server, you apply a one-way hash to a key data element and distribute data across machines according to the hash.
• Lookup table: a table that contains information about the location of the actual data.
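Key-based sharding is easy to sketch in Python. The example below hashes a key with MD5 from the standard library and maps the digest onto a fixed number of shards; the function name and the choice of MD5 and modulo are illustrative assumptions, not a description of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a one-way hash."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the digest as a big integer and fold it onto the shard count.
    return int.from_bytes(digest, "big") % num_shards


for user in ("alice", "bob", "carol"):
    print(user, "->", shard_for(user, num_shards=4))
```

The same key always maps to the same shard, and keys that look similar are spread out by the hash. A plain modulo does move most keys when the shard count changes, which is one reason real systems use more careful schemes.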


Design Pattern

1. Materialized view (one table per query): write the data to a second column family created specifically to answer an additional query. “Materialized” means storing a full copy of the original data, so that everything you need to answer the query is right there, without forcing you to look up the original data. If the second column family stores only the column names you use, like foreign keys, and you must perform a second query to fetch the data, that is a secondary index.
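As a rough illustration of the materialized view pattern, the Python sketch below keeps two dictionaries in step: one keyed by hotel key and one keyed by city. Both names and the save_hotel helper are hypothetical; the point is only that the second structure holds full copies, so a query by city needs no second lookup.

```python
hotel_by_key = {}   # original column family: hotel key -> columns
hotel_by_city = {}  # materialized view: city -> {hotel key -> full copy of columns}


def save_hotel(key, columns):
    """Write the hotel to both column families in one application-level step."""
    hotel_by_key[key] = dict(columns)
    hotel_by_city.setdefault(columns["city"], {})[key] = dict(columns)


save_hotel("AZC_043", {"name": "Cambria Suites Hayden", "city": "Scottsdale"})
save_hotel("AZS_011", {"name": "Clarion Scottsdale Peak", "city": "Scottsdale"})
save_hotel("CAS_021", {"name": "W Hotel", "city": "San Francisco"})

# "Which hotels are in Scottsdale?" is answered from the view alone.
print(sorted(h["name"] for h in hotel_by_city["Scottsdale"].values()))
```

The cost is extra storage and an extra write per hotel, which is the usual trade made in exchange for cheap reads.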


Design Pattern (Cont.)

2. Valueless column: store the data in the column name and leave the column value empty. For example, in a user/city column family, the city name can be the row key and the users of that city the column names.

3. Aggregate key: row keys must be unique, so you can join two column values with a separator to create an aggregate key.
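Both of these patterns can be sketched with plain Python structures. The names users_by_city, add_user, and aggregate_key are hypothetical, and the ":" separator is just one possible choice.

```python
users_by_city = {}  # row key: city; column names: user names; column values: empty


def add_user(city, user):
    """Valueless column: the user name is the column name, the value stays empty."""
    users_by_city.setdefault(city, {})[user] = b""


def aggregate_key(*parts, sep=":"):
    """Aggregate key: join several values into one row key."""
    for part in parts:
        if sep in part:
            raise ValueError("a key part must not contain the separator")
    return sep.join(parts)


add_user("Scottsdale", "bob")
add_user("Scottsdale", "alice")

# Column names come back sorted, so the users of a city are already in order.
print(sorted(users_by_city["Scottsdale"]))
print(aggregate_key("AZ", "Scottsdale"))
```

Rejecting parts that contain the separator keeps two different pairs of values from collapsing into the same aggregate key.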


Reference

• Assembled from various resources on the internet.

Thank You
