Introduction to Data Modeling in Cassandra

Posted on 21-Aug-2015

BarCamp Kerala 2015

Who am I?

Software Engineer at RapidValue
Backend Engineer of Gudly
Author of Flask-CQLAlchemy

What is Cassandra?

Massively, linearly scalable NoSQL database
High throughput, with nearly linear scaling for suitable use cases
Row-column oriented, with an SQL-like approach using CQL

Brief History

Created at Facebook by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik
Released as open source in 2008
Became an Apache top-level project in 2010

Best Use-Cases

Playlists & Collections
Sensor Data
Personalization and recommendation engines
Messaging
Fraud Detection

Notable features

No single point of failure
Clearly defined table schema in a NoSQL environment
Near linear horizontal scaling across commodity servers
No joins

Brewer's Conjecture, a.k.a. “CAP Theorem”

Consistency – All nodes see the same data at any given time
Availability – Every request receives a response, whether it succeeded or failed
Partition Tolerance – The system keeps operating even when network failures cut off communication between nodes

Cassandra is an AP database

RDBMS vs Cassandra: Querying

SQL for querying

SELECT * FROM users WHERE name = 'John Doe';

CQL for querying

SELECT * FROM users WHERE name = 'John Doe';

Data Modeling

Collection and analysis of data requirements
Identification of participating entities and relationships
Identification of data access patterns

Data Modeling

A particular way of organizing and structuring data
Design and specification of a database schema
Schema optimization and data indexing techniques

Products of Data Modeling

Conceptual Data model

Technology-independent, unified views of data
Entity-relationship model, dimensional model, etc.

Conceptual Data Model

[Figure: entity-relationship diagram]

Products of Data Modeling

Logical Data model

Specific to Cassandra
Column family diagrams (Chebotko diagrams)

Modeling Guidelines

Writes are cheap, reads are not
Joins are not possible
Duplication is good
Indexing creates latency
All data required to answer a query must reside in a single column family
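The guidelines above can be pictured in plain Python, with dictionaries standing in for column families. This is only a rough sketch, not Cassandra client code, and the names (`emp_by_id`, `emp_by_dept`, `add_employee`) are hypothetical. Each query gets its own table, the write path duplicates the row into both, and every read is a single key lookup with no join.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two column families.
# Each one is keyed by the value its query filters on.
emp_by_id = {}                      # answers: "find an employee by ID"
emp_by_dept = defaultdict(list)     # answers: "list employees in a department"


def add_employee(emp_id, dept_id, first_name, last_name):
    """Write path: duplicate the row into every table that needs it."""
    row = {"empID": emp_id, "deptID": dept_id,
           "first_name": first_name, "last_name": last_name}
    emp_by_id[emp_id] = row
    emp_by_dept[dept_id].append(row)


def employee(emp_id):
    """Read path: one lookup, no join."""
    return emp_by_id.get(emp_id)


def employees_in_dept(dept_id):
    """Read path: one lookup, no join."""
    return [row["first_name"] for row in emp_by_dept.get(dept_id, [])]


add_employee(1, 10, "John", "Doe")
add_employee(2, 10, "Jane", "Roe")
add_employee(3, 20, "Sam", "Poe")

print(employee(2)["last_name"])
print(employees_in_dept(10))
```

The extra write is the price paid for cheap reads: neither query has to scan or combine tables.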

Data Modeling Methodology

For each query:
Identify a subset of the conceptual data model that describes the query data
Apply a suitable mapping pattern to the subset and the query
Use a Chebotko diagram to describe the result as a logical model
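One way to picture the mapping step is a small helper that turns a query description into a CQL table definition. The sketch below assumes one simple, commonly used pattern: columns compared by equality become the partition key, and columns used for ordering become clustering columns. The helper `table_for_query` and the table name `emp_by_dept` are hypothetical, the pattern is not the full set of mapping rules, and the helper does not check that the resulting key uniquely identifies a row.

```python
def table_for_query(name, columns, equality, ordering=()):
    """Build a CREATE TABLE statement for one query.

    columns  -- dict of column name -> CQL type, every column the query returns
    equality -- columns the query filters on with '=' (partition key)
    ordering -- columns the results are ordered by (clustering columns)
    """
    missing = [c for c in (*equality, *ordering) if c not in columns]
    if missing:
        raise ValueError(f"key columns not declared: {missing}")
    if not equality:
        raise ValueError("at least one equality column is required")

    partition = ", ".join(equality)
    if len(equality) > 1:
        partition = f"({partition})"      # composite partition key
    key = ", ".join([partition, *ordering])

    lines = [f"    {col} {cql_type}," for col, cql_type in columns.items()]
    lines.append(f"    PRIMARY KEY ({key})")
    return f"CREATE TABLE {name} (\n" + "\n".join(lines) + "\n);"


# Query: "list the employees of a department, ordered by employee ID"
print(table_for_query(
    "emp_by_dept",
    {"deptID": "int", "empID": "int",
     "first_name": "varchar", "last_name": "varchar"},
    equality=["deptID"],
    ordering=["empID"],
))
```

The generated statement ends with PRIMARY KEY (deptID, empID), so all employees of one department share a partition and come back sorted by empID.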

Products of Data Modeling

Physical Data model

Specific to Cassandra
CQL definitions

Physical Data Model

CQL CREATE statement

    CREATE TABLE emp (
        empID int,
        deptID int,
        first_name varchar,
        last_name varchar,
        PRIMARY KEY (empID, deptID)
    );
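In the CREATE TABLE statement above, the first column of the primary key, empID, is the partition key, and deptID is a clustering column that orders rows inside each partition. A rough Python model of that layout follows. The `EmpTable` class and the sample rows are hypothetical, and a real cluster also hashes the partition key to choose a node, which is left out here.

```python
import bisect
from collections import defaultdict


class EmpTable:
    """Toy model of PRIMARY KEY (empID, deptID)."""

    def __init__(self):
        # partition key (empID) -> list of (deptID, row), kept sorted by deptID
        self._partitions = defaultdict(list)

    def insert(self, emp_id, dept_id, first_name, last_name):
        rows = self._partitions[emp_id]
        keys = [dept for dept, _ in rows]
        pos = bisect.bisect_left(keys, dept_id)
        row = {"empID": emp_id, "deptID": dept_id,
               "first_name": first_name, "last_name": last_name}
        if pos < len(rows) and rows[pos][0] == dept_id:
            rows[pos] = (dept_id, row)    # same primary key: overwrite (upsert)
        else:
            rows.insert(pos, (dept_id, row))

    def select(self, emp_id):
        """Rows of one partition, in clustering order."""
        return [row for _, row in self._partitions.get(emp_id, [])]


table = EmpTable()
table.insert(1, 20, "John", "Doe")
table.insert(1, 10, "John", "Doe")
table.insert(1, 20, "Johnny", "Doe")   # replaces the earlier (1, 20) row
table.insert(2, 10, "Jane", "Roe")

for row in table.select(1):
    print(row["deptID"], row["first_name"])
```

Two rows with the same empID and deptID are the same row, so the third insert overwrites the first instead of adding a duplicate.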

RDBMS vs Cassandra

Cassandra is equally good for complex and simple data
All data required to answer a query must reside in a single column family
Data modeling methodology is driven by queries and data
Data duplication is considered normal

Cassandra in Production

Netflix
Spotify
Twitter

References

http://academy.datastax.com
http://www.slideshare.net/nkorla1share/cass-summit-3
http://docs.datastax.com
http://planetcassandra.com

Contact

Email: iamgeorgethomas@gmail.com

Twitter: @_thegeorgeous

Github: http://github.com/thegeorgeous