Introduction to Data Modeling in Cassandra
Uploaded by george-thomas
BarCamp Kerala 2015
Who am I?
Software Engineer at RapidValue
Backend Engineer of Gudly
Author of Flask-CQLAlchemy
What is Cassandra?
Massively, linearly scalable NoSQL database
High throughput, with near-linear scaling for suitable use cases
Row-and-column oriented, with a SQL-like query language (CQL)
Brief History
Created at Facebook by Avinash Lakshman (a co-author of Amazon's Dynamo) and Prashant Malik
Released as open source in 2008
Became an Apache top-level project in 2010
Best Use-Cases
Playlists & Collections
Sensor Data
Personalization and recommendation engines
Messaging
Fraud Detection
Notable features
No single point of failure
Clearly defined table schema in a NoSQL environment
Near-linear horizontal scaling across commodity servers
No joins
Brewer's Conjecture, a.k.a. “CAP Theorem”
Consistency – All nodes see the same data at any given time
Availability – Every request receives a response, whether it succeeded or failed
Partition Tolerance – The system keeps operating when network failures cut off communication between nodes
Cassandra is an AP database
RDBMS vs Cassandra: Querying
SQL for querying
SELECT * FROM users WHERE name = 'John Doe';
CQL for querying
SELECT * FROM users WHERE name = 'John Doe';
Data Modeling
Collection and analysis of data requirements
Identification of participating entities and relationships
Identification of data access patterns
Data Modeling
A particular way of organizing and structuring data
Design and specification of a database schema
Schema optimization and data indexing techniques
Products of Data Modeling
Conceptual Data model
Technology-independent, unified view of data
Entity-relationship model, dimensional model, etc.
Conceptual Data Model
[Figure: entity-relationship diagram]
Products of Data Modeling
Logical Data model
Specific to Cassandra
Column family diagrams (Chebotko diagrams)
Modeling Guidelines
Writes are cheap, reads are not
Joins are not possible
Duplication is good
Indexing adds latency
All data required to answer a query must be contained in a single column family
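The guidelines above can be illustrated with a small in-memory sketch in Python. This is not Cassandra code: the two dictionaries stand in for two hypothetical tables, `videos_by_user` and `videos_by_tag`, and every name and sample value is made up for illustration. The idea it tries to show is that one write fans out to a table per query, so that each read becomes a single lookup by partition key with no join.

```python
from collections import defaultdict

# Each dict stands in for one table (column family), keyed by its partition key.
# Hypothetical tables: one per query the application needs to answer.
videos_by_user = defaultdict(list)  # query: "videos uploaded by a user"
videos_by_tag = defaultdict(list)   # query: "videos carrying a tag"


def add_video(user, title, tags):
    """Write the same video into every table that serves a query (duplication)."""
    row = {"user": user, "title": title, "tags": list(tags)}
    videos_by_user[user].append(row)
    for tag in tags:
        videos_by_tag[tag].append(row)


# Made-up sample data.
add_video("alice", "Intro to CQL", ["cassandra", "cql"])
add_video("bob", "Scaling out", ["cassandra"])

# Each read touches exactly one partition; no join is needed.
alice_titles = [r["title"] for r in videos_by_user["alice"]]
cassandra_titles = [r["title"] for r in videos_by_tag["cassandra"]]
print(alice_titles)
print(cassandra_titles)
```

The cost of this design is paid on the write path (two or more writes per video), which matches the guideline that writes are cheap and reads are not.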
Data Modeling Methodology
For each query:
Identify a subset of the conceptual data model that describes the query data
Apply a suitable mapping pattern to the subset and the query
Use a Chebotko diagram to describe the result as a logical model
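As a rough illustration of the query-first step, the Python sketch below turns one query description into a CQL table definition. It does not implement the mapping patterns from the methodology; it assumes a deliberately simplified rule (columns the query filters on by equality become the partition key, columns used for ordering become clustering columns). The helper `create_table_cql`, the table name `employees_by_dept`, and the example query are all hypothetical.

```python
def create_table_cql(table, columns, partition_key, clustering=()):
    """Build a CQL CREATE TABLE statement from a query-driven table design.

    columns:       dict of column name -> CQL type
    partition_key: columns the query filters on with equality
    clustering:    columns that order rows inside a partition
    """
    for name in (*partition_key, *clustering):
        if name not in columns:
            raise ValueError(f"unknown key column: {name}")
    cols = ", ".join(f"{name} {cql_type}" for name, cql_type in columns.items())
    pk = "(" + ", ".join(partition_key) + ")"
    key = ", ".join([pk, *clustering])
    return f"CREATE TABLE {table} ({cols}, PRIMARY KEY ({key}));"


# Hypothetical query: "find all employees of a department, ordered by employee ID".
stmt = create_table_cql(
    "employees_by_dept",
    {"deptID": "int", "empID": "int", "first_name": "varchar", "last_name": "varchar"},
    partition_key=["deptID"],
    clustering=["empID"],
)
print(stmt)
```

A second query over the same entities (for example, lookup by employee ID) would get its own table from a second call, rather than an index or a join on this one.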
Products of Data Modeling
Physical Data model
Specific to Cassandra
CQL definitions
Physical Data Model
CQL CREATE statement
CREATE TABLE emp (
    empID int,
    deptID int,
    first_name varchar,
    last_name varchar,
    PRIMARY KEY (empID, deptID)
);
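In `PRIMARY KEY (empID, deptID)`, the first column (`empID`) is the partition key and the second (`deptID`) is a clustering column that orders rows within each partition. The Python sketch below mimics that layout in memory, assuming made-up sample rows and a hypothetical helper `layout`; it ignores hashing, replication, and the fact that Cassandra overwrites rows that share a full primary key.

```python
from collections import defaultdict

# Made-up sample rows for the emp table.
rows = [
    {"empID": 2, "deptID": 20, "first_name": "Asha", "last_name": "Nair"},
    {"empID": 1, "deptID": 30, "first_name": "John", "last_name": "Doe"},
    {"empID": 1, "deptID": 10, "first_name": "John", "last_name": "Doe"},
]


def layout(rows, partition_key, clustering_key):
    """Group rows by partition key, then sort each partition by the clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r[clustering_key])
    return dict(partitions)


table = layout(rows, "empID", "deptID")

# All rows for empID = 1 live in one partition, ordered by deptID.
dept_order = [r["deptID"] for r in table[1]]
print(dept_order)
```

Under this layout a query that fixes `empID` reads one partition, which is why the choice of the first primary key column should follow from the queries the table has to serve.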
RDBMS vs Cassandra
Cassandra is equally good for complex and simple data
All data required to answer a query must be contained in a single column family
Data modeling methodology is driven by queries and data
Data duplication is considered normal
Cassandra in Production
Netflix
Spotify
Twitter
References
http://academy.datastax.com
http://www.slideshare.net/nkorla1share/cass-summit-3
http://docs.datastax.com
http://planetcassandra.com