Presentation of Apache Cassandra


Apache Cassandra


Contents
  • Introduction to NoSQL systems, Extensible Record Stores and Amazon's Dynamo + Google Bigtable
  • What Cassandra is and how it compares with other similar systems
  • Which applications are better supported: examples, case studies
  • Technical description, architecture, internals
  • How it is used and installed, requirements and which platforms it runs on
  • Demo
  • References

1. Background
NoSQL, Extensible Record Stores, Cassandra's Parents

NoSQL Systems

NoSQL or Not-Only-SQL systems: Next Generation Databases. The initial movement started in 2009 with the goal of creating modern, web-scale DBs. There are currently more than 225 NoSQL systems. In general, they share the following features:

  • Schema-free databases
  • Easy replication support
  • Simple API
  • Distributed
  • Open Source
  • BASE (instead of ACID)
  • Huge amount of data
  • Horizontally scalable

BASE (Basically Available, Soft state, Eventual consistency)
ACID (Atomicity, Consistency, Isolation, Durability)

Extensible Record Stores (or Wide Column Stores)

  • Motivated by Google's Bigtable.
  • Basic data model: rows and columns.
  • Basic scalability model: rows and columns are split across nodes.
  • Rows: split across nodes through sharding on the primary key.
  • Columns: distributed over multiple nodes by using column groups.
  • Other systems that use this technology: Hypertable, HBase.

Source: noSql paper
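
The two splitting ideas on this slide can be sketched in a few lines of Python. This is an illustration only: the node names, the column groups and the use of a simple hash-modulo placement are assumptions made for the example, not how any particular store implements sharding.

```python
import hashlib

# Hypothetical cluster and column groups, chosen only for illustration.
NODES = ["node-a", "node-b", "node-c"]
COLUMN_GROUPS = {
    "profile": ["name", "email"],
    "activity": ["last_login", "page_views"],
}


def shard_for_key(primary_key: str, nodes=NODES) -> str:
    """Pick the node for a row by hashing its primary key (row sharding)."""
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def split_row(row: dict) -> dict:
    """Split one row into column groups that could be stored separately."""
    return {
        group: {col: row[col] for col in cols if col in row}
        for group, cols in COLUMN_GROUPS.items()
    }
```

Only the columns that actually have a value end up in a group, which matches the wide-column idea that rows need not share the same set of columns.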

Cassandra's Parents - Amazon Dynamo

What is it?
A highly available and scalable key-value store used by Amazon to store and retrieve user shopping carts and to support other core services. It pioneered the idea of eventual consistency.

How does it work?
  • Allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict resolution mechanisms.
  • Sacrifices consistency for availability.
  • Allows customization to meet the desired trade-off.
  • Techniques: Consistent Hashing, Vector Clocks (not in Cassandra), Gossip Protocol, Hinted Handoff, Read Repair

Eventual consistency: data fetched are not guaranteed to be up to date, but updates are guaranteed to be propagated to all nodes eventually.

Source: dynamo paper, cassandra paper, nosql paper
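
Consistent hashing, the first technique listed for Dynamo, can be illustrated with a minimal ring. The sketch below is a simplification under stated assumptions: one token per node, MD5 as the hash, and hypothetical names (`HashRing`, `replicas`); real systems add virtual nodes and topology-aware replica placement.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Each node owns one token; the ring is the sorted list of tokens.
        self._ring = sorted((_token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, n: int = 1):
        """Return up to n distinct nodes, walking clockwise from the key's token."""
        start = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
```

The property that makes this attractive is that removing a node only moves the keys that node owned; every other key keeps its owner.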

Cassandra's Parents - Google Bigtable

What is it?
A high-performance data storage system built on Google File System and other Google technologies.

How does it work?
  • Provides both structure and data distribution but relies on a distributed file system for durability.
  • Richer data model than Dynamo: one key, many values. Fast sequential access.
  • Columnar, SSTable storage, append-only, Memtable, Compaction
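
The memtable, SSTable and compaction terms fit together roughly as in the toy model below. It is a sketch under assumptions: everything lives in memory, the class name `TinyLSM` and the flush threshold are invented for the example, and there is no commit log, no tombstones and no on-disk format.

```python
class TinyLSM:
    """Toy append-only store: writes go to a memtable, which is flushed
    to immutable sorted tables ("SSTables") that are later compacted."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new sorted, immutable table.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all tables into one, letting newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

Existing tables are never modified in place; an update simply appears in a newer table and the older value is discarded at compaction time.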

Cassandra's Parents

What features does Cassandra use from Google's Bigtable?
  • Column Families
  • Memtables
  • SSTables

What features does Cassandra use from Amazon Dynamo?
  • Consistent hashing
  • Partitioning
  • Replication

Cassandra and Parents

2. Description and Comparisons
What Cassandra is and how it compares with other similar systems

Avinash Lakshman
Inventor, Apache Cassandra
Co-inventor, Amazon Dynamo

Prashant Malik
Inventor, Apache Cassandra
Technical Leader, Facebook

What is Cassandra?

Definition
A distributed NoSQL database system for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.

Timeline with activities
  • July 2008: Facebook released Cassandra as an open-source project
  • March 2009: Cassandra became an Apache Incubator project
  • 17th February 2010: Cassandra graduated to a top-level project
  • 2012: University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments"
  • 2010-2015: New releases of Cassandra

Strengths
  • Linear scale performance: the ability to add nodes without failures leads to predictable increases in performance.
  • Supports multiple languages: Python, C#/.NET, C++, Ruby, Java, Go, and many more.
  • Operational and developmental simplicity: there are no complex software tiers to be managed, so administration duties are greatly simplified.
  • Ability to deploy across data centres: Cassandra can be deployed across multiple, geographically dispersed data centres.

Strengths (1)
  • Cloud availability: installations in cloud environments.
  • Peer-to-peer architecture: Cassandra follows a peer-to-peer architecture instead of a master-slave architecture.
  • Flexible data model: supports modern data types with fast writes and reads.
  • Fault tolerance: nodes that fail can easily be restored or replaced.
  • High performance: Cassandra has demonstrated strong performance on large data sets.

Strengths (2)
  • ColumnFamily store: Cassandra stores columns based on their names, which makes slicing very quick.
  • Tunable consistency: support for strong or eventual data consistency across a widely distributed cluster.
  • Schema-free/schema-less: columns can be created at will within rows. The Cassandra data model is also known as a schema-optional data model.
  • AP-CAP: Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered more important than consistency.
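
Tunable consistency is usually reasoned about with the rule that reads see the latest write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N). The helper names below are hypothetical; the sketch only does the arithmetic and does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """Size of a quorum: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n
```

With a replication factor of 3, quorum writes and quorum reads (2 + 2 > 3) overlap on at least one replica, while writing and reading a single replica (1 + 1) gives only eventual consistency.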

CAP and Cassandra

Variable number of columns per row

Weaknesses
Use cases where it is better to avoid Cassandra:
  • If too many joins are required to retrieve the data
  • To store configuration data
  • During compaction, things slow down and throughput degrades
  • Basic things like aggregation operators are not supported
  • Range queries on the partition key are not supported
  • If there is transactional data that requires 100% consistency
  • Cassandra can update and delete data, but it is not designed to do so

Business Insider
"The basic problem Cassandra solved is that when you have a lot of data sitting on a lot of servers, as Facebook does, you end up with a house of cards. A single server going down can collapse the whole stack."

Cassandra compared to other NoSQL Systems

Read & Write latency for workload Read/Write

Throughput for workload Read/Write & Read/Scan/Write

Insert-mostly Workload

Mixed Operational & Analytical Workload

Read-Modify-Write Workload

Balanced Read/Write Mix

Read-mostly Workload

Load Process

VLDB Benchmark (RWS)

Differences between Cassandra and RDBMS

RDBMS | Cassandra
relational database | keyspace
b-trees | log-structured merge-trees
rows that lack a value for a particular column hold NULL in that position | for each row, only the columns with a value are stored
supports ACID transactions | only supports AID

ACID (Atomicity, Consistency, Isolation, Durability)

Stats provided by authors using Facebook data

3. Supported Applications - Customers - Case Studies

What kind of applications are supported by Cassandra
More than 80% of the clients fit into one of the following categories:
  • Product Catalog/Playlist
  • Recommendation/Personalization Engine
  • Sensor Data/Internet of Things
  • Messaging (and generally time-series data)
  • Fraud Detection

In other words, applications that need to:
  • handle time-series data (most common use case)
  • store and handle large volumes of data
  • scale predictably
  • be continuously available
  • protect their data

DataStax
  • A software company that develops and provides support for a commercial edition of Cassandra.
  • Massively scalable NoSQL platform able to run online applications for innovative and data-intensive companies (e.g. Netflix).
  • Faster to deploy and less expensive to maintain than other database platforms.
  • Powered by Cassandra and contains only selected releases of it, chosen by its expert staff.

Source: cloud_cassandra paper

DataStax (1)
  • Supports businesses that need progressive data management.
  • Can serve as a real-time datastore for online production.
  • Delivers a unique, smart data platform, suitable for the cloud.

Customers
Over 3,000 companies around the world use (or have used) Apache Cassandra in production.
Most famous:

Cassandra Summit
  • Organized by DataStax for 7 consecutive years (in both the US and Europe).
  • New product releases are announced.
  • Customers describe their usage of Cassandra.

Key Terms
  • Cluster
  • Distributed Location
  • Node


Category: Messaging

Facebook Inbox Search - Requirements
  • The system was required to handle a very high write throughput, billions of writes per day, and also scale with the number of users.
  • Since users are served from data centres that are geographically distributed, being able to replicate data across data centres was key to keep search latencies down.
Lakshman, Malik

Facebook Inbox Search
  • The reason why Cassandra was initially built.
  • Facebook maintains a per-user index of all messages that have been exchanged between the senders and the recipients of the message.
  • Two kinds of search features were enabled in 2008:
      • term search
      • interactions: given a person's name, returns all the messages that the user has sent to or received from that person

Facebook Inbox Search (1)
How did they do that?
  • The schema consists of two column families.
  • Exploits the time sorting feature of Cassandra.
  • For the term search:
      • Key: UserID
      • Super columns: the words that make up the message
      • Columns within each super column: the individual message identifiers (MessageID) of the messages that contain the word

Facebook Inbox Search (2)
  • For the interactions:
      • Key: UserID
      • Super columns: the recipients' IDs
      • Columns within each super column: MessageIDs
  • Cassandra provides certain hooks for intelligent caching of data
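
The two column families described above can be pictured as nested maps: key, then super column, then column. The sketch below models them with plain Python dictionaries. The function names, the whitespace tokenisation and the newest-first ordering are assumptions made for illustration; it is not Facebook's implementation.

```python
from collections import defaultdict

# Column family 1 (term search): UserID -> word -> {MessageID: timestamp}
term_index = defaultdict(lambda: defaultdict(dict))
# Column family 2 (interactions): UserID -> other user's ID -> {MessageID: timestamp}
interaction_index = defaultdict(lambda: defaultdict(dict))


def index_message(user_id, other_user_id, message_id, text, timestamp):
    """Record one message in both per-user indexes."""
    for word in set(text.lower().split()):
        term_index[user_id][word][message_id] = timestamp
    interaction_index[user_id][other_user_id][message_id] = timestamp


def _newest_first(columns):
    """Order message IDs by timestamp, most recent first (assumed ordering)."""
    return [m for m, _ in sorted(columns.items(), key=lambda kv: kv[1], reverse=True)]


def term_search(user_id, word):
    """All of a user's messages that contain the given word."""
    return _newest_first(term_index[user_id][word.lower()])


def interactions(user_id, other_user_id):
    """All messages exchanged between a user and another person."""
    return _newest_first(interaction_index[user_id][other_user_id])
```

Because every lookup starts from the UserID key, both searches touch only one user's row, which is what lets the index be partitioned by user.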

Inbox Search Schema

Adjusted from:

Facebook Inbox Search (3