Apache Cassandra: An Introduction

Posted: 12-Apr-2017 · Category: Technology

Transcript of Apache Cassandra: An Introduction

  • Apache Cassandra: A Brief History. A dive into the Dynamo whitepaper.

  • About me: @Shehaaz

    I love hacking on wearable/IoT devices.

  • Topics today: History and Dynamo; time-series data modeling; an example app

  • History: Peer-to-Peer (all nodes are EQUAL)

    Centralized peer-to-peer networks: each node connects to a directory server.
    e.g. Napster

    Unstructured networks: nodes connect to each other at random.
    e.g. Kazaa, Gossip

    Structured networks: nodes are organized into a specific topology (consistent hashing).
    e.g. the Cassandra ring

  • Napster: Centralized P2P

  • Road to Cassandra

    1999: Napster and other questionable P2P services
    2006: Google Bigtable. C* has a similar data storage model.
    2007: Amazon Dynamo (Avinash Lakshman). C* has a similar architecture.
    2008: Facebook open-sourced C* (Avinash Lakshman)

  • CAP Theorem

    Consistency: all nodes see the same data at the same time.
    Availability: a guarantee that every request receives a response about whether it succeeded or failed.
    Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

    e.g. increasing availability (raising the replication factor) reduces consistency. You can only have two of the three!

  • Dynamo. The motivation: you must ALWAYS be able to add to your shopping cart! (High availability)

    Conflict resolution is done in the application: merge conflicting shopping carts.
    Primary-key access to the data store (vs. RDBMS limitations),
    e.g. best-seller lists, customer preferences, etc.

  • Dynamo Architecture. Key principles:

    1. Incremental scalability: add nodes without disrupting the system
    2. Symmetry: every node has the same responsibilities
    3. Decentralization: peer-to-peer over centralized control
    4. Heterogeneity: work distribution must be proportional to the capabilities of the individual servers

  • Distributed Hash Table. Data organization: a distributed hash table (DHT) using consistent hashing. The keys are mapped to form a ring: the output range of the hash function is treated as a fixed circular ring (i.e. the largest hash value wraps around to the smallest).

  • Inserting data, high level: Hash(RowKey) = 4500

    Walk the ring clockwise and insert at Node 5
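The clockwise ring walk above can be sketched in plain Python. This is a toy model, not Cassandra's implementation: the node names and positions are assumptions chosen so the slide's Hash(RowKey) = 4500 example lands on Node 5, and MD5 modulo 10,000 stands in for Cassandra's Murmur3 partitioner.

```python
import bisect
import hashlib

# Toy ring: node name -> position on a 0..9999 ring (illustrative values).
RING = {"node1": 1000, "node2": 2500, "node3": 4000, "node5": 5000, "node4": 8000}

def ring_position(row_key: str, ring_size: int = 10_000) -> int:
    """Hash a row key onto the ring (MD5 here; Cassandra uses Murmur3)."""
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return int(digest, 16) % ring_size

def owner(position: int, ring: dict) -> str:
    """Walk clockwise: the first node at or after `position` owns the key."""
    positions = sorted(ring.values())
    nodes = {pos: name for name, pos in ring.items()}
    idx = bisect.bisect_left(positions, position)
    # Past the largest position, ownership wraps around to the smallest.
    return nodes[positions[idx % len(positions)]]

print(owner(4500, RING))  # the slide's example: 4500 walks clockwise to node5
```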

  • Row-level hashing?

    Patient ID (partition key) | Events (event time is the clustering column)
    1 | T:22:00:02, HR:71 | T:22:00:01, HR:72
    2 | T:22:00:05, HR:90 | T:22:00:02, HR:95

  • Dynamo Architecture. Consistent hashing advantage:

    The departure or arrival of a node only affects its immediate neighbors: each node is responsible for the range of keys between its predecessor on the ring and itself.

    Only K/N keys need to be remapped when a node drops (K = #keys, N = #nodes).

    Disadvantage: ?

  • Dynamo Architecture

  • Dynamo Architecture. Consistent hashing disadvantage?

  • Dynamo Architecture. Consistent hashing disadvantage:

    Random node-position assignment leads to non-uniform data and load distribution.
    Some nodes may simply be less capable than others.

  • Disadvantage Diagram

  • Virtual nodes to the rescue! Instead of mapping a node to a single point on the ring, each node is assigned to multiple locations on the ring. (What does that mean?)

    Virtual nodes!

  • Virtual Nodes: a three-node cluster with zero v-nodes

    p = position

  • Virtual Nodes

    V-nodes look like nodes in the system; a regular node can be responsible for more than one v-node.
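The idea can be demonstrated with a small sketch (an assumption-laden toy, not Cassandra's token allocator): give each physical node many tokens on the ring and compare how evenly keys spread with 1 token per node versus 64.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 10_000

def vnode_positions(node: str, num_tokens: int) -> list:
    """Give one physical node several positions ("tokens") on the ring."""
    return [int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16) % RING_SIZE
            for i in range(num_tokens)]

def build_ring(nodes, num_tokens):
    """Map ring position -> owning physical node."""
    return {pos: node for node in nodes for pos in vnode_positions(node, num_tokens)}

def owner(ring, key_pos):
    """First token at or after the key, clockwise, wrapping around."""
    positions = sorted(ring)
    idx = bisect.bisect_left(positions, key_pos) % len(positions)
    return ring[positions[idx]]

# Compare load spread: 1 token per node vs. 64 v-node tokens per node.
for num_tokens in (1, 64):
    ring = build_ring(["n1", "n2", "n3"], num_tokens)
    load = Counter(owner(ring, k) for k in range(RING_SIZE))
    print(num_tokens, dict(load))
```

With a single token per node the ownership ranges are whatever the hash happens to produce; with 64 tokens each, every node owns many small slices and the totals come out much closer to even, which is the balancing effect the slides describe.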

  • Virtual Nodes: adding a new node

    This evenly balances the data in the cluster: server #4 will receive data from all of the other servers.
    How? Server #4's v-nodes sit next to v-nodes of servers #1, #2 and #3.

  • V-Nodes: removing a node. When a node goes down, its data is evenly redistributed.

    When #1 went down, #2 and #3 took over its data.
    If we didn't have virtual nodes, #2 alone would have been overloaded.

  • Replication. Why? To achieve high availability.

    e.g. replication factor 3: Hash(KEY1) = 500. Node #1 is the coordinator node for values 0 to 999; its job is to replicate the write to TWO other nodes. (In modern C*, this is the job of whichever node received the write.)

  • Replication: server #1 copies the data to TWO other nodes clockwise to satisfy replication factor 3.

    If #1 goes down, #2 will make sure RF = 3 is maintained.
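The "copy to the next two nodes clockwise" rule above corresponds to Cassandra's SimpleStrategy and can be sketched as follows. The server names and ring positions are invented for illustration; real Cassandra also skips replicas by rack/datacenter under NetworkTopologyStrategy.

```python
import bisect

# Toy ring positions (illustrative values only).
RING = {1000: "server1", 4000: "server2", 7000: "server3", 9000: "server4"}

def replicas(key_pos: int, ring: dict, rf: int = 3) -> list:
    """First node clockwise from the key holds the primary copy; keep walking
    clockwise, collecting distinct nodes, until rf replicas are chosen.
    Assumes rf <= number of distinct physical nodes."""
    positions = sorted(ring)
    idx = bisect.bisect_left(positions, key_pos)
    result = []
    while len(result) < rf:
        node = ring[positions[idx % len(positions)]]
        if node not in result:   # skip duplicates (a node's other v-nodes)
            result.append(node)
        idx += 1
    return result

print(replicas(500, RING))  # ['server1', 'server2', 'server3']
```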

  • Example application: a patient in critical care needs a vital-sign dashboard.

    An Arduino-based heart-rate and SpO2 measuring device.
    Pretty graphs, and insight gained from the data.

  • Arduino + e-Health PCB

  • System Diagram

  • Setup GCloud C* cluster

  • Requirements.txt
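The transcript dropped this slide's file contents. As a hedged sketch, a Python client app for this demo would at minimum need the DataStax driver (the `cassandra-driver` package name is real; anything else the speaker actually listed is unknown):

```
cassandra-driver
```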

  • Example Code

  • Create Tables
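The slide's CQL was lost in the transcript. A hypothetical reconstruction of the naive first-cut schema, chosen to exhibit exactly the two problems the next slides call out (one unbounded partition per patient; all column names and types here are assumptions):

```sql
-- Naive sketch: ALL of a patient's readings land in one partition.
CREATE TABLE heart_rate (
    patient_id int,
    event_time timestamp,
    heart_rate int,
    PRIMARY KEY (patient_id, event_time)
);
```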

  • What's wrong? 1. We will eventually run out of columns: Cassandra allows 2 billion columns per row.

    At one reading per second, 2,000,000,000 seconds ≈ 63.3 years.

  • What's wrong? 2. Hashing on the row key alone will create a hotspot in the cluster. (Remember row-level hashing?)

  • Data modeling in C*

    Time Series data modeling.

  • Create Tables

    a.k.a. a compound row key
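Again the slide's CQL did not survive the transcript. A hedged sketch of the compound-row-key version (names and types assumed): bucketing each patient's data by date bounds the partition size and spreads one patient's history across nodes.

```sql
-- Compound partition key: (patient_id, date) together pick the partition,
-- so a partition holds at most one day of readings per patient.
CREATE TABLE heart_rate (
    patient_id int,
    date text,               -- e.g. '2015-02-17'
    event_time timestamp,
    heart_rate int,
    PRIMARY KEY ((patient_id, date), event_time)
);
```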

  • Table

    (Patient ID, date) (partition key) | Events (event time is the clustering column)
    1, 2015-02-17 | T:22:00:01, HR:71 | T:22:00:00, HR:72
    2, 2015-02-17 | T:22:00:05, HR:90 | T:22:00:02, HR:95

    Data is SORTED and stored sequentially on disk.

  • Insert Data
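The insert statements from this slide were lost; a hedged sketch against a compound-key table of the shape described above (column names are assumptions):

```sql
INSERT INTO heart_rate (patient_id, date, event_time, heart_rate)
VALUES (1, '2015-02-17', '2015-02-17 22:00:01', 71);
```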

  • Query Data
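The query from this slide was also lost; a hedged sketch that reads one patient's readings for one day, newest first, which is the access pattern a time-series model like this is built for:

```sql
SELECT event_time, heart_rate
FROM heart_rate
WHERE patient_id = 1 AND date = '2015-02-17'
ORDER BY event_time DESC;
```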

  • Demo Video

    http://www.youtube.com/watch?v=SDatAH0Gpgs

  • Resources

    Amazon Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

    Cassandra High Availability by Robbie Strickland: http://www.amazon.com/gp/product/1783989122/ref=cm_cr_ryp_prd_ttl_sol_0
