Cassandra at Glogster


  • Cassandra at Glogster

    Roman Komkov

    System Engineer at Glogster

    Prague Cassandra Meet up 03.09.2015

  • About me

    2 years at Glogster EDU as System Engineer

    5+ years of Linux administration

    5+ years of Python development

    Cluster, HA, Orchestration

    CI, CD

    Twitter - @alkoengineering

    GitHub, Freenode - decayofmind

  • About Glogster EDU

    Started in 2009

    Platform for presentation and interactive learning mainly used by educators and students

    19 million users

    Over 45 million glogs

    40000 new glogs daily

    Web service, mobile applications

  • Cassandra at Glogster

    From 2011 as primary DB for initial

    From 2012 as backend (storage) DB for Glogster EDU

    Started from 0.6 or 0.8, I guess

    10 nodes

    RF=5, QUORUM

    SATA disks

    OrderPreservingPartitioner ¯\_(ツ)_/¯

  • Architecture

  • Cassandra now

    5-node cluster

    ~600 GB average node size

    RF=5, QUORUM

    SSD disks



    pycassa + datastax-driver
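
The RF=5, QUORUM setting above comes down to simple arithmetic, sketched here in Python. The function names are illustrative and not part of any Cassandra client. With 5 nodes and RF=5 every node holds a full copy of the data, and a QUORUM request needs a majority of the replicas to answer.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must answer a QUORUM request: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    rf = 5  # value from the slides
    print(f"RF={rf}: QUORUM needs {quorum(rf)} replicas, "
          f"tolerates {tolerated_replica_failures(rf)} down")
```

So with RF=5 a QUORUM read or write needs 3 replicas and keeps working with 2 replicas down.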

  • 0.8 problems

    Migration with downtime by transferring a copy of data

    HintedHandoff hell

    No repairs, no cleanups

    Enormous HeapSize (20GB)

    Different time on servers


    Upgrade to 1.0
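
Why "different time on servers" hurts: Cassandra resolves conflicting writes to the same cell by timestamp, and the highest timestamp wins. The following is a minimal sketch of that rule, not Cassandra code. The `Write` class, the helper names, and the 30-second skew are assumptions made up for illustration, and which machine assigns the timestamp depends on the client and version in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # timestamp attached to the write, in microseconds


def resolve(a: Write, b: Write) -> Write:
    """Last-write-wins: the write with the higher timestamp survives.

    Tie-breaking is simplified here: on equal timestamps `a` is kept.
    """
    return a if a.timestamp_us >= b.timestamp_us else b


def stamp(real_time_s: float, clock_offset_s: float) -> int:
    """Timestamp a machine would assign, given how far its clock is off."""
    return int((real_time_s + clock_offset_s) * 1_000_000)


# Hypothetical scenario: the second machine's clock runs 30 seconds behind.
first = Write("old", stamp(real_time_s=1000.0, clock_offset_s=0.0))
second = Write("new", stamp(real_time_s=1005.0, clock_offset_s=-30.0))

winner = resolve(first, second)
print(winner.value)  # the earlier write wins because its timestamp is higher
```

The write that happened later in real time is silently discarded, which is why clocks should be kept in sync on every machine that stamps writes.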

  • 1.1 problems

    Cassandra guy left Glogster

    Don't touch it while it works


    Load averages like 14.0-16.0

    2 disks failed

    Everything is slow

    Repairs? Never heard of them!

  • 1.1 solutions

    Replace disks, rebuild nodes. Don't try to run repair on a new node instead of ReplaceToken

    Move old keyspace to another cluster. Load gone

    Nodes are fast again

    Regular repairs and cleanups? Never did!

    OpsCenter installed

    Cluster upgraded to 1.2

  • 1.2 and migration

    Cluster migrated to the new servers without downtime


  • Old datacenter, connected to production, was disconnected from the new datacenter

    Forgot about Hints TTL (max_hint_window_in_ms ~ 3 hours)

    Forgot to run repair on cluster after

    Old DC was decommissioned

    Application switched to the new one
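
The hint window matters because hints for an unreachable node are only collected for `max_hint_window_in_ms`; writes made after that are not hinted and have to be brought over by repair. A rough sketch of the check, assuming the 3-hour value from the slide. The function name and the 12-hour outage are hypothetical; the slides do not say how long the datacenters were disconnected.

```python
HOUR_MS = 60 * 60 * 1000
MAX_HINT_WINDOW_MS = 3 * HOUR_MS  # ~3 hours, as on the slide


def hints_cover_outage(outage_ms: int,
                       max_hint_window_ms: int = MAX_HINT_WINDOW_MS) -> bool:
    """True if the whole outage falls inside the hint window."""
    return outage_ms <= max_hint_window_ms


if __name__ == "__main__":
    outage_ms = 12 * HOUR_MS  # hypothetical outage length
    if hints_cover_outage(outage_ms):
        print("hints can cover the outage")
    else:
        print("outage exceeds the hint window: run repair before decommissioning")
```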


  • Here the hell begins

    ~ 1200 glogs remain on old decommissioned datacenter

    Thank God, we have RF=

    Transfer data from one old node to the new server

    Run Cassandra on it, add node to the cluster

    Run repair on entire cluster

    Increase repair chance with read_repair_chance

    Peacefully wait until done

    Do your complicated repairs through OpsCenter, because it can resume if a repair fails.
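
Raising `read_repair_chance` is a per-table setting in the Cassandra versions discussed here (the option was removed in Cassandra 4.0). A small sketch that builds the CQL3 statement; the keyspace and table names are placeholders, not the real Glogster schema, and the helper itself is hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def alter_read_repair_chance(keyspace: str, table: str, chance: float) -> str:
    """Build an ALTER TABLE statement that sets read_repair_chance."""
    if not 0.0 <= chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0.0 and 1.0")
    for name in (keyspace, table):
        # Only plain unquoted identifiers are accepted in this sketch.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return f"ALTER TABLE {keyspace}.{table} WITH read_repair_chance = {chance};"


if __name__ == "__main__":
    # "glogster" and "glogs" are placeholder names.
    print(alter_read_repair_chance("glogster", "glogs", 0.5))
```

The resulting string can be run in cqlsh or passed to `session.execute()` of the DataStax Python driver. Read repair only touches rows that are actually read, so it complements a full repair rather than replacing it.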

  • Full repair?

  • 10 DAYS!!!

  • Conclusions and Improvements

    Increase max_hint_window_in_ms value to something like 3 days

    Make use of parallel things

    CQL3 and datastax-driver

    Upgrade to Cassandra 2.2: faster repairs and other operations

    New OpsCenter

    Schedule regular backups and repairs

    We still love Cassandra!
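
For the first conclusion, `max_hint_window_in_ms` in cassandra.yaml is given in milliseconds, so "something like 3 days" has to be converted. A minimal sketch of that conversion; the helper names are made up for illustration.

```python
def hint_window_ms(days: float) -> int:
    """Convert a hint window given in days to milliseconds."""
    return int(days * 24 * 60 * 60 * 1000)


def yaml_line(days: float) -> str:
    """Render the corresponding cassandra.yaml line."""
    return f"max_hint_window_in_ms: {hint_window_ms(days)}"


if __name__ == "__main__":
    print(yaml_line(3))
```

For 3 days this gives `max_hint_window_in_ms: 259200000`. A longer window also means more hints stored on the coordinating nodes while a node is down, so disk space for hints should be watched.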

  • Questions?