Cassandra at Glogster

Roman Komkov, [email protected]
System Engineer at Glogster
Prague Cassandra Meetup, 03.09.2015


Page 2: Cassandra at Glogster

About me

  2 years at Glogster EDU as System Engineer

  5+ years of Linux administration

  5+ years of Python development

  Cluster, HA, Orchestration

  CI, CD…

  Twitter - @alkoengineering

GitHub, Freenode - decayofmind

Page 3: Cassandra at Glogster

About Glogster EDU

  Started in 2009

  Platform for presentation and interactive learning mainly used by educators and students

  19 million users

  Over 45 million glogs

  40000 new glogs daily

  Web service, mobile applications

  http://edu.glogster.com

Page 4: Cassandra at Glogster

Cassandra at Glogster

  Since 2011: primary DB for the original Glogster.com

  Since 2012: backend (storage) DB for Glogster EDU

  Started on version 0.6… or 0.8, I guess…

  10 nodes

  RF=5, QUORUM

  SATA disks

OrderPreservingPartitioner ¯\_(ツ)_/¯
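
What RF=5 with QUORUM means in practice can be worked out with a little arithmetic. The sketch below is only an illustration; the function names are made up for this example and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


rf = 5  # replication factor from the slide
print(quorum(rf))              # 3
print(tolerated_failures(rf))  # 2
# A QUORUM read always overlaps a QUORUM write, because 3 + 3 > 5.
print(quorum(rf) + quorum(rf) > rf)  # True
```

So with RF=5, any two replicas of a row can be unavailable and QUORUM reads and writes keep working.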

Page 5: Cassandra at Glogster

Architecture

Page 6: Cassandra at Glogster

Cassandra now

  5-node cluster

  ~600 GB of data per node on average

  RF=5, QUORUM

  SSD disks

VNodes

OrderPreservingPartitioner…

pycassa + datastax-driver
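
Why OrderPreservingPartitioner earns a shrug: it places rows by the sort order of their keys, so keys that share a prefix end up on the same node. The sketch below is a deliberately simplified model, not Cassandra's real token logic. The key format `glog:NNNNNN`, the split points, and the modulo placement for the hashed case are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import List

NODES = 5  # cluster size from the slide

# Hypothetical split points: node 0 owns keys up to "e", node 1 up to "j", ...
BOUNDARIES: List[str] = ["e", "j", "o", "t"]


def order_preserving_node(key: str) -> int:
    """Place a key by its raw sort order (simplified order-preserving model)."""
    return bisect.bisect_left(BOUNDARIES, key)


def hashed_node(key: str) -> int:
    """Place a key by its MD5 hash (simplified stand-in for hash partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


keys = ["glog:%06d" % i for i in range(1000)]  # hypothetical sequential keys

ordered_nodes = {order_preserving_node(k) for k in keys}
hashed_nodes = {hashed_node(k) for k in keys}

print(sorted(ordered_nodes))  # every key lands on the same node
print(sorted(hashed_nodes))   # keys are spread over several nodes
```

In this model all thousand keys sort between "e" and "j", so a single node receives every one of them, while hashing spreads the same keys across the cluster.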

Page 8: Cassandra at Glogster

0.8 problems

  Migration with downtime by transferring a copy of data

HintedHandoff hell

  No repairs, no cleanups

  Enormous heap size (20 GB)

  Clocks out of sync across servers

SOLUTION!

  Upgrade to 1.0
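
Why unsynchronized clocks hurt: Cassandra resolves conflicting writes to the same cell by timestamp, and the newest timestamp wins. If the machines stamping the writes disagree about the time, an older write can beat a newer one. The sketch below is a minimal model of that rule. The `Write` class, the 5-second skew, and the values are invented for illustration, and tie-breaking is simplified.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_us: int  # microseconds, taken from the writing machine's clock


def resolve(a: Write, b: Write) -> Write:
    """Last write wins: keep the write with the higher timestamp.

    Ties simply go to `a` here; this sketch does not model real tie-breaking.
    """
    return a if a.timestamp_us >= b.timestamp_us else b


def stamped(value: str, real_time_us: int, clock_offset_us: int) -> Write:
    """Stamp a write with a clock that is off by clock_offset_us."""
    return Write(value, real_time_us + clock_offset_us)


# Hypothetical scenario: server A's clock runs 5 s fast, server B's is accurate.
first = stamped("old title", real_time_us=1_000_000, clock_offset_us=5_000_000)
second = stamped("new title", real_time_us=2_000_000, clock_offset_us=0)

winner = resolve(first, second)
print(winner.value)  # old title
```

The update made one second later loses, because the earlier write carries a timestamp from the fast clock.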

Page 9: Cassandra at Glogster

1.1 problems

  Cassandra guy left Glogster

  Don’t touch it while it works

BUT…

  Load averages like 14.0-16.0

  2 disks failed

  Everything is slow

  Repairs? Never heard of them!

Page 10: Cassandra at Glogster

1.1 solutions

  Replace disks, rebuild nodes

    Don’t try to run repair on a new node instead of using ReplaceToken

  Move the old Glogster.com keyspace to another cluster

    Load gone

https://glogster.github.io/posts/2015/03/23/cassandra-migration.html

  Nodes are fast again

  Regular repairs and cleanups? Never did!

OpsCenter installed

  Cluster upgraded to 1.2

Page 11: Cassandra at Glogster

1.2 and migration

  Cluster migrated to the new servers without downtime

http://www.planetcassandra.org/blog/cassandra-migration-to-ec2/

Vnodes

Page 13: Cassandra at Glogster

  The old datacenter, still connected to production, was disconnected from the new datacenter

  Forgot about the hint window (max_hint_window_in_ms ~ 3 hours)

  Forgot to run repair on the cluster afterwards

  Old DC was decommissioned

  Application switched to the new one

  …

DATA GONE
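
The role of max_hint_window_in_ms in this story can be sketched as follows: hints for an unreachable replica are only stored while it has been down for less than the window, so writes arriving later never reach it unless a repair is run. This is a rough model under stated assumptions: a steady write rate, and a hypothetical 24-hour disconnect (the slides do not say how long the datacenters were apart).

```python
HOUR_MS = 60 * 60 * 1000
MAX_HINT_WINDOW_MS = 3 * HOUR_MS  # ~3 hours, as on the slide


def hinted(write_offset_ms: int, window_ms: int = MAX_HINT_WINDOW_MS) -> bool:
    """True if a write made this long after the replica went down gets a hint."""
    return write_offset_ms <= window_ms


def unhinted_share(outage_ms: int, window_ms: int = MAX_HINT_WINDOW_MS) -> float:
    """Fraction of the outage during which writes were NOT stored as hints,
    assuming writes arrive at a steady rate."""
    if outage_ms <= window_ms:
        return 0.0
    return (outage_ms - window_ms) / outage_ms


outage = 24 * HOUR_MS  # hypothetical outage length

print(hinted(2 * HOUR_MS))    # True
print(hinted(10 * HOUR_MS))   # False
print(unhinted_share(outage)) # 0.875
# With a 3-day window the same outage would be fully covered by hints.
print(unhinted_share(outage, window_ms=72 * HOUR_MS))  # 0.0
```

Under these assumptions, most of the writes made during the disconnect exist only on the side that received them, which is why skipping the repair before decommissioning lost data.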

Page 14: Cassandra at Glogster

Here the hell begins

  ~1200 glogs remained only on the old decommissioned datacenter

  Thank God, we have RF=<number of nodes>

  Transfer data from one old node to the new server

  Run Cassandra on it, add node to the cluster

  Run repair on entire cluster

  Increase the chance of read repair with read_repair_chance

  Peacefully wait until done…

  Run complicated repairs through OpsCenter, because it can resume after a failure.
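
How much raising read_repair_chance helps can be estimated with simple probability. If each read of a row triggers a read repair with probability p, then after n reads the chance that at least one of them did is 1 - (1 - p)^n. The sketch below assumes independent reads; the values 0.1 and 1.0 are examples only, not the settings used at Glogster.

```python
def repaired_probability(chance: float, reads: int) -> float:
    """Probability that at least one of `reads` reads triggered a read repair,
    assuming each read triggers one independently with probability `chance`."""
    if not 0.0 <= chance <= 1.0:
        raise ValueError("chance must be between 0 and 1")
    return 1.0 - (1.0 - chance) ** reads


print(round(repaired_probability(0.1, 10), 3))  # 0.651
print(repaired_probability(1.0, 1))             # 1.0
print(repaired_probability(0.1, 0))             # 0.0
```

Rows that are never read are never fixed this way, which is why the full cluster repair is still part of the procedure.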

Page 15: Cassandra at Glogster

Full repair?

Page 17: Cassandra at Glogster

10 DAYS!!!

Page 18: Cassandra at Glogster

Conclusions and Improvements

  Increase max_hint_window_in_ms value to something like 3 days

  Make use of parallel things

  CQL3 and datastax-driver

  Upgrade to Cassandra 2.2

    Faster repairs and other operations

    New OpsCenter

  Schedule regular backups and repairs

  We still love Cassandra!

Page 19: Cassandra at Glogster

Questions?