DataStax: Dockerizing Cassandra on Modern Linux

Myself & Instaclustr

• Adam Zegelin: Founding Software Engineer & Co-founder of Instaclustr · @zegelin

• Managed DataStax Enterprise and Apache Cassandra in the ☁ (AWS, Azure, SoftLayer)

• Self-service dashboard: create, manage & monitor clusters
• 24/7/365 support, on-call engineers, uptime guarantee
• Focus on developing your awesome apps; we handle the Cassandra

• Grew from a need for Cassandra in a project

2© 2015. All Rights Reserved.

Nodes — Software Stack

• CoreOS: lightweight OS
• Docker: containerisation of everything
• systemd: service management
• journald: logging
• D-Bus: controlling systemd from Java from inside containers

Initial Implementation

• Amazon Web Services only
• Custom Ubuntu AMI (Amazon Machine Image)
  • Based on stock Ubuntu AMI
  • 2 AMIs (PV/HVM) × 9 regions = 18 images per version! (became unmaintainable very quickly)
• Custom cloud-init scripts: RAID disks, fetch config, etc.
• Cassandra installed with apt-get install cassandra / dse

Initial Implementation — AWS

• We selected instance-storage-backed AWS instances
• Instance storage is fast (SSDs) and low latency (local disk) but is volatile: terminate the instance and all your data is gone
• The alternative, EBS (Elastic Block Store), is basically a SAN: slower, higher latency, and originally shared the instance's network bandwidth
  • The newer c4.x and m4.x instances are “EBS optimised” and don’t share these limitations
• Only way to change AMI is to start a new machine
• Not possible to use immutable images with persistent ephemeral data
• Only feasible solution for updates is apt-get install

CoreOS

• One of the first “Docker Operating Systems”
• Available on every provider we support: AWS, Azure, SoftLayer
• CoreOS has pre-built images
• Small and minimalist: not much userland (not even man!)
• Other useful software: etcd, fleet, etc. (we currently don’t use them, but maybe in the future)
• In use by some big players (Rackspace, PlayStation, Instaclustr 😀)
• Recent funding from Google Ventures

Docker

• Container runtime + standardised image distribution & hosting + ecosystem
• Private image hosting options are available
• Immutable images. Yay! 🎉
• Images running in dev, test and production environments are identical
• Software installs, upgrades and uninstalls are clean
• Components are isolated: potentially conflicting components (different library versions, JVM versions, etc.) can co-exist
  • Even different userland layouts (Ubuntu, Debian, CentOS, etc.)

• We containerise everything — C*, internal services, node management and monitoring apps

• Single, well understood, image build and deploy process — docker build & docker push

• Executed via Makefiles — one Make target per image — make push-all builds and pushes everything

• Helps that all our internal apps are Java-based too

• Docker gives us immutable images for our components without instance replacement

• CoreOS handles the rest (OS-level) via in-place updates

• Docker is provider agnostic
• CoreOS runs on all major cloud providers and bare-metal
• The result ☞ Instaclustr-managed C* can run anywhere

systemd

• CoreOS uses systemd for service management
• systemd supports inter-service dependencies
  • e.g. cassandra-backups.service “wants” cassandra.service
  • i.e. cassandra-backups should only run while cassandra is running (strictly enforcing that takes Requires or BindsTo; Wants is the weaker form)
• systemd can automatically restart services
  • Instaclustr services are fail-fast
  • Cassandra not so much, in some cases (watchdog?)

systemd cont’d

• Manages units of different types: service, timer, target, etc.
  • service units manage processes
  • timers start services on a schedule (à la cron)
  • targets are for grouping/sync points
    • e.g. a target “wants” cassandra.service, monitoring.service, datastax-agent.service, backups.timer, etc.
• All units can define dependencies and conflicts
  • Dependencies of different “strengths”: Wants vs. Requires
  • In both directions: Requires and RequiredBy

Basic Integration

• Cassandra runs as PID 1 in the container
  • 1 primary process per container model
  • Runs in foreground mode (-f)
  • Responds to SIGTERM via docker stop, systemctl stop, etc.
• Cassandra data and configuration are persistent on the host
  • Survives container restart
  • Cassandra data and configuration directories mounted from host

docker run -v /var/lib/instaclustr/etc/cassandra:/etc/cassandra …

Basic Integration cont’d

• Docker containers managed via systemd
  • cassandra.service execs docker run cassandra …
  • systemctl [start|stop|restart|status|…] cassandra
• Cassandra logging configured to write only to stdout
  • systemd logging best practice
  • Cassandra ⇢ Docker ⇢ systemd ⇢ journald
  • journalctl -u cassandra

Basic Integration — Issues

• systemd starts dependent units when state is active
  • process running = service active, unless configured otherwise
  • ∴ dependent units start immediately
• process can hang but service stays active

Cassandra Startup

• JVM starts quickly
• JMX (nodetool) connectivity is available early
  • Objects are exposed where they are constructed
• CQL/Thrift available late
  • Can be toggled via cassandra.yaml or JMX/nodetool
• When is Cassandra “running”?
  • When does cassandra.service transition from activating to active?
  • When do dependent services start?
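One rough way to make "running" concrete is to treat it as "the CQL port accepts TCP connections". The sketch below is only an illustration under that assumption: the class and method names (CqlProbe, isAccepting, awaitAccepting) are hypothetical and not from the talk, and a successful TCP connect shows only that the port is listening, not that a CQL handshake would succeed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: equates "CQL port accepts TCP connections" with "Cassandra is running". */
final class CqlProbe {
    private CqlProbe() {}

    /** True if a TCP connection to host:port succeeds within timeoutMillis (must be > 0). */
    static boolean isAccepting(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable or timed out
        }
    }

    /** Polls isAccepting until it succeeds or roughly maxWaitMillis has elapsed. */
    static boolean awaitAccepting(String host, int port, long maxWaitMillis, long pollMillis) {
        final long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (true) {
            if (isAccepting(host, port, 1000)) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
                return false;
            }
        }
    }
}
```

For example, CqlProbe.awaitAccepting("127.0.0.1", 9042, 120_000, 500) would wait up to about two minutes for Cassandra's default native transport port. Polling from outside is the simplest option, but it only approximates the internal life-cycle of the CQL service.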

D-Bus

• RPC between processes
• Notifications
• Socket-based (typically UNIX sockets, but can be TCP)
• Accessible inside a container: mount the socket
  docker run -v /run/dbus:/run/dbus -v /run/systemd:/run/systemd …
• Multiple language bindings, including Java

D-Bus cont’d

• systemd is controllable via D-Bus
• Control host systemd from inside a Docker container
• No need to fork/exec to run systemctl and co. (in fact, systemctl is a wrapper around D-Bus calls)

D-Bus cont’d

• Java bindings: dbus-java

systemctl restart cassandra ≝ systemdManager.RestartUnit("cassandra.service", "replace");

Enhanced Integration

• Service status = “active”: process running, or something more?
  • Cassandra java process running vs. C* accepting CQL connections
• CQL clients are dependencies, but shouldn’t start until CQL is available
• Clients could fail-fast on no connectivity
  • Will be automatically restarted
  • Service will oscillate between active and failed, making actual failures hard to detect
  • systemd will eventually time out or give up (configurable)
  • JVM startup can be expensive: CPU usage spikes

Enhanced Integration cont’d

• systemd targets for CQL & Thrift
  • Life-cycle tracks the internal C* service
  • i.e. starts when CQL is available, not immediately
  • nodetool disablebinary implies systemctl stop of the CQL target
• Services that require CQL connectivity declare themselves wanted by the target (the inverse of Wants)
  • Starting the target starts these services too

Enhanced Integration cont’d

• Java Agent side-loaded into Cassandra JVM
  • Hooks into CQL/Thrift service life-cycle
  • Implemented using runtime byte-code modification
  • Controls systemd via D-Bus to start/stop associated target units
• But Cassandra is open-source, so why not modify it‽
  • Agents work with DSE & Apache Cassandra

Java Agent

• Java Agents (java.lang.instrument)
  • java -javaagent:instaclustr-agent.jar …
  • premain(…) method called at JVM startup
  • can hook into JVM class-loading, transform byte-code, etc.
• Javassist, ASM: byte-code modification libraries
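For -javaagent to find premain(…), the agent jar's manifest has to name the agent class in a Premain-Class attribute. The sketch below shows what such a manifest could contain. AgentManifest is a hypothetical helper, and using com.instaclustr.Agent as the premain class is an assumption based on the class name that appears in the deck's hook code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical helper that builds the manifest an agent jar needs. */
final class AgentManifest {
    private AgentManifest() {}

    /** Builds a manifest whose Premain-Class names the class containing premain(...). */
    static Manifest forAgent(String premainClass) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be present, otherwise the main section is not written out
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(new Attributes.Name("Premain-Class"), premainClass);
        return manifest;
    }

    /** Renders the manifest as the text that would end up in META-INF/MANIFEST.MF. */
    static String render(Manifest manifest) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling AgentManifest.render(AgentManifest.forAgent("com.instaclustr.Agent")) yields the two header lines a build would place in the agent jar. In practice the build tool usually writes these attributes rather than hand-written code.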

Hooks

    public interface Server {
        public void start();
        public void stop();
    }

    // in CassandraDaemon:

    // Thrift
    thriftServer = new ThriftServer(rpcAddr, rpcPort, listenBacklog);
    ⋮
    thriftServer.start();
    ⋮
    thriftServer.stop();

    // CQL
    nativeServer = new org.apache.cassandra.transport.Server(nativeAddr, nativePort);
    ⋮
    nativeServer.start();
    ⋮
    nativeServer.stop();

Hooks

    import java.lang.instrument.Instrumentation;
    import javassist.ClassPool;
    import javassist.CtClass;
    import javassist.CtMethod;

    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer((loader, className, classBeingRedefined, protectionDomain, classfileBuffer) -> {
            // only touch the CQL native transport Server class; returning null leaves a class unchanged
            if (!"org/apache/cassandra/transport/Server".equals(className))
                return null;

            final ClassPool pool = ClassPool.getDefault();
            try {
                final CtClass ctClass = pool.get("org.apache.cassandra.transport.Server");

                // patch start() and stop() methods of the Server class
                {
                    final CtMethod method = ctClass.getDeclaredMethod("start");
                    method.insertAfter("com.instaclustr.Agent.serverStarted($0);");
                }
                {
                    final CtMethod method = ctClass.getDeclaredMethod("stop");
                    method.insertAfter("com.instaclustr.Agent.serverStopped($0);");
                }

                byte[] byteCode = ctClass.toBytecode();
                ctClass.detach();
                return byteCode; // return the modified byte-code
            } catch (final Exception e) {…}
            return null;
        });
    }

    // called when Server started: call systemd via dbus-java to start the target
    public static void serverStarted(final CassandraDaemon.Server server) {…}

    // called when Server stopped: call systemd via dbus-java to stop cassandra-cql.target
    public static void serverStopped(final CassandraDaemon.Server server) {…}
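The bodies of the two callbacks are elided on the slide. The following is a minimal sketch of the logic they might contain, not the actual Instaclustr implementation. UnitController is a hypothetical stand-in for the D-Bus call to systemd, CqlTargetBridge and RecordingController are illustrative names, and suppressing repeated start/stop notifications is an assumption added here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over "start/stop a systemd unit"; a real one would call systemd over D-Bus. */
interface UnitController {
    void startUnit(String unitName);
    void stopUnit(String unitName);
}

/** Maps CQL server life-cycle events onto the cassandra-cql.target unit. */
final class CqlTargetBridge {
    static final String CQL_TARGET = "cassandra-cql.target";

    private final UnitController controller;
    private boolean running; // last state reported to systemd

    CqlTargetBridge(UnitController controller) {
        this.controller = controller;
    }

    /** Called after Server.start(); starts the target once per transition. */
    synchronized void serverStarted() {
        if (running) {
            return; // already reported as started
        }
        running = true;
        controller.startUnit(CQL_TARGET);
    }

    /** Called after Server.stop(); stops the target once per transition. */
    synchronized void serverStopped() {
        if (!running) {
            return; // already reported as stopped
        }
        running = false;
        controller.stopUnit(CQL_TARGET);
    }
}

/** In-memory controller that records calls instead of talking to systemd. */
final class RecordingController implements UnitController {
    final List<String> calls = new ArrayList<>();

    @Override
    public void startUnit(String unitName) {
        calls.add("start " + unitName);
    }

    @Override
    public void stopUnit(String unitName) {
        calls.add("stop " + unitName);
    }
}
```

Keeping the systemd call behind an interface lets the bridge logic be exercised without a D-Bus connection. A production controller would wrap whatever dbus-java proxy the agent already holds.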

Docker Limitations and Sore Spots

• docker run is just a TTY proxy: the actual container process is under the docker dæmon process/cgroup
  • systemd requires startup & watchdog notifications to originate from the started process, a child, or a process in the same cgroup
• docker crash = all containers go bye-bye
• docker … everything (inc. image downloads & builds) runs as root in the dæmon!
  • processes inside containers are run un-elevated

• Development versions of systemd can now launch Docker containers natively via machinectl
• Tighter integration with systemd
  • Process hierarchy is correct: right cgroup and parents
  • Java Agent can notify systemd for startup, status & watchdog via JNA + libsystemd
