IMPROVING CASSANDRA LATENCY AND RESILIENCY WITH NVMe over Fabrics
VIKING ENTERPRISE SOLUTIONS, A Sanmina Company
NVMe Developer Days December 4th, 2018
Kais Belgaied & David Paulsen
APACHE CASSANDRA
§ Distributed database built for high-speed online transactional data
  - Deployments include Netflix, Hulu, The Weather Channel, Comcast, eBay, Intuit, CERN, and 1500+ other companies with large active datasets
  - Handles up to 75K nodes, 10 PB, and billions of requests per day, with linear-scale performance
§ Fault tolerant
  - Data automatically replicated to multiple nodes
  - Replication across multiple data centers
§ Decentralized
  - Peer-to-peer architecture with no single point of failure
  - No network bottleneck
CASSANDRA ARCHITECTURE ELEMENTS
§ Node: where the data is stored - the basic infrastructure component
§ A consistent hash maps the database’s row keys to tokens
§ Each node in a data center is responsible for a token range
  è Quick data placement
§ Any node can respond to queries
§ Data is replicated N times (N = replication factor) to adjacent nodes, with a circular (ring) mapping
§ Multi-data-center awareness: rings can span multiple data centers for failover and disaster recovery
TRADITIONAL CASSANDRA RECOVERY FROM NODE FAILURE
§ Storage is attached to the node è when a node breaks:
  ‒ If it is repaired soon enough (within 3 hours by default), any missed writes are replayed back to it
  ‒ If it must be replaced, the node’s content is rebuilt from the other replicas
  ‒ If it must be removed, the ring rebalances by redistributing the data to other nodes and restores the replica count
§ What if the disk is perfectly fine?
[Diagram: an eight-node ring (nodes 1-8); a client sends key A to the replication coordinator node, which replicates it to the 1st replica node and the adjacent nodes in the ring]
CASSANDRA ON NVME-OF
Decouple the data storage from the node. Assign the hosts exclusive, protected access to NVMe namespaces on the fabric.
è The fabric management software controls visibility and arbitrates access
è The fabric can scale across the data center
[Diagram: two NDS-2248F NVMe-over-Fabrics enclosures (disks and servers) exporting namespaces nvme2, nvme3, nvme4 across a storage fabric network to the Cassandra cluster nodes host1-host5]
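On the host side, attaching a fabric namespace can be done with the standard `nvme-cli` tool; the transport, address, and subsystem NQN below are illustrative placeholders, not values from this deployment:

```shell
# Discover the subsystems the fabric exposes to this host
# (transport, address, and port are example values)
nvme discover -t rdma -a 192.168.10.20 -s 4420

# Connect to one subsystem; the NQN is a hypothetical name for "nvme4"
nvme connect -t rdma -a 192.168.10.20 -s 4420 \
    -n nqn.2018-11.com.example:nvme4

# The namespace now appears as a local block device, e.g. /dev/nvme0n1
nvme list
```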
RESILIENCY: RECOVERING FROM A NODE FAILURE
1. Host4 becomes unavailable
2. Disable nvme4 access from host4
3. Select a replacement host: host5
4. Enable host5 connectivity to nvme4
5. Resume service on host5 with nvme4
[Diagram: two NVMe-oF servers; host4 fails (X) and its namespace nvme4 is reassigned to host5 while nvme2 and nvme3 stay in place]
Split-brain avoidance guarantee: should host4 come back, all of its further write() ops are rejected.
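The five steps above can be sketched as a host-side runbook. The fabric-management command that revokes host4's access is deployment-specific, so a hypothetical placeholder is shown; the NQN and mount path are also illustrative:

```shell
# 1-2. Fence the failed host: revoke host4's access to nvme4 at the
#      fabric manager (hypothetical CLI; the real command depends on
#      the fabric management software)
fabric-mgmt revoke --namespace nvme4 --host host4

# 3-4. On the replacement host (host5), attach the same namespace
nvme connect -t rdma -a 192.168.10.20 -s 4420 \
    -n nqn.2018-11.com.example:nvme4

# 5. Mount the surviving data and resume the Cassandra node
mount /dev/nvme0n1 /var/lib/cassandra
systemctl start cassandra
```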
IMPACT ON AVAILABILITY AND RESILIENCY
§ Cassandra intentionally delays node removal - 3 hours out of the box
  - Reason: rebalancing the keyspace causes very costly data movement
  - While waiting, a portion of the data is vulnerable: reduced replication
  - This happens even when the failure is caused by a network or server outage while the data (storage) is still healthy
§ With a disaggregated fabric topology, such as NVMe-oF, we can:
  - Give up on defective servers/networks sooner, with more realistic timeouts
  - Start the replacement with a physical standby node when possible
    - Variant: use a virtual or containerized node in the interim
  - Shorten the window of data vulnerability
  - Reduce the amount of write catch-up needed to recover: a shorter time to recover means fewer writes accumulated
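The 3-hour window above corresponds to Cassandra's hinted-handoff setting in `cassandra.yaml`; with disaggregated storage it can be shortened to a more realistic timeout (the 30-minute value below is illustrative, not a recommendation):

```yaml
# cassandra.yaml -- hinted handoff controls the write catch-up window
hinted_handoff_enabled: true
# Default is 10800000 ms (3 hours); a disaggregated deployment can
# afford a shorter window, e.g. 30 minutes (illustrative value)
max_hint_window_in_ms: 1800000
```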
CASSANDRA WITH DOCKER
• Each Cassandra node runs in a Docker container
• The host/Docker engine:
  » Discovers the NVMe device on the fabric and connects to it (e.g. /dev/nvme0n1)
  » Mounts the filesystem locally on the host (e.g. /var/lib/docker/volumes/cassandra_volume1/_data)
  è The shared NVMe-oF fabric network is protected from / not exposed to the containers
  è The containers remain stateless
• The Docker node is started with a volume mapping to an internal path (e.g. /mnt/cassdata)
• When a Docker container dies:
  • An eligible new host takes ownership of the NVMe device(s)
    è Hosts need to watch out for conflicts in accessing disks
  • The Cassandra node is started on the new host
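A minimal sketch of the host-side flow, assuming the fabric namespace already shows up as /dev/nvme0n1; the volume and container names and the image tag are illustrative:

```shell
# Host mounts the NVMe-oF namespace into a Docker-managed volume path
mkdir -p /var/lib/docker/volumes/cassandra_volume1/_data
mount /dev/nvme0n1 /var/lib/docker/volumes/cassandra_volume1/_data

# The container sees only the mapped path, never the fabric itself,
# so it stays stateless and the fabric stays unexposed
docker run -d --name cassandra-node1 \
    -v cassandra_volume1:/mnt/cassdata \
    cassandra:3.11
# (cassandra.yaml inside the container must point data_file_directories
#  at /mnt/cassdata; the stock image defaults to /var/lib/cassandra)
```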
SCALING CONSIDERATIONS
• The NVMe-over-Fabrics topology scales as far as the underlying fabric extends:
  à L2 network - a cascade of switches across multiple racks for InfiniBand or RoCE
  à Across data centers for NVMe-oF over TCP
• Tradeoffs: latency to storage vs. cluster size vs. fault domains
• Density of the managed entities
• Even with as few as 3 nodes, we needed Docker Swarm to deploy the cluster
  • It already shows difficulties in describing the nodes and orchestrating the Cassandra services
  è A scalable and robust orchestration service, such as Kubernetes, is needed for more common deployment sizes
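For the 3-node experiment, the Docker Swarm deployment can be sketched as below (addresses, network, and image tag are illustrative); it is the per-node storage mapping, which a plain replicated service cannot express, that makes orchestration awkward at this scale:

```shell
# Initialize a swarm on the first host and join the others to it
docker swarm init --advertise-addr 192.168.1.11              # on host1
docker swarm join --token <worker-token> 192.168.1.11:2377   # on host2, host3

# Create an overlay network and run Cassandra as a replicated service
docker network create -d overlay cass-net
docker service create --name cassandra --replicas 3 \
    --network cass-net cassandra:3.11
```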
RECOVERY ACTIONS & COSTS COMPARISON
Replication | Reliability | Max Downtime/Year     | Action & Cost (Traditional) | Action & Cost (NVMe-oF)
2N          | 99%         | 3 days 15 hrs 36 min  | Remove/Replace - Very High  | Repair - Very Low
3N          | 99.9%       | 8 hrs 45 min 36 sec   | Remove/Replace - Very High  | Repair - Very Low
4N          | 99.99%      | 52 min 33 sec         | Repair - Low                | Repair - Low
5N          | 99.999%     | 5 min 15 sec          | Repair - Very Low           | Repair - Very Low
CASSANDRA ON NVME-OF BENCHMARKING
§ Storage elements of the NVMe fabric
  - Disks: low-latency Intel® Optane™ SSD DC P4800X
  - Fabric enclosure: Newisys NDS-2248 (2U, 24-disk) high-performance JBOF
§ Cassandra cluster nodes and clients
  - Intel 1U servers, Xeon E5-2667 v4, 128 GB RAM as Cassandra nodes
  - Each Cassandra node has two 100 Gbps connections
    q One to the NVMe-oF fabric
    q One for intra-cluster Cassandra chatter (replication, gossip, etc.)
  - Clients connect to any node
YAHOO CLOUD SERVING BENCHMARK - YCSB
https://github.com/brianfrankcooper/YCSB
§ YCSB load-test runs comparing performance between local SSD drives and NVMe-oF-connected drives
§ Default Cassandra configuration settings
§ recordcount 5000000, 3-5 minute runs
§ Varying number of threads per run
§ 50-50 read/update transaction mix
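A run of the benchmark described above might look like the following (hosts and thread count are illustrative); `workloada` is YCSB's stock 50/50 read/update mix:

```shell
# Load 5M records into the cluster through the CQL binding
bin/ycsb load cassandra-cql -P workloads/workloada \
    -p hosts=host1 -p recordcount=5000000 -threads 64

# Run the 50/50 read/update mix, capped at a 5-minute run
bin/ycsb run cassandra-cql -P workloads/workloada \
    -p hosts=host1 -p recordcount=5000000 \
    -p operationcount=10000000 -p maxexecutiontime=300 -threads 64
```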
YCSB TRANSACTIONS/S
The same nodes, used in two ways:
- With their internal SSDs
- With Optane SSDs over NVMe-oF
Load generated from the same clients
è With internal SSDs, disk I/O is the bottleneck
è Not the case for the NVMe-oF disks
YCSB LATENCY
• As load increases:
  § With internal SSDs: SATA queue buildup becomes the major contributor to latency
  § With NVMe disks over NVMe-oF:
    - No storage I/O queue buildup
    - The increase in Cassandra chatter becomes noticeable
CONCLUSIONS
§ NVMe over Fabrics can yield better TCO
  - Cheap HDDs tend to cause inefficient utilization at large scale
  - Loss of attached disks means wasting an otherwise healthy node
  - Unnecessary reconstruction warrants over-provisioning
§ Low-latency access to storage can make a difference
  - High media latency causes queue buildup
§ A disaggregated fabric topology, such as NVMe-oF, improves resiliency and availability
§ Containers help with fast recovery, and require an orchestration service to scale to the level of the fabric
THANK YOU