IMPROVING CASSANDRA LATENCY AND RESILIENCY WITH NVMe over Fabrics
VIKING ENTERPRISE SOLUTIONS, A Sanmina Company
NVMe Developer Days December 4th, 2018
Kais Belgaied & David Paulsen
APACHE CASSANDRA
§ Distributed database built for high-speed online transactional data
  - Deployments include Netflix, Hulu, The Weather Channel, Comcast, eBay, Intuit, CERN, and 1500+ other companies with large active datasets
  - Handles up to 75K nodes, 10 PB, and billions of requests per day, with linear-scale performance
§ Fault tolerant
  - Data automatically replicated to multiple nodes
  - Replication across multiple data centers
§ Decentralized
  - Peer-to-peer architecture with no single point of failure
  - No network bottleneck
CASSANDRA ARCHITECTURE ELEMENTS
§ Node: where the data is stored - the basic infrastructure component
§ A consistent hash maps the database’s row keys to tokens
§ Each node in a data center is responsible for a token range
  è Quick data placement
§ Any node can respond to queries
§ Data is replicated N times (N = replication factor) to adjacent nodes, with a circular (ring) mapping
§ Multi-data-center awareness: rings can span multiple data centers for failover and disaster recovery
TRADITIONAL CASSANDRA RECOVERY FROM NODE FAILURE
§ Storage is attached to the node è when a node breaks:
  ‒ If it is repaired soon enough (within 3 hours by default), any missed writes are replayed back to it
  ‒ If it must be replaced, the node’s content is rebuilt from the other replicas
  ‒ If it must be removed, the ring rebalances by redistributing the data to other nodes and restores the replica count
§ What if the disk is perfectly fine?
[Diagram: an eight-node ring (nodes 1-8); a client sends key A to the replication coordinator node, which replicates it to the 1st replica node and the adjacent nodes in the ring]
CASSANDRA ON NVME-OF
Decouple the data storage from the node. Assign the hosts exclusive, protected access to NVMe namespaces on the fabric.
è The fabric management software controls visibility and arbitrates access
è The fabric can scale across the data center
[Diagram: two NDS-2248F NVMe-over-Fabrics enclosures (disks and servers) exporting namespaces nvme2, nvme3, nvme4 across a storage fabric network to the Cassandra cluster nodes host1-host5]
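On the host side, attaching a fabric namespace can be done with the standard `nvme-cli` tool; the transport, address, and subsystem NQN below are illustrative placeholders, not values from this deployment:

```shell
# Discover the subsystems the fabric exposes to this host
# (transport, address, and port are example values)
nvme discover -t rdma -a 192.168.10.20 -s 4420

# Connect to one subsystem; the NQN is a hypothetical name for "nvme4"
nvme connect -t rdma -a 192.168.10.20 -s 4420 \
    -n nqn.2018-11.com.example:nvme4

# The namespace now appears as a local block device, e.g. /dev/nvme0n1
nvme list
```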
RESILIENCY: RECOVERING FROM A NODE FAILURE
1. Host4 becomes unavailable
2. Disable nvme4 access from host4
3. Select a replacement host: host5
4. Enable host5 connectivity to nvme4
5. Resume service on host5 with nvme4
[Diagram: two NVMe-oF servers; host4 fails (X) and its namespace nvme4 is reassigned to host5 while nvme2 and nvme3 stay in place]
Split-brain avoidance guarantee: should host4 come back, all of its further write() ops are rejected.
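The five steps above can be sketched as a host-side runbook. The fabric-management command that revokes host4's access is deployment-specific, so a hypothetical placeholder is shown; the NQN and mount path are also illustrative:

```shell
# 1-2. Fence the failed host: revoke host4's access to nvme4 at the
#      fabric manager (hypothetical CLI; the real command depends on
#      the fabric management software)
fabric-mgmt revoke --namespace nvme4 --host host4

# 3-4. On the replacement host (host5), attach the same namespace
nvme connect -t rdma -a 192.168.10.20 -s 4420 \
    -n nqn.2018-11.com.example:nvme4

# 5. Mount the surviving data and resume the Cassandra node
mount /dev/nvme0n1 /var/lib/cassandra
systemctl start cassandra
```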
IMPACT ON AVAILABILITY AND RESILIENCY
§ Cassandra intentionally delays node removal - 3 hours out of the box
  - Reason: rebalancing the keyspace causes very costly data movement
  - While waiting, a portion of the data is vulnerable: reduced replication
  - This happens even when the failure is caused by a network or server outage while the data (storage) is still healthy
§ With a disaggregated fabric topology, such as NVMe-oF, we can:
  - Give up on defective servers/networks sooner, with more realistic timeouts
  - Start the replacement with a physical standby node when possible
    - Variant: use a virtual or containerized node in the interim
  - Shorten the window of data vulnerability
  - Reduce the amount of write catch-up needed to recover: a shorter time to recover means fewer writes accumulated
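The 3-hour window above corresponds to Cassandra's hinted-handoff setting in `cassandra.yaml`; with disaggregated storage it can be shortened to a more realistic timeout (the 30-minute value below is illustrative, not a recommendation):

```yaml
# cassandra.yaml -- hinted handoff controls the write catch-up window
hinted_handoff_enabled: true
# Default is 10800000 ms (3 hours); a disaggregated deployment can
# afford a shorter window, e.g. 30 minutes (illustrative value)
max_hint_window_in_ms: 1800000
```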
CASSANDRA WITH DOCKER
• Each Cassandra node runs in a Docker container
• The host/Docker engine:
  » Discovers the NVMe device on the fabric and connects to it (e.g. /dev/nvme0n1)
  » Mounts the filesystem locally on the host (e.g. /var/lib/docker/volumes/cassandra_volume1/_data)
  è The shared NVMe-oF fabric network is protected from / not exposed to the containers
  è The containers remain stateless
• The Docker node is started with a volume mapping to an internal path (e.g. /mnt/cassdata)
• When a Docker container dies:
  • An eligible new host takes ownership of the NVMe device(s)
    è Hosts need to watch out for conflicts in accessing disks
  • The Cassandra node is started on the new host
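A minimal sketch of the host-side flow, assuming the fabric namespace already shows up as /dev/nvme0n1; the volume and container names and the image tag are illustrative:

```shell
# Host mounts the NVMe-oF namespace into a Docker-managed volume path
mkdir -p /var/lib/docker/volumes/cassandra_volume1/_data
mount /dev/nvme0n1 /var/lib/docker/volumes/cassandra_volume1/_data

# The container sees only the mapped path, never the fabric itself,
# so it stays stateless and the fabric stays unexposed
docker run -d --name cassandra-node1 \
    -v cassandra_volume1:/mnt/cassdata \
    cassandra:3.11
# (cassandra.yaml inside the container must point data_file_directories
#  at /mnt/cassdata; the stock image defaults to /var/lib/cassandra)
```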
SCALING CONSIDERATIONS
• The NVMe-over-Fabrics topology scales as far as the underlying fabric extends:
  à L2 network - a cascade of switches across multiple racks for InfiniBand or RoCE
  à Across data centers for NVMe-oF over TCP
• Tradeoffs: latency to storage vs. cluster size vs. fault domains
• Density of the managed entities
• Even with as few as 3 nodes, we needed Docker Swarm to deploy the cluster
  • It already shows difficulties in describing the nodes and orchestrating the Cassandra services
  è A scalable and robust orchestration service, such as Kubernetes, is needed for more common deployment sizes
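For the 3-node experiment, the Docker Swarm deployment can be sketched as below (addresses, network, and image tag are illustrative); it is the per-node storage mapping, which a plain replicated service cannot express, that makes orchestration awkward at this scale:

```shell
# Initialize a swarm on the first host and join the others to it
docker swarm init --advertise-addr 192.168.1.11              # on host1
docker swarm join --token <worker-token> 192.168.1.11:2377   # on host2, host3

# Create an overlay network and run Cassandra as a replicated service
docker network create -d overlay cass-net
docker service create --name cassandra --replicas 3 \
    --network cass-net cassandra:3.11
```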
RECOVERY ACTIONS & COSTS COMPARISON
Replication | Reliability | Max Downtime/Year     | Action & Cost (Traditional) | Action & Cost (NVMe-oF)
2N          | 99%         | 3 days 15 hrs 36 min  | Remove/Replace - Very High  | Repair - Very Low
3N          | 99.9%       | 8 hrs 45 min 36 sec   | Remove/Replace - Very High  | Repair - Very Low
4N          | 99.99%      | 52 min 33 sec         | Repair - Low                | Repair - Low
5N          | 99.999%     | 5 min 15 sec          | Repair - Very Low           | Repair - Very Low
CASSANDRA ON NVME-OF BENCHMARKING
§ Storage elements of the NVMe fabric
  - Disks: low-latency Intel® Optane™ SSD DC P4800X
  - Fabric enclosure: Newisys NDS-2248 (2U, 24-disk) high-performance JBOF
§ Cassandra cluster nodes and clients
  - Intel 1U servers, Xeon E5-2667 v4, 128 GB RAM as Cassandra nodes
  - Each Cassandra node has two 100 Gbps connections
    q One to the NVMe-oF fabric
    q One for intra-cluster Cassandra chatter (replication, gossip, etc.)
  - Clients connect to any node
YAHOO CLOUD SERVING BENCHMARK - YCSB
https://github.com/brianfrankcooper/YCSB
§ YCSB load-test runs comparing performance between local SSD drives and NVMe-oF-connected drives
§ Default Cassandra configuration settings
§ recordcount 5000000, 3-5 minute runs
§ Varying number of threads per run
§ 50-50 read/update transaction mix
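A run of the benchmark described above might look like the following (hosts and thread count are illustrative); `workloada` is YCSB's stock 50/50 read/update mix:

```shell
# Load 5M records into the cluster through the CQL binding
bin/ycsb load cassandra-cql -P workloads/workloada \
    -p hosts=host1 -p recordcount=5000000 -threads 64

# Run the 50/50 read/update mix, capped at a 5-minute run
bin/ycsb run cassandra-cql -P workloads/workloada \
    -p hosts=host1 -p recordcount=5000000 \
    -p operationcount=10000000 -p maxexecutiontime=300 -threads 64
```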
YCSB TRANSACTIONS/S
The same nodes, used in two ways:
- With their internal SSDs
- With Optane SSDs over NVMe-oF
Load generated from the same clients
è With internal SSDs, disk I/O is the bottleneck
è Not the case for the NVMe-oF disks
YCSB LATENCY
• As load increases:
  § With internal SSDs: SATA queue buildup becomes the major contributor to latency
  § With NVMe disks over NVMe-oF:
    - No storage I/O queue buildup
    - The increase in Cassandra chatter becomes noticeable
CONCLUSIONS
§ NVMe over Fabrics can yield better TCO
  - Cheap HDDs tend to cause inefficient utilization at large scale
  - Loss of attached disks means wasting an otherwise healthy node
  - Unnecessary reconstruction warrants over-provisioning
§ Low-latency access to storage can make a difference
  - High media latency causes queue buildup
§ A disaggregated fabric topology, such as NVMe-oF, improves resiliency and availability
§ Containers help with fast recovery, and require an orchestration service to scale to the level of the fabric
THANK YOU