Building reliable Ceph clusters with SUSE Enterprise Storage
Survival skills for the real world
Lars Marowsky-Brée
Distinguished Engineer
lmb@suse.com
What this talk is not
● A comprehensive introduction to Ceph
● A SUSE Enterprise Storage roadmap session
● A discussion of Ceph performance tuning
SUSE Enterprise Storage - Reprise
The Ceph project
● An Open Source Software-Defined Storage project
● Multiple front-ends
– S3/Swift object interface
– Native Linux block IO
– Heterogeneous Block IO (iSCSI)
– Native Linux network file system (CephFS)
– Heterogeneous Network File System (nfs-ganesha)
– Low-level, C++/Python/… libraries
– Linux, UNIX, Windows, Applications, Cloud, Containers
● Common, smart data store (RADOS)
– Pseudo-random, algorithmic data distribution
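As a small taste of these front-ends, a minimal sketch using the stock CLI tools (the pool name mypool is a placeholder):

    # Store and fetch an object through the RADOS object interface
    rados -p mypool put greeting ./greeting.txt
    rados -p mypool get greeting /tmp/greeting.txt
    # Create a 10 GiB RBD image (native Linux block IO) and map it locally
    rbd create mypool/vol01 --size 10240
    sudo rbd map mypool/vol01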
Software-Defined Storage
Ceph Cluster: Logical View
[Diagram: MON, MDS, and OSD daemons forming the RADOS store, fronted by iSCSI, S3/Swift, and NFS gateways]
Introducing Dependability
● Availability
● Reliability
– Durability
● Safety
● Maintainability
The elephant in the room
● Before we discuss technology ...
● … guess what causes most outages?
Improve your human factor
● Great, you are already here!
● Training
● Documentation
● Team up with a world-class support and consulting organization
High-level considerations
Advantages of Homogeneity
● Eases system administration
● Components are interchangeable
● Lower purchasing costs
● Standardized ordering process
Murphy’s Law, 2016 version
● “At scale, everything fails.”
● Distributed systems eliminate Single Points of Failure, so that individual failures do not become service failures
● Distributed systems are still vulnerable to correlated failures
Advantages of Heterogeneity
Everything is broken …
… but everything is broken differently
Homogeneity is not sustainable
● Hardware gets replaced
– Replacement with same model not available, or
– not desirable given current prices
● Software updates are not (yet) globally immediate
● Requirements change
● Your cluster ends up being heterogeneous anyway
● … you might as well benefit from it.
Failure is inevitable; suffering is optional
● If you want uptime, prepare for downtime
● Architect your system to survive a single or multiple failures
● Test whether the system meets your SLA
– while degraded and during recovery!
How much availability do you need?
● Availability and durability are not free
● Cost and complexity increase exponentially
● Scale out makes some things easier
A bag of suggestions
Embrace diversity
● Automatic recovery requires a majority (>50%)
– Splitting into multiple different categories/models
– Feasible for some components
– Multiple architectures?
– Mix them across different racks/pods
● A 50:50 split still allows manual recovery in case of catastrophic failures
– Different UPS and power circuits
Hardware choices
● SUSE offers Reference Architectures:
– e.g., Lenovo, HPE, Cisco, Dell
● Partners offer turn-key solutions
– e.g., HPE, Thomas-Krenn
● SUSE Yes certification reduces risk
– https://www.suse.com/newsroom/post/2016/suse-extends-partner-software-certification-for-cloud-and-storage-customers/
● Small variations can have a huge impact!
Not all the eggs in one basket^Wrack
● Distribute servers physically to limit the impact of power outages, spills, …
● Ceph’s CRUSH map allows you to describe the physical topology of your fault domains (engineering speak for “availability zones”)
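For example, racks can be declared as CRUSH buckets so that replicas never share a rack; a minimal sketch with placeholder rack and host names:

    # Describe the physical topology to CRUSH
    ceph osd crush add-bucket rack1 rack
    ceph osd crush add-bucket rack2 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move rack2 root=default
    ceph osd crush move node-a rack=rack1
    ceph osd crush move node-b rack=rack2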
How many MONitors do I need?
2n+1: always an odd number, so that a majority (quorum) survives n monitor failures
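Three or five MONs cover most deployments; quorum can be verified at any time:

    ceph mon stat        # one-line summary: monitors and current quorum
    ceph quorum_status   # detailed quorum view (JSON)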
To converge roles or not
● “Hyper-converged” equals correlated failures
● It does drive down cost of implementation
● Sizing becomes less deterministic
● Services might recover at the same time
● At scale, don’t co-locate the MONs and OSDs
Storage diversity
HDDs:
● Avoid desktop HDDs
● Avoid sequential serial numbers
● Mount at different angles if paranoid
● Multiple vendors
SSDs:
● Avoid desktop SSDs
● Monitor wear-leveling
● Remember the journals see all writes
Storage Node Sizing
● Node failures are the most common failure granularity
– Admin mistake, network, kernel crash
● Consider impact of outage on:
– Performance (degraded and recovery)
– and capacity!
● A single node should not be more than 10% of your total capacity
● Free capacity should be larger than your largest node
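The OSD utilization tree makes it easy to check each node’s share of raw capacity and your recovery headroom:

    ceph osd df tree   # size, use, and %USE per host and per OSD
    ceph df            # cluster-wide raw and per-pool usage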
Data availability and durability
● Replication:
– Number of copies (n)
– Linear overhead: n× raw capacity
● Erasure Coding:
– Flexible number of data (k) and coding (m) blocks
– Survives the loss of any m blocks
– Fractional overhead: (k+m)/k raw capacity
– https://www.youtube.com/watch?v=-KyGv6AZN9M
Durability: Three-way Replication
Usable capacity: 33%. Survives 2 faults.
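A minimal sketch of three-way replication on an existing pool (the pool name is a placeholder):

    ceph osd pool set mypool size 3      # three copies of every object
    ceph osd pool set mypool min_size 2  # keep serving IO after losing one copy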
Durability: 4+3 Erasure Coding
Usable capacity: 57%. Survives 3 faults.
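Creating such a 4+3 pool, sketched with placeholder profile and pool names (placement-group counts depend on your cluster):

    ceph osd erasure-code-profile set ec43 k=4 m=3
    ceph osd pool create ecpool 128 128 erasure ec43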
Consider Cache Tiering
● Data in cache tier is replicated
● Backing tier may be slower, but more durable
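Attaching a replicated cache pool in front of an erasure-coded backing pool might look like this (pool names are placeholders):

    ceph osd tier add ecpool hotpool           # hotpool caches for ecpool
    ceph osd tier cache-mode hotpool writeback
    ceph osd tier set-overlay ecpool hotpool   # clients transparently hit the cache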
Durability 201
● Different strokes for different pools
● Erasure coding schemes galore
Finding and correcting bad data
● Ceph “scrubbing” periodically detects inconsistent or missing objects in placement groups
http://ceph.com/planet/ceph-manually-repair-object/
http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#scrubbing
● SUSE Enterprise Storage 5 will validate checksums on every read
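Scrubs can also be triggered, and inconsistencies repaired, by hand; a sketch with a placeholder placement-group ID:

    ceph pg deep-scrub 2.1f   # re-read and compare all replicas in this PG
    ceph pg repair 2.1f       # repair a PG reported inconsistent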
Automatic fault detection and recovery
● Do you want this in your cluster?
● Consider setting “noout”:
– during maintenance windows
– in small clusters
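“noout” stops Ceph from marking OSDs out and rebalancing while a node is merely restarting:

    ceph osd set noout     # before the maintenance window
    # ... reboot or service the node ...
    ceph osd unset noout   # restore automatic out-marking afterwards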
Network considerations
● Have both the public and cluster network bonded
● Consider different NICs
– Use last year’s NICs and switches
● One channel from each network to each switch
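An LACP bond sketched with iproute2; on SUSE systems you would normally persist this via wicked or YaST, and the interface names are examples:

    ip link add bond0 type bond mode 802.3ad   # LACP aggregation
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0
    ip link set bond0 up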
Gateway considerations
● RadosGW (S3/Swift):
– Use HTTP/TCP load balancers
– Possible to build using SLE HA with LVS or haproxy
● iSCSI targets:
– Multiple gateways, natively supported by iSCSI
● Improves availability and throughput
– Make sure you meet your performance SLAs during degraded modes
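A load-balancer health check can be as simple as probing each RadosGW instance; 7480 is the default civetweb port, and the hostname is a placeholder:

    curl -sf http://rgw1.example.com:7480/ >/dev/null && echo "rgw1 alive"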
Avoid configuration drift
● Ensure that systems are configured consistently
– Installed packages
– Package versions
– Configuration (NTP, logging, passwords, …)
● Avoid manual configuration
● Use Salt instead
http://ourobengr.com/2016/11/hello-salty-goodness/
https://www.suse.com/communities/blog/managing-configuration-drift-salt-snapper/
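Once Salt is in place, verifying and enforcing consistency are one-liners; a sketch assuming a configured Salt master and minions:

    salt '*' pkg.version ceph   # compare installed Ceph versions across all nodes
    salt '*' state.highstate    # re-apply the desired configuration everywhere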
Trust but verify a.k.a. monitoring
● Performance as the system ages
● SSD degradation / wear leveling
● Capacity utilization
● “Free” capacity is usable for recovery
● React to issues in a timely fashion!
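A few commands worth wiring into your monitoring; the smartctl device and the wear attribute name vary by SSD vendor:

    ceph health detail                    # current warnings and errors
    ceph df                               # capacity utilization per pool
    ceph osd perf                         # per-OSD commit/apply latency
    smartctl -a /dev/sda | grep -i wear   # SSD wear-leveling attribute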
Update, always (but with care)
● Updates are good for your system
– Security
– Performance
– Stability
● Ceph remains available even while updates are being rolled out
● SUSE’s tested maintenance updates are a core part of the product’s value
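A rolling update, one node at a time, might look like this sketch; wait for HEALTH_OK before moving to the next node:

    ceph osd set noout                  # don't rebalance during the restart
    zypper patch                        # apply SUSE maintenance updates
    systemctl restart ceph-osd.target   # restart this node's OSDs
    ceph osd unset noout
    ceph -s                             # confirm HEALTH_OK before continuing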
Trust nobody (not even SUSE)
● If at all possible, use a staging system
– Ideally: a (reduced) version of your production environment
– At least: a virtualized environment
● Test updates before rolling them out in production
– Not just code, but also processes!
● Long-term maintainability:
– Avoid vendor lock-in, use Open Source
Disaster will strike
● Does it matter?
● If it does:
– Backups
– Replicate to other sites
● rbd-mirror, radosgw multi-site
● Have fire drills!
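Enabling pool-level RBD mirroring to a second site, sketched with placeholder pool, client, and cluster names (an rbd-mirror daemon must run at the backup site):

    rbd mirror pool enable mypool pool                    # mirror every image in the pool
    rbd mirror pool peer add mypool client.backup@site-b  # register the remote cluster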
Avoid complexity (KISS)
● Be aggressive in what you test
– Test all the features
● Be conservative in what you deploy
– Deploy only what you need
In conclusion
Don’t panic.
SUSE’s here to help.