Building reliable Ceph clusters with SUSE Enterprise Storage


Transcript of Building reliable Ceph clusters with SUSE Enterprise Storage

Page 1: Building reliable Ceph clusters with SUSE Enterprise Storage

Building reliable Ceph clusters with SUSE Enterprise Storage

Survival skills for the real world

Lars Marowsky-Brée, Distinguished Engineer, [email protected]

Page 2: Building reliable Ceph clusters with SUSE Enterprise Storage

What this talk is not

● A comprehensive introduction to Ceph

● SUSE Enterprise Storage roadmap session

● A discussion of Ceph performance tuning


Page 3: Building reliable Ceph clusters with SUSE Enterprise Storage

SUSE Enterprise Storage - Reprise


Page 4: Building reliable Ceph clusters with SUSE Enterprise Storage

The Ceph project

● An Open Source Software-Defined-Storage project

● Multiple front-ends

– S3/Swift object interface

– Native Linux block IO

– Heterogeneous Block IO (iSCSI)

– Native Linux network file system (CephFS)

– Heterogeneous Network File System (nfs-ganesha)

– Low-level, C++/Python/… libraries

– Linux, UNIX, Windows, Applications, Cloud, Containers

● Common, smart data store (RADOS)

– Pseudo-random, algorithmic data distribution
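
As a rough illustration that all of these front-ends share the one RADOS store, the same cluster can serve objects and block devices side by side. A minimal sketch, assuming a pool named mypool already exists (all names are placeholders):

    rados -p mypool put greeting /etc/motd    # store a file as a RADOS object
    rados -p mypool ls                        # list objects in the pool
    rbd create mypool/vol1 --size 1024        # create a 1 GiB RBD block image in the same pool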


Page 5: Building reliable Ceph clusters with SUSE Enterprise Storage

Software-Defined-Storage

Page 6: Building reliable Ceph clusters with SUSE Enterprise Storage

Ceph Cluster: Logical View


[Diagram: a RADOS cluster with 3 MONs, 2 MDSs, 6 OSDs, fronted by 3 iSCSI gateways, 2 S3/Swift gateways, and an NFS gateway]

Page 7: Building reliable Ceph clusters with SUSE Enterprise Storage

Introducing Dependability


Page 8: Building reliable Ceph clusters with SUSE Enterprise Storage

Introducing dependability

● Availability

● Reliability

– Durability

● Safety

● Maintainability


Page 9: Building reliable Ceph clusters with SUSE Enterprise Storage

The elephant in the room

● Before we discuss technology ...

● … guess what causes most outages?


Page 10: Building reliable Ceph clusters with SUSE Enterprise Storage

Improve your human factor

● Great, you are already here!

● Training

● Documentation

● Team your team up with a world-class support and consulting organization


Page 11: Building reliable Ceph clusters with SUSE Enterprise Storage

High-level considerations


Page 12: Building reliable Ceph clusters with SUSE Enterprise Storage

Advantages of Homogeneity

● Eases system administration

● Components are interchangeable

● Lower purchasing costs

● Standardized ordering process


Page 13: Building reliable Ceph clusters with SUSE Enterprise Storage

Murphy’s Law, 2016 version

● “At scale, everything fails.”

● Distributed systems protect against individual failures causing service failures by eliminating Single Points of Failure

● Distributed systems are still vulnerable to correlated failures



Page 14: Building reliable Ceph clusters with SUSE Enterprise Storage

Advantages of Heterogeneity

Everything is broken …

… but everything is broken differently


Page 15: Building reliable Ceph clusters with SUSE Enterprise Storage

Homogeneity is not sustainable

● Hardware gets replaced

– Replacement with same model not available, or

– not desirable given current prices

● Software updates are not (yet) globally immediate

● Requirements change

● Your cluster ends up being heterogeneous anyway

● … you might as well benefit from it.


Page 16: Building reliable Ceph clusters with SUSE Enterprise Storage

Failure is inevitable; suffering is optional

● If you want uptime, prepare for downtime

● Architect your system to survive single or multiple failures

● Test whether the system meets your SLA

– while degraded and during recovery!


Page 17: Building reliable Ceph clusters with SUSE Enterprise Storage

How much availability do you need?

● Availability and durability are not free

● Cost and complexity increase exponentially

● Scale out makes some things easier


Page 18: Building reliable Ceph clusters with SUSE Enterprise Storage

A bag of suggestions


Page 19: Building reliable Ceph clusters with SUSE Enterprise Storage

Embrace diversity

● Automatic recovery requires a >50% majority

– Split into multiple (ideally three or more) categories/models

– Feasible for some components

– Multiple architectures?

– Mix them across different racks/pods

● A 50:50 split still allows manual recovery in case of catastrophic failures

– Different UPS and power circuits


Page 20: Building reliable Ceph clusters with SUSE Enterprise Storage

Hardware choices

● SUSE offers Reference Architectures:

– e.g., Lenovo, HPE, Cisco, Dell

● Partners offer turn-key solutions

– e.g., HPE, Thomas-Krenn

● SUSE Yes certification reduces risk

– https://www.suse.com/newsroom/post/2016/suse-extends-partner-software-certification-for-cloud-and-storage-customers/

● Small variations can have a huge impact!


Page 21: Building reliable Ceph clusters with SUSE Enterprise Storage

Not all the eggs in one basket^Wrack

● Distribute servers physically to limit the impact of power outages, spills, …

● Ceph’s CRUSH map allows you to describe the physical topology of your fault domains (engineering speak for “availability zones”)
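
A minimal sketch of describing such a topology with the CRUSH CLI, assuming two racks and hosts named node1/node2 (all names are placeholders):

    ceph osd crush add-bucket rack1 rack     # create rack-level buckets
    ceph osd crush add-bucket rack2 rack
    ceph osd crush move rack1 root=default   # attach the racks to the default root
    ceph osd crush move rack2 root=default
    ceph osd crush move node1 rack=rack1     # place the hosts under their racks
    ceph osd crush move node2 rack=rack2
    ceph osd tree                            # verify the resulting hierarchy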


Page 22: Building reliable Ceph clusters with SUSE Enterprise Storage

How many MONitors do I need?


2n+1
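
In other words: to survive n simultaneous monitor failures you need 2n+1 monitors, so 3 MONs tolerate one failure and 5 tolerate two. A quick way to see who is currently in quorum (output format is release-dependent):

    ceph mon stat                            # short monitor summary including quorum membership
    ceph quorum_status --format json-pretty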

Page 23: Building reliable Ceph clusters with SUSE Enterprise Storage

To converge roles or not

● “Hyper converged” equals correlated failures

● It does drive down cost of implementation

● Sizing becomes less deterministic

● Services might recover at the same time

● At scale, don’t co-locate the MONs and OSDs


Page 24: Building reliable Ceph clusters with SUSE Enterprise Storage

Storage diversity


● Avoid desktop HDDs

● Avoid sequential serial numbers

● Mount at different angles if paranoid

● Multiple vendors

● Avoid desktop SSDs

● Monitor wear-leveling

● Remember the journals see all writes
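
One way to keep an eye on wear-leveling is SMART; a hedged sketch, since device names and the exact attribute names vary by vendor:

    smartctl -a /dev/sdX | grep -i -e wear -e lifetime -e reallocated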

Page 25: Building reliable Ceph clusters with SUSE Enterprise Storage

Storage Node Sizing

● Node failures are the most common failure granularity

– Admin mistake, network, kernel crash

● Consider impact of outage on:

– Performance (degraded and recovery)

– and capacity!

● A single node should not be more than 10% of your total capacity

● Free capacity should be larger than the largest node
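
The built-in utilization commands make both rules easy to check; for example, with 12 equally sized nodes each one holds roughly 8% of raw capacity, comfortably under the 10% guideline:

    ceph df        # raw and per-pool utilization: is free space larger than the largest node?
    ceph osd df    # per-OSD utilization; sum one host's OSDs to get that node's share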


Page 26: Building reliable Ceph clusters with SUSE Enterprise Storage

Data availability and durability

● Replication:

– Number of copies

– Linear overhead

● Erasure Coding:

– Flexible number of data (k) and coding (m) blocks

– Can survive a configurable number of outages

– Fractional overhead

– https://www.youtube.com/watch?v=-KyGv6AZN9M


[Graphic: (k+m)/k and 2n+1]
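
A minimal sketch of creating one pool of each kind; pool names, PG counts and the 4+3 profile are examples only:

    ceph osd pool create rep_pool 128 128 replicated   # replicated pool
    ceph osd pool set rep_pool size 3                  # keep three copies
    ceph osd erasure-code-profile set ec43 k=4 m=3     # 4 data + 3 coding chunks
    ceph osd pool create ec_pool 128 128 erasure ec43  # erasure-coded pool using that profile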

Page 27: Building reliable Ceph clusters with SUSE Enterprise Storage

Durability: Three-way Replication


Usable capacity: 33%

Durability: 2 faults

Page 28: Building reliable Ceph clusters with SUSE Enterprise Storage

Durability: 4+3 Erasure Coding


Usable capacity: 57%

Durability: 3 faults
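
A quick check of the percentages on these two durability slides, with k data and m coding chunks:

    \[ \frac{k}{k+m} = \frac{4}{4+3} = \frac{4}{7} \approx 57\% \text{ usable, tolerating } m = 3 \text{ faults} \]
    \[ \frac{1}{1+2} = \frac{1}{3} \approx 33\% \text{ usable with 3-way replication, tolerating } 2 \text{ faults} \]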

Page 29: Building reliable Ceph clusters with SUSE Enterprise Storage

Consider Cache Tiering

● Data in cache tier is replicated

● Backing tier may be slower, but more durable


Page 30: Building reliable Ceph clusters with SUSE Enterprise Storage

Durability 201

● Different strokes for different pools

● Erasure coding schemes galore


Page 31: Building reliable Ceph clusters with SUSE Enterprise Storage

Finding and correcting bad data

● Ceph “scrubbing” periodically detects inconsistent or missing object copies within placement groups

http://ceph.com/planet/ceph-manually-repair-object/

http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#scrubbing

● SUSE Enterprise Storage 5 will validate checksums on every read
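
When scrubbing does find problems, they surface in the cluster health and can usually be repaired per placement group; a hedged sketch (the list-inconsistent helper exists as of Jewel, IDs are placeholders):

    ceph health detail                      # inconsistent PGs are reported here
    rados list-inconsistent-pg <pool-name>  # list affected PGs in a pool
    ceph pg repair <pg-id>                  # ask Ceph to repair a specific PG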


Page 32: Building reliable Ceph clusters with SUSE Enterprise Storage

Automatic fault detection and recovery

● Do you want this in your cluster?

● Consider setting “noout”:

– during maintenance windows

– in small clusters
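
Setting and clearing the flag is a single command each; a minimal sketch:

    ceph osd set noout      # OSDs that go down are not marked out, so no rebalancing starts
    # ... perform the maintenance, reboot nodes, etc. ...
    ceph osd unset noout    # re-enable automatic recovery afterwards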


Page 33: Building reliable Ceph clusters with SUSE Enterprise Storage

Network considerations

● Have both the public and cluster networks bonded

● Consider different NICs

– Use last year’s NICs and switches

● One channel from each network to each switch
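
The split between the two networks is configured in ceph.conf; a sketch with placeholder subnets:

    # /etc/ceph/ceph.conf
    [global]
    public network  = 192.168.10.0/24    # client and gateway traffic
    cluster network = 192.168.20.0/24    # replication and recovery traffic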


Page 34: Building reliable Ceph clusters with SUSE Enterprise Storage

Gateway considerations

● RadosGW (S3/Swift):

– Use HTTP/TCP load balancers

– Possible to build using SLE HA with LVS or haproxy

● iSCSI targets:

– Multiple gateways, natively supported by iSCSI

● Improves availability and throughput

– Make sure you meet your performance SLAs during degraded modes
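
For the RadosGW case, a minimal haproxy sketch (addresses are placeholders; 7480 is the default civetweb port, adjust to your deployment):

    frontend rgw_http
        bind *:80
        mode http
        default_backend rgw_nodes
    backend rgw_nodes
        mode http
        balance roundrobin
        server rgw1 192.168.10.11:7480 check
        server rgw2 192.168.10.12:7480 check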


Page 35: Building reliable Ceph clusters with SUSE Enterprise Storage

Avoid configuration drift

● Ensure that systems are configured consistently

– Installed packages

– Package versions

– Configuration (NTP, logging, passwords, …)

● Avoid manual configuration

● Use Salt instead

http://ourobengr.com/2016/11/hello-salty-goodness/

https://www.suse.com/communities/blog/managing-configuration-drift-salt-snapper/
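
A few Salt one-liners illustrate the idea, assuming a Salt master with all cluster nodes as minions (the ntpd service name varies by setup):

    salt '*' test.ping              # are all nodes reachable?
    salt '*' pkg.version ceph       # same package version everywhere?
    salt '*' service.status ntpd    # time sync running on every node?
    salt '*' state.apply            # re-apply the centrally managed configuration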


Page 36: Building reliable Ceph clusters with SUSE Enterprise Storage

Trust but verify a.k.a. monitoring

● Performance as the system ages

● SSD degradation / wear leveling

● Capacity utilization

● “Free” capacity is usable for recovery

● React to issues in a timely fashion!
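
The built-in commands cover most of these data points and are easy to feed into your monitoring system; for example:

    ceph health detail    # overall state, degraded or misplaced objects
    ceph df               # raw and per-pool capacity utilization
    ceph osd df           # per-OSD utilization and imbalance
    ceph osd perf         # per-OSD commit/apply latencies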


Page 37: Building reliable Ceph clusters with SUSE Enterprise Storage

Update, always (but with care)

● Updates are good for your system

– Security

– Performance

– Stability

● Ceph remains available even while updates are being rolled out

● SUSE’s tested maintenance updates are the main product value
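
A hedged sketch of a rolling, node-by-node update cycle on SLES (the exact steps depend on the release and your deployment tooling):

    ceph osd set noout    # keep the cluster from rebalancing around the node
    zypper patch          # apply the tested maintenance updates on this node
    reboot                # or restart only the affected Ceph daemons
    ceph osd unset noout
    ceph -s               # wait for HEALTH_OK before moving to the next node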


Page 38: Building reliable Ceph clusters with SUSE Enterprise Storage

Trust nobody (not even SUSE)

● If at all possible, use a staging system

– Ideally: a (reduced) version of your production environment

– At least: a virtualized environment

● Test updates before rolling them out in production

– Not just code, but also processes!

● Long-term maintainability:

– Avoid vendor lock-in, use Open Source


Page 39: Building reliable Ceph clusters with SUSE Enterprise Storage

Disaster can (and will) strike

● Does it matter?

● If it does:

– Backups

– Replicate to other sites

● rbd-mirror, radosgw multi-site (see the sketch after this list)

● Have fire drills!
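
As one example, a hedged rbd-mirror sketch for a pool named rbd, mirroring to a second cluster referred to here as backup (an rbd-mirror daemon must be running on the receiving side):

    rbd mirror pool enable rbd pool                   # mirror every image in the pool
    rbd mirror pool peer add rbd client.admin@backup  # register the remote cluster as peer
    rbd feature enable rbd/vol1 journaling            # mirroring requires the journaling feature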


Page 40: Building reliable Ceph clusters with SUSE Enterprise Storage

Avoid complexity (KISS)

● Be aggressive in what you test

– Test all the features

● Be conservative in what you deploy

– Deploy only what you need


Page 41: Building reliable Ceph clusters with SUSE Enterprise Storage

In conclusion

Don’t panic.

SUSE’s here to help.


Page 42: Building reliable Ceph clusters with SUSE Enterprise Storage