Building reliable Ceph clusters with SUSE Enterprise Storage


Transcript of Building reliable Ceph clusters with SUSE Enterprise Storage

Page 1: Building reliable Ceph clusters with SUSE Enterprise Storage

Building reliable Ceph clusters with SUSE Enterprise Storage

Survival skills for the real world

Lars Marowsky-Brée, Distinguished Engineer, [email protected]

Page 2: Building reliable Ceph clusters with SUSE Enterprise Storage

What this talk is not

● A comprehensive introduction to Ceph

● SUSE Enterprise Storage roadmap session

● A discussion of Ceph performance tuning


Page 3: Building reliable Ceph clusters with SUSE Enterprise Storage

SUSE Enterprise Storage - Reprise


Page 4: Building reliable Ceph clusters with SUSE Enterprise Storage

The Ceph project

● An Open Source Software-Defined-Storage project

● Multiple front-ends

– S3/Swift object interface

– Native Linux block IO

– Heterogeneous Block IO (iSCSI)

– Native Linux network file system (CephFS)

– Heterogeneous Network File System (nfs-ganesha)

– Low-level, C++/Python/… libraries

– Linux, UNIX, Windows, Applications, Cloud, Containers

● Common, smart data store (RADOS)

– Pseudo-random, algorithmic data distribution
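
As a rough illustration that all of these front-ends share the one RADOS store, the same cluster can serve objects and block devices side by side. A minimal sketch, assuming a pool named mypool already exists (all names are placeholders):

    rados -p mypool put greeting /etc/motd    # store a file as a RADOS object
    rados -p mypool ls                        # list objects in the pool
    rbd create mypool/vol1 --size 1024        # create a 1 GiB RBD block image in the same pool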


Page 5: Building reliable Ceph clusters with SUSE Enterprise Storage

Software-Defined-Storage

Page 6: Building reliable Ceph clusters with SUSE Enterprise Storage

Ceph Cluster: Logical View


[Diagram: a RADOS cluster with 3 MONs, 2 MDSs, 6 OSDs, fronted by 3 iSCSI gateways, 2 S3/Swift gateways, and an NFS gateway]

Page 7: Building reliable Ceph clusters with SUSE Enterprise Storage

Introducing Dependability


Page 8: Building reliable Ceph clusters with SUSE Enterprise Storage

Introducing dependability

● Availability

● Reliability

– Durability

● Safety

● Maintainability


Page 9: Building reliable Ceph clusters with SUSE Enterprise Storage

The elephant in the room

● Before we discuss technology ...

● … guess what causes most outages?


Page 10: Building reliable Ceph clusters with SUSE Enterprise Storage

Improve your human factor

● Great, you are already here!

● Training

● Documentation

● Team your team up with a world-class support and consulting organization


Page 11: Building reliable Ceph clusters with SUSE Enterprise Storage

High-level considerations


Page 12: Building reliable Ceph clusters with SUSE Enterprise Storage

Advantages of Homogeneity

● Eases system administration

● Components are interchangeable

● Lower purchasing costs

● Standardized ordering process


Page 13: Building reliable Ceph clusters with SUSE Enterprise Storage

Murphy’s Law, 2016 version

● “At scale, everything fails.”

● Distributed systems protect against individual failures causing service failures by eliminating Single Points of Failure

● Distributed systems are still vulnerable to correlated failures



Page 14: Building reliable Ceph clusters with SUSE Enterprise Storage

Advantages of Heterogeneity

Everything is broken …

… but everything is broken differently


Page 15: Building reliable Ceph clusters with SUSE Enterprise Storage

Homogeneity is not sustainable

● Hardware gets replaced

– Replacement with same model not available, or

– not desirable given current prices

● Software updates are not (yet) globally immediate

● Requirements change

● Your cluster ends up being heterogeneous anyway

● … you might as well benefit from it.


Page 16: Building reliable Ceph clusters with SUSE Enterprise Storage

Failure is inevitable; suffering is optional

● If you want uptime, prepare for downtime

● Architect your system to survive single or multiple failures

● Test whether the system meets your SLA

– while degraded and during recovery!


Page 17: Building reliable Ceph clusters with SUSE Enterprise Storage

How much availability do you need?

● Availability and durability are not free

● Cost and complexity increase exponentially

● Scale out makes some things easier


Page 18: Building reliable Ceph clusters with SUSE Enterprise Storage

A bag of suggestions


Page 19: Building reliable Ceph clusters with SUSE Enterprise Storage

Embrace diversity

● Automatic recovery requires a >50% majority

– Split into multiple (ideally three or more) categories/models

– Feasible for some components

– Multiple architectures?

– Mix them across different racks/pods

● A 50:50 split still allows manual recovery in case of catastrophic failures

– Different UPS and power circuits


Page 20: Building reliable Ceph clusters with SUSE Enterprise Storage

Hardware choices

● SUSE offers Reference Architectures:

– e.g., Lenovo, HPE, Cisco, Dell

● Partners offer turn-key solutions

– e.g., HPE, Thomas-Krenn

● SUSE Yes certification reduces risk

– https://www.suse.com/newsroom/post/2016/suse-extends-partner-software-certification-for-cloud-and-storage-customers/

● Small variations can have a huge impact!


Page 21: Building reliable Ceph clusters with SUSE Enterprise Storage

Not all the eggs in one basket^Wrack

● Distribute servers physically to limit the impact of power outages, spills, …

● Ceph’s CRUSH map allows you to describe the physical topology of your fault domains (engineering speak for “availability zones”)
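
A minimal sketch of describing such a topology with the CRUSH CLI, assuming two racks and hosts named node1/node2 (all names are placeholders):

    ceph osd crush add-bucket rack1 rack     # create rack-level buckets
    ceph osd crush add-bucket rack2 rack
    ceph osd crush move rack1 root=default   # attach the racks to the default root
    ceph osd crush move rack2 root=default
    ceph osd crush move node1 rack=rack1     # place the hosts under their racks
    ceph osd crush move node2 rack=rack2
    ceph osd tree                            # verify the resulting hierarchy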


Page 22: Building reliable Ceph clusters with SUSE Enterprise Storage

How many MONitors do I need?


2n+1
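
In other words: to survive n simultaneous monitor failures you need 2n+1 monitors, so 3 MONs tolerate one failure and 5 tolerate two. A quick way to see who is currently in quorum (output format is release-dependent):

    ceph mon stat                            # short monitor summary including quorum membership
    ceph quorum_status --format json-pretty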

Page 23: Building reliable Ceph clusters with SUSE Enterprise Storage

To converge roles or not

● “Hyper converged” equals correlated failures

● It does drive down cost of implementation

● Sizing becomes less deterministic

● Services might recover at the same time

● At scale, don’t co-locate the MONs and OSDs


Page 24: Building reliable Ceph clusters with SUSE Enterprise Storage

Storage diversity


● Avoid desktop HDDs

● Avoid sequential serial numbers

● Mount at different angles if paranoid

● Multiple vendors

● Avoid desktop SSDs

● Monitor wear-leveling

● Remember the journals see all writes
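
One way to keep an eye on wear-leveling is SMART; a hedged sketch, since device names and the exact attribute names vary by vendor:

    smartctl -a /dev/sdX | grep -i -e wear -e lifetime -e reallocated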

Page 25: Building reliable Ceph clusters with SUSE Enterprise Storage

Storage Node Sizing

● Node failures are the most common failure granularity

– Admin mistake, network, kernel crash

● Consider impact of outage on:

– Performance (degraded and recovery)

– and capacity!

● A single node should not be more than 10% of your total capacity

● Free capacity should be larger than the largest node
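
The built-in utilization commands make both rules easy to check; for example, with 12 equally sized nodes each one holds roughly 8% of raw capacity, comfortably under the 10% guideline:

    ceph df        # raw and per-pool utilization: is free space larger than the largest node?
    ceph osd df    # per-OSD utilization; sum one host's OSDs to get that node's share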


Page 26: Building reliable Ceph clusters with SUSE Enterprise Storage

Data availability and durability

● Replication:

– Number of copies

– Linear overhead

● Erasure Coding:

– Flexible number of data (k) and coding (m) blocks

– Can survive a configurable number of outages

– Fractional overhead

– https://www.youtube.com/watch?v=-KyGv6AZN9M


[Graphic: (k+m)/k and 2n+1]
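
A minimal sketch of creating one pool of each kind; pool names, PG counts and the 4+3 profile are examples only:

    ceph osd pool create rep_pool 128 128 replicated   # replicated pool
    ceph osd pool set rep_pool size 3                  # keep three copies
    ceph osd erasure-code-profile set ec43 k=4 m=3     # 4 data + 3 coding chunks
    ceph osd pool create ec_pool 128 128 erasure ec43  # erasure-coded pool using that profile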

Page 27: Building reliable Ceph clusters with SUSE Enterprise Storage

Durability: Three-way Replication


Usable capacity: 33%

Durability: 2 faults

Page 28: Building reliable Ceph clusters with SUSE Enterprise Storage

Durability: 4+3 Erasure Coding


Usable capacity: 57%

Durability: 3 faults
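
A quick check of the percentages on these two durability slides, with k data and m coding chunks:

    \[ \frac{k}{k+m} = \frac{4}{4+3} = \frac{4}{7} \approx 57\% \text{ usable, tolerating } m = 3 \text{ faults} \]
    \[ \frac{1}{1+2} = \frac{1}{3} \approx 33\% \text{ usable with 3-way replication, tolerating } 2 \text{ faults} \]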

Page 29: Building reliable Ceph clusters with SUSE Enterprise Storage

Consider Cache Tiering

● Data in cache tier is replicated

● Backing tier may be slower, but more durable


Page 30: Building reliable Ceph clusters with SUSE Enterprise Storage

Durability 201

● Different strokes for different pools

● Erasure coding schemes galore


Page 31: Building reliable Ceph clusters with SUSE Enterprise Storage

Finding and correcting bad data

● Ceph “scrubbing” periodically detects inconsistent or missing object copies within placement groups

http://ceph.com/planet/ceph-manually-repair-object/

http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#scrubbing

● SUSE Enterprise Storage 5 will validate checksums on every read
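
When scrubbing does find problems, they surface in the cluster health and can usually be repaired per placement group; a hedged sketch (the list-inconsistent helper exists as of Jewel, IDs are placeholders):

    ceph health detail                      # inconsistent PGs are reported here
    rados list-inconsistent-pg <pool-name>  # list affected PGs in a pool
    ceph pg repair <pg-id>                  # ask Ceph to repair a specific PG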


Page 32: Building reliable Ceph clusters with SUSE Enterprise Storage

Automatic fault detection and recovery

● Do you want this in your cluster?

● Consider setting “noout”:

– during maintenance windows

– in small clusters
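
Setting and clearing the flag is a single command each; a minimal sketch:

    ceph osd set noout      # OSDs that go down are not marked out, so no rebalancing starts
    # ... perform the maintenance, reboot nodes, etc. ...
    ceph osd unset noout    # re-enable automatic recovery afterwards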


Page 33: Building reliable Ceph clusters with SUSE Enterprise Storage

Network considerations

● Have both the public and cluster networks bonded

● Consider different NICs

– Use last year’s NICs and switches

● One channel from each network to each switch
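
The split between the two networks is configured in ceph.conf; a sketch with placeholder subnets:

    # /etc/ceph/ceph.conf
    [global]
    public network  = 192.168.10.0/24    # client and gateway traffic
    cluster network = 192.168.20.0/24    # replication and recovery traffic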


Page 34: Building reliable Ceph clusters with SUSE Enterprise Storage

Gateway considerations

● RadosGW (S3/Swift):

– Use HTTP/TCP load balancers

– Possible to build using SLE HA with LVS or haproxy

● iSCSI targets:

– Multiple gateways, natively supported by iSCSI

● Improves availability and throughput

– Make sure you meet your performance SLAs during degraded modes
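
For the RadosGW case, a minimal haproxy sketch (addresses are placeholders; 7480 is the default civetweb port, adjust to your deployment):

    frontend rgw_http
        bind *:80
        mode http
        default_backend rgw_nodes
    backend rgw_nodes
        mode http
        balance roundrobin
        server rgw1 192.168.10.11:7480 check
        server rgw2 192.168.10.12:7480 check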


Page 35: Building reliable Ceph clusters with SUSE Enterprise Storage

Avoid configuration drift

● Ensure that systems are configured consistently

– Installed packages

– Package versions

– Configuration (NTP, logging, passwords, …)

● Avoid manual configuration

● Use Salt instead

http://ourobengr.com/2016/11/hello-salty-goodness/

https://www.suse.com/communities/blog/managing-configuration-drift-salt-snapper/
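
A few Salt one-liners illustrate the idea, assuming a Salt master with all cluster nodes as minions (the ntpd service name varies by setup):

    salt '*' test.ping              # are all nodes reachable?
    salt '*' pkg.version ceph       # same package version everywhere?
    salt '*' service.status ntpd    # time sync running on every node?
    salt '*' state.apply            # re-apply the centrally managed configuration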


Page 36: Building reliable Ceph clusters with SUSE Enterprise Storage

Trust but verify a.k.a. monitoring

● Performance as the system ages

● SSD degradation / wear leveling

● Capacity utilization

● “Free” capacity is usable for recovery

● React to issues in a timely fashion!
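
The built-in commands cover most of these data points and are easy to feed into your monitoring system; for example:

    ceph health detail    # overall state, degraded or misplaced objects
    ceph df               # raw and per-pool capacity utilization
    ceph osd df           # per-OSD utilization and imbalance
    ceph osd perf         # per-OSD commit/apply latencies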


Page 37: Building reliable Ceph clusters with SUSE Enterprise Storage

Update, always (but with care)

● Updates are good for your system

– Security

– Performance

– Stability

● Ceph remains available even while updates are being rolled out

● SUSE’s tested maintenance updates are the main product value
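
A hedged sketch of a rolling, node-by-node update cycle on SLES (the exact steps depend on the release and your deployment tooling):

    ceph osd set noout    # keep the cluster from rebalancing around the node
    zypper patch          # apply the tested maintenance updates on this node
    reboot                # or restart only the affected Ceph daemons
    ceph osd unset noout
    ceph -s               # wait for HEALTH_OK before moving to the next node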


Page 38: Building reliable Ceph clusters with SUSE Enterprise Storage

Trust nobody (not even SUSE)

● If at all possible, use a staging system

– Ideally: a (reduced) version of your production environment

– At least: a virtualized environment

● Test updates before rolling them out in production

– Not just code, but also processes!

● Long-term maintainability:

– Avoid vendor lock-in, use Open Source


Page 39: Building reliable Ceph clusters with SUSE Enterprise Storage

Disaster can (and will) strike

● Does it matter?

● If it does:

– Backups

– Replicate to other sites

● rbd-mirror, radosgw multi-site (see the sketch after this list)

● Have fire drills!
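
As one example, a hedged rbd-mirror sketch for a pool named rbd, mirroring to a second cluster referred to here as backup (an rbd-mirror daemon must be running on the receiving side):

    rbd mirror pool enable rbd pool                   # mirror every image in the pool
    rbd mirror pool peer add rbd client.admin@backup  # register the remote cluster as peer
    rbd feature enable rbd/vol1 journaling            # mirroring requires the journaling feature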


Page 40: Building reliable Ceph clusters with SUSE Enterprise Storage

Avoid complexity (KISS)

● Be aggressive in what you test

– Test all the features

● Be conservative in what you deploy

– Deploy only what you need


Page 41: Building reliable Ceph clusters with SUSE Enterprise Storage

In conclusion

Don’t panic.

SUSE’s here to help.


Page 42: Building reliable Ceph clusters with SUSE Enterprise Storage