Building reliable Ceph clusters with SUSE Enterprise Storage
Survival skills for the real world
Lars Marowsky-Brée
Distinguished Engineer
lmb@suse.com
What this talk is not
● A comprehensive introduction to Ceph
● A SUSE Enterprise Storage roadmap session
● A discussion of Ceph performance tuning
SUSE Enterprise Storage - Reprise
The Ceph project
● An Open Source Software-Defined Storage project
● Multiple front-ends
– S3/Swift object interface
– Native Linux block IO
– Heterogeneous Block IO (iSCSI)
– Native Linux network file system (CephFS)
– Heterogeneous Network File System (nfs-ganesha)
– Low-level, C++/Python/… libraries
– Linux, UNIX, Windows, Applications, Cloud, Containers
● Common, smart data store (RADOS)
– Pseudo-random, algorithmic data distribution
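As a small taste of these front-ends, a minimal sketch using the stock CLI tools (the pool name mypool is a placeholder):

    # Store and fetch an object through the RADOS object interface
    rados -p mypool put greeting ./greeting.txt
    rados -p mypool get greeting /tmp/greeting.txt
    # Create a 10 GiB RBD image (native Linux block IO) and map it locally
    rbd create mypool/vol01 --size 10240
    sudo rbd map mypool/vol01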
Software-Defined Storage
Ceph Cluster: Logical View
[Diagram: MON, MDS, and OSD daemons forming the RADOS store, fronted by iSCSI, S3/Swift, and NFS gateways]
Introducing Dependability
● Availability
● Reliability
– Durability
● Safety
● Maintainability
The elephant in the room
● Before we discuss technology ...
● … guess what causes most outages?
Improve your human factor
● Great, you are already here!
● Training
● Documentation
● Team up with a world-class support and consulting organization
High-level considerations
Advantages of Homogeneity
● Eases system administration
● Components are interchangeable
● Lower purchasing costs
● Standardized ordering process
Murphy’s Law, 2016 version
● “At scale, everything fails.”
● Distributed systems eliminate Single Points of Failure, so that individual failures do not become service failures
● Distributed systems are still vulnerable to correlated failures
Advantages of Heterogeneity
Everything is broken …
… but everything is broken differently
Homogeneity is not sustainable
● Hardware gets replaced
– Replacement with same model not available, or
– not desirable given current prices
● Software updates are not (yet) globally immediate
● Requirements change
● Your cluster ends up being heterogeneous anyway
● … you might as well benefit from it.
Failure is inevitable; suffering is optional
● If you want uptime, prepare for downtime
● Architect your system to survive a single or multiple failures
● Test whether the system meets your SLA
– while degraded and during recovery!
How much availability do you need?
● Availability and durability are not free
● Cost and complexity increase exponentially
● Scale out makes some things easier
A bag of suggestions
Embrace diversity
● Automatic recovery requires a majority (>50%)
– Splitting into multiple different categories/models
– Feasible for some components
– Multiple architectures?
– Mix them across different racks/pods
● A 50:50 split still allows manual recovery in case of catastrophic failures
– Different UPS and power circuits
Hardware choices
● SUSE offers Reference Architectures:
– e.g., Lenovo, HPE, Cisco, Dell
● Partners offer turn-key solutions
– e.g., HPE, Thomas-Krenn
● SUSE Yes certification reduces risk
– https://www.suse.com/newsroom/post/2016/suse-extends-partner-software-certification-for-cloud-and-storage-customers/
● Small variations can have a huge impact!
Not all the eggs in one basket^Wrack
● Distribute servers physically to limit the impact of power outages, spills, …
● Ceph’s CRUSH map allows you to describe the physical topology of your fault domains (engineering speak for “availability zones”)
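For example, racks can be declared as CRUSH buckets so that replicas never share a rack; a minimal sketch with placeholder rack and host names:

    # Describe the physical topology to CRUSH
    ceph osd crush add-bucket rack1 rack
    ceph osd crush add-bucket rack2 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move rack2 root=default
    ceph osd crush move node-a rack=rack1
    ceph osd crush move node-b rack=rack2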
How many MONitors do I need?
2n+1: always an odd number, so that a majority (quorum) survives n monitor failures
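Three or five MONs cover most deployments; quorum can be verified at any time:

    ceph mon stat        # one-line summary: monitors and current quorum
    ceph quorum_status   # detailed quorum view (JSON)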
To converge roles or not
● “Hyper-converged” equals correlated failures
● It does drive down cost of implementation
● Sizing becomes less deterministic
● Services might recover at the same time
● At scale, don’t co-locate the MONs and OSDs
Storage diversity
HDDs:
● Avoid desktop HDDs
● Avoid sequential serial numbers
● Mount at different angles if paranoid
● Multiple vendors
SSDs:
● Avoid desktop SSDs
● Monitor wear-leveling
● Remember the journals see all writes
Storage Node Sizing
● Node failures are the most common failure granularity
– Admin mistake, network, kernel crash
● Consider impact of outage on:
– Performance (degraded and recovery)
– and capacity!
● A single node should not be more than 10% of your total capacity
● Free capacity should be larger than your largest node
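The OSD utilization tree makes it easy to check each node’s share of raw capacity and your recovery headroom:

    ceph osd df tree   # size, use, and %USE per host and per OSD
    ceph df            # cluster-wide raw and per-pool usage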
Data availability and durability
● Replication:
– Number of copies (n)
– Linear overhead: n× raw capacity
● Erasure Coding:
– Flexible number of data (k) and coding (m) blocks
– Survives the loss of any m blocks
– Fractional overhead: (k+m)/k raw capacity
– https://www.youtube.com/watch?v=-KyGv6AZN9M
Durability: Three-way Replication
Usable capacity: 33%. Survives 2 faults.
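A minimal sketch of three-way replication on an existing pool (the pool name is a placeholder):

    ceph osd pool set mypool size 3      # three copies of every object
    ceph osd pool set mypool min_size 2  # keep serving IO after losing one copy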
Durability: 4+3 Erasure Coding
Usable capacity: 57%. Survives 3 faults.
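Creating such a 4+3 pool, sketched with placeholder profile and pool names (placement-group counts depend on your cluster):

    ceph osd erasure-code-profile set ec43 k=4 m=3
    ceph osd pool create ecpool 128 128 erasure ec43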
Consider Cache Tiering
● Data in cache tier is replicated
● Backing tier may be slower, but more durable
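Attaching a replicated cache pool in front of an erasure-coded backing pool might look like this (pool names are placeholders):

    ceph osd tier add ecpool hotpool           # hotpool caches for ecpool
    ceph osd tier cache-mode hotpool writeback
    ceph osd tier set-overlay ecpool hotpool   # clients transparently hit the cache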
Durability 201
● Different strokes for different pools
● Erasure coding schemes galore
Finding and correcting bad data
● Ceph “scrubbing” periodically detects inconsistent or missing objects in placement groups
http://ceph.com/planet/ceph-manually-repair-object/
http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#scrubbing
● SUSE Enterprise Storage 5 will validate checksums on every read
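Scrubs can also be triggered, and inconsistencies repaired, by hand; a sketch with a placeholder placement-group ID:

    ceph pg deep-scrub 2.1f   # re-read and compare all replicas in this PG
    ceph pg repair 2.1f       # repair a PG reported inconsistent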
Automatic fault detection and recovery
● Do you want this in your cluster?
● Consider setting “noout”:
– during maintenance windows
– in small clusters
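“noout” stops Ceph from marking OSDs out and rebalancing while a node is merely restarting:

    ceph osd set noout     # before the maintenance window
    # ... reboot or service the node ...
    ceph osd unset noout   # restore automatic out-marking afterwards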
Network considerations
● Have both the public and cluster network bonded
● Consider different NICs
– Use last year’s NICs and switches
● One channel from each network to each switch
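An LACP bond sketched with iproute2; on SUSE systems you would normally persist this via wicked or YaST, and the interface names are examples:

    ip link add bond0 type bond mode 802.3ad   # LACP aggregation
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0
    ip link set bond0 up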
Gateway considerations
● RadosGW (S3/Swift):
– Use HTTP/TCP load balancers
– Possible to build using SLE HA with LVS or haproxy
● iSCSI targets:
– Multiple gateways, natively supported by iSCSI
● Improves availability and throughput
– Make sure you meet your performance SLAs during degraded modes
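A load-balancer health check can be as simple as probing each RadosGW instance; 7480 is the default civetweb port, and the hostname is a placeholder:

    curl -sf http://rgw1.example.com:7480/ >/dev/null && echo "rgw1 alive"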
Avoid configuration drift
● Ensure that systems are configured consistently
– Installed packages
– Package versions
– Configuration (NTP, logging, passwords, …)
● Avoid manual configuration
● Use Salt instead
http://ourobengr.com/2016/11/hello-salty-goodness/
https://www.suse.com/communities/blog/managing-configuration-drift-salt-snapper/
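Once Salt is in place, verifying and enforcing consistency are one-liners; a sketch assuming a configured Salt master and minions:

    salt '*' pkg.version ceph   # compare installed Ceph versions across all nodes
    salt '*' state.highstate    # re-apply the desired configuration everywhere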
Trust but verify a.k.a. monitoring
● Performance as the system ages
● SSD degradation / wear leveling
● Capacity utilization
● “Free” capacity is usable for recovery
● React to issues in a timely fashion!
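A few commands worth wiring into your monitoring; the smartctl device and the wear attribute name vary by SSD vendor:

    ceph health detail                    # current warnings and errors
    ceph df                               # capacity utilization per pool
    ceph osd perf                         # per-OSD commit/apply latency
    smartctl -a /dev/sda | grep -i wear   # SSD wear-leveling attribute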
Update, always (but with care)
● Updates are good for your system
– Security
– Performance
– Stability
● Ceph remains available even while updates are being rolled out
● SUSE’s tested maintenance updates are a core part of the product’s value
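A rolling update, one node at a time, might look like this sketch; wait for HEALTH_OK before moving to the next node:

    ceph osd set noout                  # don't rebalance during the restart
    zypper patch                        # apply SUSE maintenance updates
    systemctl restart ceph-osd.target   # restart this node's OSDs
    ceph osd unset noout
    ceph -s                             # confirm HEALTH_OK before continuing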
Trust nobody (not even SUSE)
● If at all possible, use a staging system
– Ideally: a (reduced) version of your production environment
– At least: a virtualized environment
● Test updates before rolling them out in production
– Not just code, but also processes!
● Long-term maintainability:
– Avoid vendor lock-in, use Open Source
Disaster will strike
● Does it matter?
● If it does:
– Backups
– Replicate to other sites
● rbd-mirror, radosgw multi-site
● Have fire drills!
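Enabling pool-level RBD mirroring to a second site, sketched with placeholder pool, client, and cluster names (an rbd-mirror daemon must run at the backup site):

    rbd mirror pool enable mypool pool                    # mirror every image in the pool
    rbd mirror pool peer add mypool client.backup@site-b  # register the remote cluster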
Avoid complexity (KISS)
● Be aggressive in what you test
– Test all the features
● Be conservative in what you deploy
– Deploy only what you need
In conclusion
Don’t panic.
SUSE’s here to help.